|NEC Research Institute Technical Report||1995|
Abstract: Approximate string comparison and search is an important part of applications that range from natural language to the interpretation of DNA. This paper presents a bipartite weighted graph matching approach to these problems, based on the authors' linear time matching algorithms (SODA94 pp. 65-76). Our approach's tolerance to permutation of symbols or blocks, distinguishes it from the widely used edit distance and finite state machine methods. A close relationship with the earlier related proximity comparison method is established.
Under the linear cost model, a simple O(1) time per position online algorithm is presented for comparing two strings given a fixed alignment. Heuristics are given for optimal alignment. In the approximate string search problem, one string advances in a fixed direction relative to the other with each time step. We introduce a new online algorithm for this setting which dynamically maintains an optimal bipartite weighted matching.
We discuss the application of our algorithms to natural language text search, including prefilters to improve efficiency, and the use of polygraphic symbols to improve search quality. Our approach is used in the LikeIt text search utility now under development. Its overall design and objectives are summarized.
Keywords: Approximate String Comparison, Approximate String Search, Text Search, Sequence Comparison, Bipartite Quasi-Convex Matching, Distance Metric, Natural Language Processing.