NEC Research Institute Technical Report | 1997 |
Abstract: A highly-efficient ANSI-C facility is described for intelligently comparing a query string with a series of database strings. The bipartite weighted matching approach taken tolerates ordering violations that are problematic for simple automaton or string edit distance methods---yet common in practice. The method is character and polygraph based and does not require that words are properly formed in a query. Database characters are processed at a rate of approximately 2.5 million per second using a 200MHz Pentium Pro processor. A subroutine-level API is described along with an simple executable utility supporting both command-line and Web interfaces. An optimized Web interface is also reported consisting of a daemon that preloads multiple databases, and a corresponding CGI stub. The daemon may be initiated manually or via inetd.
Keywords: String Comparison/Similarity, Text/Database Search/Retrieval, Bipartite Matching/Assignment, Edit Distance.