levenshtein.scm: Levenshtein Distance Metric in Scheme ****************************************************** Version 0.4, 2005-07-10, `http://www.neilvandyke.org/levenshtein-scm/' by Neil W. Van Dyke <neil@neilvandyke.org> Copyright (C) 2004 - 2005 Neil W. Van Dyke. This program is Free Software; you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation; either version 2.1 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose. See <http://www.gnu.org/copyleft/lesser.html> for details. For other license options and consulting, contact the author. Introduction ************ This is a Scheme implementation of the "Levenshtein Distance" algorithm, which is an "edit distance" metric of string similarity, due to Vladimir Levenshtein. The Levenshtein Distance is a function of two strings that represents a count of single-character insertions, deletions, and substitions that will change the first string to the second. More information is available in NIST DADS (http://www.nist.gov/dads/HTML/Levenshtein.html) and the Michael Gilleland article, "Levenshtein Distance in Three Flavors (http://www.merriampark.com/ld.htm)." This implementation is modeled after a space-efficient Perl implementation (http://www.mgilleland.com/ld/ldperl2.htm) by Jorge Mas Trullenque. It has been written in R5RS Scheme (with an `error' procedure such as the one in SRFI-23), and extended to support heterogeneous combinations of Scheme types (strings, lists, vectors), user-supplied predicate functions, and optionally reusable scratch vectors. Basic Comparisons ***************** In the current implementation, all comparisons are done internally via vectors. > (vector-levenshtein/predicate/get-scratch a b pred) get-scratch Few, if any, programs will use this procedure directly. This is like `vector-levenshtein/predicate', but allows GET-SCRATCH to be specified. GET-SCRATCH is a procedure of one term, n, that yields a vector of length n or greater, which is used for record-keeping during execution of the Levenshtein algorithm. `make-vector' can be used for GET-SCRATCH, although some programs comparing a large size or quantity of vectors may wish to reuse a record-keeping vector, rather than each time allocating a new one that will need to be garbage-collected. > (vector-levenshtein/predicate a b pred) > (vector-levenshtein/eq a b) > (vector-levenshtein/eqv a b) > (vector-levenshtein/equal a b) > (vector-levenshtein a b) Calculate the Levenshtein Distance of vectors A and B. PRED is the predicate procedure for determining if two elements are equal. The `/eq', `/eqv', and `/equal' variants correspond to the standard equivalence predicates, `eq?', `eqv?', and `equal?'. `vector-levenshtein' is an alias for `vector-levenshtein/equal'. (vector-levenshtein '#(6 6 6) '#(6 35 6 24 6 32)) => 3 > (list-levenshtein/predicate a b pred) > (list-levenshtein/eq a b) > (list-levenshtein/eqv a b) > (list-levenshtein/equal a b) > (list-levenshtein a b) Calculate the Levenshtein Distance of lists A and B. PRED is the predicate procedure for determining if two elements are equal. The `/eq', `/eqv', and `/equal' variants correspond to the standard equivalence predicates, `eq?', `eqv?', and `equal?'. `list-levenshtein' is an alias for `list-levenshtein/equal'. Note that comparison of lists is less efficient than comparison of vectors. (list-levenshtein/eq '(b c e x f y) '(a b c d e f)) => 4 > (string-levenshtein a b) Calculate the Levenshtein Distance of strings A and B. (string-levenshtein "adresse" "address") => 2 Type-Coercing Comparisons ************************* Procedures `levenshtein' and `levenshtein/predicate' provide a convenient interface for comparing a combination of vectors, lists, and strings, the types of which might not be known until runtime. > (levenshtein/predicate a b pred) Calculates the Levenshtein Distance of two objects A and B, which are vectors, lists, or strings. A and B need not be of the same type. PRED is the element equivalence predicate used. (levenshtein/predicate '#(#\A #\B #\C #\D) "aBXcD" char-ci=?) => 1 > (levenshtein a b) Calculate the levenshtein distance of A and B, in a similar manner as using `levenshtein/predicate' with `equal?' as the predicate. (define g '#(#\g #\u #\m #\b #\o)) (levenshtein g "gambol") => 2 (levenshtein g "dumbo") => 1 (levenshtein g "umbrage") => 5 Tests ***** The `levenshtein.scm' test suite can be enabled by editing the source code file and loading Testeez (http://www.neilvandyke.org/testeez/). History ******* Version 0.4 -- 2005-07-10 Added Testeez tests. Version 0.3 -- 2005-07-09 PLaneT release, and minor documentation changes. Version 0.2 -- 2004-07-06 Documentation changes. Version 0.1 -- 2004-05-13 First release. Tested only lightly, and today _is_ the 13th, so caveat emptor.