levenshtein.scm: Levenshtein Distance Metric in Scheme

Version 0.3, 2005-07-09, http://www.neilvandyke.org/levenshtein-scm/

by Neil W. Van Dyke <neil@neilvandyke.org>

Copyright © 2004 - 2005 Neil W. Van Dyke. This program is Free Software; you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation; either version 2.1 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose. See <http://www.gnu.org/copyleft/lesser.html> for details. For other license options and consulting, contact the author.

Introduction

This is a Scheme implementation of the Levenshtein Distance algorithm, which is an edit distance metric of string similarity, due to Vladimir Levenshtein. The Levenshtein Distance is a function of two strings that represents a count of single-character insertions, deletions, and substitions that will change the first string to the second. More information is available in NIST DADS and Michael Gilleland article, “in Three Flavors.”

This implementation is modeled after a space-efficient Perl implementation by Jorge Mas Trullenque. It has been written in R5RS Scheme (with an error procedure such as the one in SRFI-23), and extended to support heterogeneous combinations of Scheme types (strings, lists, vectors), user-supplied predicate functions, and optionally reusable scratch vectors.

This version 0.1 is an early release that has been tested only lightly.

Basic Comparisons

In the current implementation, all comparisons are done internally via vectors.

— Procedure: vector-levenshtein/predicate/get-scratch a b pred get-scratch

Few, if any, programs will use this procedure directly. This is like vector-levenshtein/predicate, but allows get-scratch to be specified. get-scratch is a procedure of one term, n, that yields a vector of length n or greater, which is used for record-keeping during execution of the Levenshtein algorithm. make-vector can be used for get-scratch, although some programs comparing a large size or quantity of vectors may wish to reuse a record-keeping vector, rather than each time allocating a new one that will need to be garbage-collected.

— Procedure: vector-levenshtein/predicate a b pred
— Procedure: vector-levenshtein/eq a b
— Procedure: vector-levenshtein/eqv a b
— Procedure: vector-levenshtein/equal a b
— Procedure: vector-levenshtein a b

Calculate the Levenshtein Distance of vectors a and b. pred is the predicate procedure for determining if two elements are equal. The /eq, /eqv, and /equal variants correspond to the standard equivalence predicates, eq?, eqv?, and equal?. vector-levenshtein is an alias for vector-levenshtein/equal.

          (vector-levenshtein '#(6 6 6) '#(6 35 6 24 6 32)) => 3
     
— Procedure: list-levenshtein/predicate a b pred
— Procedure: list-levenshtein/eq a b
— Procedure: list-levenshtein/eqv a b
— Procedure: list-levenshtein/equal a b
— Procedure: list-levenshtein a b

Calculate the Levenshtein Distance of lists a and b. pred is the predicate procedure for determining if two elements are equal. The /eq, /eqv, and /equal variants correspond to the standard equivalence predicates, eq?, eqv?, and equal?. list-levenshtein is an alias for list-levenshtein/equal. Note that comparison of lists is less efficient than comparison of vectors.

          (list-levenshtein/eq '(b c e x f y) '(a b c d e f)) => 4
     
— Procedure: string-levenshtein a b

Calculate the Levenshtein Distance of strings a and b.

          (string-levenshtein "adresse" "address") => 2
     

Type-Coercing Comparisons

Procedures levenshtein and levenshtein/predicate provide a convenient interface for comparing a combination of vectors, lists, and strings, the types of which might not be known until runtime.

— Procedure: levenshtein/predicate a b pred

Calculates the Levenshtein Distance of two objects a and b, which are vectors, lists, or strings. a and b need not be of the same type. pred is the element equivalence predicate used.

          (levenshtein/predicate '#(#\A #\B #\C #\D)
                                 "aBXcD"
                                 char-ci=?)
          => 1
     
— Procedure: levenshtein a b

Calculate the levenshtein distance of a and b, in a similar manner as using levenshtein/predicate with equal? as the predicate.

          (define g '#(#\g #\u #\m #\b #\o))
          
          (levenshtein g "gambol")  => 2
          (levenshtein g "dumbo")   => 1
          (levenshtein g "umbrage") => 5
     

History

Version 0.3 — 2005-07-09
PLaneT release, and minor documentation changes.
Version 0.2 — 2004-07-06
Documentation changes.
Version 0.1 — 2004-05-13
First release. Tested only lightly, and today is the 13th, so caveat emptor.