doc.txt

levenshtein.scm: Levenshtein Distance Metric in Scheme

levenshtein.scm: Levenshtein Distance Metric in Scheme
******************************************************

Version 0.4, 2005-07-10, `http://www.neilvandyke.org/levenshtein-scm/'

by Neil W. Van Dyke <neil@neilvandyke.org>

     Copyright (C) 2004 - 2005 Neil W. Van Dyke.  This program is Free
     Software; you can redistribute it and/or modify it under the terms
     of the GNU Lesser General Public License as published by the Free
     Software Foundation; either version 2.1 of the License, or (at
     your option) any later version.  This program is distributed in
     the hope that it will be useful, but without any warranty; without
     even the implied warranty of merchantability or fitness for a
     particular purpose.  See <http://www.gnu.org/copyleft/lesser.html>
     for details.  For other license options and consulting, contact
     the author.

Introduction
************

This is a Scheme implementation of the "Levenshtein Distance"
algorithm, which is an "edit distance" metric of string similarity, due
to Vladimir Levenshtein.  The Levenshtein Distance is a function of two
strings that represents a count of single-character insertions,
deletions, and substitions that will change the first string to the
second.  More information is available in NIST DADS
(http://www.nist.gov/dads/HTML/Levenshtein.html) and the Michael
Gilleland article, "Levenshtein Distance in Three Flavors
(http://www.merriampark.com/ld.htm)."

   This implementation is modeled after a space-efficient Perl
implementation (http://www.mgilleland.com/ld/ldperl2.htm) by Jorge Mas
Trullenque.  It has been written in R5RS Scheme (with an `error'
procedure such as the one in SRFI-23), and extended to support
heterogeneous combinations of Scheme types (strings, lists, vectors),
user-supplied predicate functions, and optionally reusable scratch
vectors.

Basic Comparisons
*****************

In the current implementation, all comparisons are done internally via
vectors.

> (vector-levenshtein/predicate/get-scratch a b pred)
          get-scratch
     Few, if any, programs will use this procedure directly.  This is
     like `vector-levenshtein/predicate', but allows GET-SCRATCH to be
     specified.  GET-SCRATCH is a procedure of one term, n, that yields
     a vector of length n or greater, which is used for record-keeping
     during execution of the Levenshtein algorithm.  `make-vector' can
     be used for GET-SCRATCH, although some programs comparing a large
     size or quantity of vectors may wish to reuse a record-keeping
     vector, rather than each time allocating a new one that will need
     to be garbage-collected.

> (vector-levenshtein/predicate a b pred)
> (vector-levenshtein/eq a b)
> (vector-levenshtein/eqv a b)
> (vector-levenshtein/equal a b)
> (vector-levenshtein a b)
     Calculate the Levenshtein Distance of vectors A and B.  PRED is
     the predicate procedure for determining if two elements are equal.
     The `/eq', `/eqv', and `/equal' variants correspond to the
     standard equivalence predicates, `eq?', `eqv?', and `equal?'.
     `vector-levenshtein' is an alias for `vector-levenshtein/equal'.

          (vector-levenshtein '#(6 6 6) '#(6 35 6 24 6 32)) => 3

> (list-levenshtein/predicate a b pred)
> (list-levenshtein/eq a b)
> (list-levenshtein/eqv a b)
> (list-levenshtein/equal a b)
> (list-levenshtein a b)
     Calculate the Levenshtein Distance of lists A and B.  PRED is the
     predicate procedure for determining if two elements are equal.
     The `/eq', `/eqv', and `/equal' variants correspond to the
     standard equivalence predicates, `eq?', `eqv?', and `equal?'.
     `list-levenshtein' is an alias for `list-levenshtein/equal'.  Note
     that comparison of lists is less efficient than comparison of
     vectors.

          (list-levenshtein/eq '(b c e x f y) '(a b c d e f)) => 4

> (string-levenshtein a b)
     Calculate the Levenshtein Distance of strings A and B.

          (string-levenshtein "adresse" "address") => 2

Type-Coercing Comparisons
*************************

Procedures `levenshtein' and `levenshtein/predicate' provide a
convenient interface for comparing a combination of vectors, lists, and
strings, the types of which might not be known until runtime.

> (levenshtein/predicate a b pred)
     Calculates the Levenshtein Distance of two objects A and B, which
     are vectors, lists, or strings.  A and B need not be of the same
     type.  PRED is the element equivalence predicate used.

          (levenshtein/predicate '#(#\A #\B #\C #\D)
                                 "aBXcD"
                                 char-ci=?)
          => 1

> (levenshtein a b)
     Calculate the levenshtein distance of A and B, in a similar manner
     as using `levenshtein/predicate' with `equal?' as the predicate.

          (define g '#(#\g #\u #\m #\b #\o))

          (levenshtein g "gambol")  => 2
          (levenshtein g "dumbo")   => 1
          (levenshtein g "umbrage") => 5

Tests
*****

The `levenshtein.scm' test suite can be enabled by editing the source
code file and loading Testeez (http://www.neilvandyke.org/testeez/).

History
*******

Version 0.4 -- 2005-07-10
     Added Testeez tests.

Version 0.3 -- 2005-07-09
     PLaneT release, and minor documentation changes.

Version 0.2 -- 2004-07-06
     Documentation changes.

Version 0.1 -- 2004-05-13
     First release.  Tested only lightly, and today _is_ the 13th, so
     caveat emptor.