Good afternoon,
Does anyone know of an "out-of-the-box" implementation of Levenshtein DFA (deterministic finite automata) in .NET (or easily translatable to it)? I have a very big dictionary with more than 160000 different words, and I want to, given an inicial word w, find all known words at Levenshtein distance at most 2 of w in an efficient way.
Of course, having a function which computes all possible edits at edit distance one of a given word and applying it again to each of these edits solves the problem (and in a pretty straightforwad way). The problem is effiency --- given a 7 letter word, this can already take over 1 second to complete, and I need something much more efficient --- if possible, as it is with Levenshtein DFAs, a solution that takes O(|w|) steps.
Edit: I know I can construct my own approach to the problem with a little bit of studying, but at the moment I can't afford reading Schulz and Mihov's 60-page-long articles.
Thank you very much.
We implemented this for apache lucene java, perhaps you could convert it to C# and save yourself time.
the main class is here: its just a builder to get Levenshtein DFAs from a string, using the Schulz and Mihov algorithm.
http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/src/java/org/apache/lucene/util/automaton/LevenshteinAutomata.java
the parametric descriptions (the precomputed tables) for Lev1 and Lev2 are here: http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/src/java/org/apache/lucene/util/automaton/Lev1ParametricDescription.java
http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/src/java/org/apache/lucene/util/automaton/Lev2ParametricDescription.java
you might notice these are generated with a computer, we generated them with this script, using Jean-Phillipe Barrette's great moman implementation (python) http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/src/java/org/apache/lucene/util/automaton/createLevAutomata.py
we generate the parametric descriptions as packed long[] arrays so that it won't make our jar file too large.
just modify the toAutomaton(int n) to fit your needs/DFA package. in our case we are using a modified form of the brics automaton package, where transitions are represented as unicode codepoint ranges.
efficient unit tests are difficult for this sort of thing, but here is what we came up with... it seems to be thorough and even found a bug (which was fixed immediately by the author!) in the moman implementation.
http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/src/test/org/apache/lucene/util/automaton/TestLevenshteinAutomata.java
Here you go.
/// <summary>
/// Levenshtein Distance Calculator
/// </summary>
public static int DistanceFrom(this string s, string t)
{
    int n = s.Length;
    int m = t.Length;
    int[,] d = new int[n + 1, m + 1];
    // Step 1
    if (n == 0)
        return m;
    if (m == 0)
        return n;
    // Step 2
    for(int i = 0; i <= n; d[i, 0] = i++) ;
    for(int j = 0; j <= m; d[0, j] = j++) ;
    // Step 3
    for (int i = 1; i <= n; i++)
    {
        //Step 4
        for (int j = 1; j <= m; j++)
        {
            // Step 5
            int cost = (t[j - 1] == s[i - 1]) ? 0 : 1;
            // Step 6
            d[i, j] = Math.Min(
                Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                d[i - 1, j - 1] + cost);
        }
    }
    // Step 7
    return d[n, m];
}
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With