Patchwork [6,of,9,sethelp] levenshtein: import public/CC0 implementation of pure-py levenshtein distance

login
register
mail settings
Submitter Augie Fackler
Date Feb. 4, 2015, 6:15 p.m.
Message ID <4b2d400e8c7daa1746de.1423073711@arthedain.pit.corp.google.com>
Download mbox | patch
Permalink /patch/7665/
State Changes Requested
Headers show

Comments

Augie Fackler - Feb. 4, 2015, 6:15 p.m.
# HG changeset patch
# User Augie Fackler <augie@google.com>
# Date 1422304840 18000
#      Mon Jan 26 15:40:40 2015 -0500
# Node ID 4b2d400e8c7daa1746de8516f70a3d468bc9c544
# Parent  232bd67d5c1350c85e9f787e1f3be62a02bb7872
levenshtein: import public/CC0 implementation of pure-py levenshtein distance

Imported from http://hetland.org/coding/python/levenshtein.py. There
are other implementations on pypi, but this one is nicely simple and
unencumbered.
Matt Mackall - Feb. 4, 2015, 10:18 p.m.
On Wed, 2015-02-04 at 13:15 -0500, Augie Fackler wrote:
> # HG changeset patch
> # User Augie Fackler <augie@google.com>
> # Date 1422304840 18000
> #      Mon Jan 26 15:40:40 2015 -0500
> # Node ID 4b2d400e8c7daa1746de8516f70a3d468bc9c544
> # Parent  232bd67d5c1350c85e9f787e1f3be62a02bb7872
> levenshtein: import public/CC0 implementation of pure-py levenshtein distance
> 
> Imported from http://hetland.org/coding/python/levenshtein.py. There
> are other implementations on pypi, but this one is nicely simple and
> unencumbered.

Stock difflib is good enough?

$ python
Python 2.7.5+ (default, Sep 17 2013, 15:31:50) 
[GCC 4.8.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import difflib
>>> a = difflib.SequenceMatcher(None, "blaze", "belize")
>>> a.ratio()
0.7272727272727273

Patch

diff --git a/mercurial/levenshtein.py b/mercurial/levenshtein.py
new file mode 100644
--- /dev/null
+++ b/mercurial/levenshtein.py
@@ -0,0 +1,34 @@ 
+#!/usr/bin/env python
+
+# This is a straightforward implementation of a well-known algorithm, and thus
+# probably shouldn't be covered by copyright to begin with. But in case it is,
+# the author (Magnus Lie Hetland) has, to the extent possible under law,
+# dedicated all copyright and related and neighboring rights to this software
+# to the public domain worldwide, by distributing it under the CC0 license,
+# version 1.0. This software is distributed without any warranty. For more
+# information, see <http://creativecommons.org/publicdomain/zero/1.0>
+
+
+def levenshtein(a,b):
+    "Calculates the Levenshtein distance between a and b."
+    n, m = len(a), len(b)
+    if n > m:
+        # Make sure n <= m, to use O(min(n,m)) space
+        a,b = b,a
+        n,m = m,n
+        
+    current = range(n+1)
+    for i in range(1,m+1):
+        previous, current = current, [i]+[0]*n
+        for j in range(1,n+1):
+            add, delete = previous[j]+1, current[j-1]+1
+            change = previous[j-1]
+            if a[j-1] != b[i-1]:
+                change = change + 1
+            current[j] = min(add, delete, change)
+            
+    return current[n]
+
+if __name__=="__main__":
+    from sys import argv
+    print levenshtein(argv[1],argv[2])