The Levenshtein distance as a standardized measure of similarity

July 12, 2017 03:21pm
by Thomas Tran



In this article, we'll explore the Levenshtein distance and learn how to use it to find similarity between two strings. This exploration will start with the theory behind this similarity metric, why it is important, and how it is used contemporaneously. It will then provide a quick (bare-to-the-bone) example of how to implement it using Python. External libraries will also be suggested.


The Theory

The Levenshtein distance measures similarities between two different strings, and it can even be generalized to any sequence of data with a measurable atomic unit. It even works for sequences of possibly different length. It has been applied to diverse fields, such as linguistics, speech recognition, spell-checking, and DNA analysis. Whenever you use a program using some sort of spell-checking or error correction, the programmers have most likely used the Levenshtein distance.

The formal definition of the metric is as follows:

Given two strings x and y, the Levenshtein distance is the minimum number of insertions, deletions, and substitutions of single atomic unit of data that are necessary in order to transform x into y. The most similar sequences are the ones with the smallest distance while the least similar sequences have the highest distance.

We can implement this algorithm using a matrix (i.e. a two-dimentional array), and then calculate the distance between each character in the first string with each character in the second. However, there are some notes that we should be aware of:

  • 1. The distance is zero if and only if the strings are equal.
  • 2. The distance is at most the length of the longer string.
  • 3. The distance is at least the difference of the sizes of the two strings.
    • Algorithm complexity is O(m*n) for the average case, where m is the length of the first string and n is the length of the second string.


      Implementation with Python

      The following code describes our implementation of the Levenshtein distance using Python. You are free to copy-and-paste as you wish:

      import codecs
      
      def lev_dist(source, target):
          if source == target:
              return 0
      
      
          # prepare matrix
          slen, tlen = len(source), len(target)
          dist = [[0 for i in range(tlen+1)] for x in range(slen+1)]
          for i in xrange(slen+1):
              dist[i][0] = i
          for j in xrange(tlen+1):
              dist[0][j] = j
      
          # count distances
          for i in xrange(slen):
              for j in xrange(tlen):
                  cost = 0 if source[i] == target[j] else 1
                  dist[i+1][j+1] = min(
                                  dist[i][j+1] + 1,   # deletion
                                  dist[i+1][j] + 1,   # insertion
                                  dist[i][j] + cost   # substitution
                              )
          return dist[-1][-1]
      

      There are other libraries which specifically calculate this distance. Some examples include:

      Tools like Elasticsearch use the Levenshtein distance to fuzzy search as well as to find similar results. Other metrics of similarity, such as cosine similarity, also provide similar standardized methods of finding similarity. In a future article, we'll delve into the world of statistics to find out how to use cosine similarity to create a content-based recommendation engine!