Computing Levenshtein distance with word lists

Question

9

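First off, I'm new to Python. I'm trying to compute the Levenshtein distance for many lists of words. So far I've managed to write the code for a pair of words, but I'm having trouble handling lists. I have just two lists, each with one word per line, like this:

carlos
stiv
peter

I want to use the Levenshtein distance for a similarity analysis. Can someone tell me how to load the lists and then use the function to compute the distance?

Thanks a lot!

Here is my code for just two strings: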

#!/usr/bin/env python
# -*- coding=utf-8 -*-

def lev_dist(source, target):
    if source == target:
        return 0

    # words = open('test_file.txt', 'r').read().split()

    # Prepare matrix
    slen, tlen = len(source), len(target)
    dist = [[0] * (tlen + 1) for _ in range(slen + 1)]
    for i in range(slen + 1):
        dist[i][0] = i  # cost of deleting the first i characters of source
    for j in range(tlen + 1):
        dist[0][j] = j  # cost of inserting the first j characters of target

    # Counting distance
    for i in range(slen):
        for j in range(tlen):
            cost = 0 if source[i] == target[j] else 1
            dist[i+1][j+1] = min(
                            dist[i][j+1] + 1,   # deletion
                            dist[i+1][j] + 1,   # insertion
                            dist[i][j] + cost   # substitution
                        )
    return dist[-1][-1]

if __name__ == '__main__':
    import sys
    if len(sys.argv) != 3:
        print('Usage: You have to enter a source_word and a target_word')
        sys.exit(-1)
    source, target = sys.argv[1], sys.argv[2]
    print(lev_dist(source, target))

- El_Patrón

1

What are you trying to do? Compute the distance for every pair of elements in the list? - Fred Foo

1

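Step 1. Add code to read your list (or is it two lists?). Step 2. Add a loop to iterate over your list (or is it two lists?). Step 3. Post the new code so we can comment on it. The code you posted is fine, but you still need to write the next two parts. - S.Lott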

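Thanks for the quick answers. Larsmans: I want to compute the distance from each word in the first list to each word in the second list. S.Lott: there are two lists! - El_Patrón

Any ideas on how to do this with two lists? - El_Patrón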

2 Answers

5

Don't reinvent the wheel:

http://pypi.python.org/pypi/python-Levenshtein/

- user2665694

8

That's sometimes good advice, but reinventing the wheel is also the best way to learn how wheels work. - grifaton

2

Yes, I know that module, but I want to write it myself in Python to learn! - El_Patrón

- El_Patrón · Accepted Answer

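I finally got the code working with the help of a friend :) You can compute the Levenshtein distance against each word from the second list: the lev_dist(list1[0], list2[i]) call on the last line of the script compares the first word of list1 with each word in list2.

Thanks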

#!/usr/bin/env python
# -*- coding=utf-8 -*-

import codecs

def lev_dist(source, target):
    if source == target:
        return 0


    # Prepare a matrix
    slen, tlen = len(source), len(target)
    dist = [[0 for i in range(tlen+1)] for x in range(slen+1)]
    for i in range(slen+1):
        dist[i][0] = i
    for j in range(tlen+1):
        dist[0][j] = j

    # Counting distance, here is my function
    for i in range(slen):
        for j in range(tlen):
            cost = 0 if source[i] == target[j] else 1
            dist[i+1][j+1] = min(
                            dist[i][j+1] + 1,   # deletion
                            dist[i+1][j] + 1,   # insertion
                            dist[i][j] + cost   # substitution
                        )
    return dist[-1][-1]

# load words from a file into a list
def loadWords(path):
    words = []  # create an empty list to hold the file contents
    # open the file; the with block closes it again when the loop is done
    with codecs.open(path, "r", "utf-8") as file_contents:
        for line in file_contents:  # loop over the lines in the file
            line = line.strip()  # strip the line breaks and any extra spaces
            words.append(line)  # append the word to the list
    return words

if __name__ == '__main__':
    import sys
    if len(sys.argv) != 3:
        print('Usage: You have to enter a source_file and a target_file')
        sys.exit(-1)
    source, target = sys.argv[1], sys.argv[2]

    # create two lists, one of each file by calling the loadWords() function on the file
    list1 = loadWords(source)
    list2 = loadWords(target)

    # now you have two lists of words, one per input file
    # now call your lev_dist function on the first word of list1 and each word of list2

    for i in range(len(list2)):  # loop over the indices of list2, the list being indexed
        print(lev_dist(list1[0], list2[i]))


#    print lev_dist(source, target)
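
The comments above ask for the distance from every word in the first list to every word in the second, while the script only compares the first word of list1. Here is a minimal sketch of one way to cover all combinations, assuming Python 3. itertools.product yields each (word1, word2) combination, and lev_dist is repeated only so the snippet runs on its own. The all_pairs helper and the words in list2 are made up for illustration and are not part of the original answer.

```python
from itertools import product


def lev_dist(source, target):
    """Levenshtein distance, same dynamic-programming scheme as above."""
    if source == target:
        return 0
    slen, tlen = len(source), len(target)
    dist = [[0] * (tlen + 1) for _ in range(slen + 1)]
    for i in range(slen + 1):
        dist[i][0] = i
    for j in range(tlen + 1):
        dist[0][j] = j
    for i in range(slen):
        for j in range(tlen):
            cost = 0 if source[i] == target[j] else 1
            dist[i + 1][j + 1] = min(
                dist[i][j + 1] + 1,  # deletion
                dist[i + 1][j] + 1,  # insertion
                dist[i][j] + cost,   # substitution
            )
    return dist[-1][-1]


def all_pairs(list1, list2):
    """Hypothetical helper: distance for every (word1, word2) combination."""
    return [(a, b, lev_dist(a, b)) for a, b in product(list1, list2)]


if __name__ == '__main__':
    # Made-up sample words standing in for the contents of the two files.
    list1 = ['carlos', 'stiv', 'peter']
    list2 = ['carlo', 'steve', 'petra']
    for a, b, d in all_pairs(list1, list2):
        print(a, b, d)
```

With the lists loaded by loadWords(), a call such as all_pairs(list1, list2) could stand in for the final loop. Sorting the result by its third field would put the closest matches first.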