单词之间的删除距离

6
我试图找出需要删除多少个字符才能使两个单词相同。例如,“at”和“cat”是1,因为我可以删除“c”,“boat”和“got”是3,因为我可以删除“b”,“a”和“g”使它变成“ot”。我将这些单词放入一个带有它们计数的字典中。然后我遍历这个字典,看看那个键是否存在于另一个字典中,否则我会将差异加1。这是一种非常低效的算法吗?
但它高估了我需要删除的数量。
def deletiondistance(firstword, secondword):
dfw = {}
dsw = {}
diff = 0
for i in range(len(firstword)):
    print firstword[i]
    if firstword[i] in dfw:
        dfw[firstword[i]]+=1
    else:
        dfw[firstword[i]]=1
for j in range(len(secondword)):
    if secondword[j] in dsw:
        dsw[secondword[j]] +=1
    else:
        dsw[secondword[j]]=1

for key, value in dfw.iteritems():

    if key in dsw:
        #print "key exists"
        pass

    else:
        diff +=1

print "diff",diff

你的算法明显有误:deletiondistance("Hello", "Hello, world") 的输出应该是 0 - DYZ
它只做一个单词。 - justcurious
同样的差异:deletiondistance("Hello", "Helloworld") 返回 0 - DYZ
3个回答

6

正如@ Hulk所提到的,这与Levenshtein距离类似。唯一的区别是不允许替换,但可以通过使用替换成本为2来纠正,该成本与从两个字符串中删除字符相同。例如:

def dist(s1, s2):
    cur = list(range(len(s2) + 1))
    prev = [0] * (len(s2) + 1)
    for i in range(len(s1)):
        cur, prev = prev, cur
        cur[0] = i + 1
        for j in range(len(s2)):
            # Substitution is same as two deletions
            sub = 0 if s1[i] == s2[j] else 2
            cur[j+1] = min(prev[j] + sub, cur[j] + 1, prev[j+1] + 1)

    return cur[-1]

cases=[('cat','bat'),
       ('bat','cat'),
       ('broom', 'ballroom'),
       ('boat','got'),
       ('foo', 'bar'),
       ('foobar', '')]

for s1, s2 in cases:
    print('{} & {} = {}'.format(s1, s2, dist(s1, s2)))

输出:

cat & bat = 2
bat & cat = 2
broom & ballroom = 3
boat & got = 3
foo & bar = 6
foobar &  = 6

4

谢谢,我知道莱文斯坦距离,但我想尝试这种方法。 - justcurious

2
你可以使用difflib来完成这个任务。
例子:
import difflib

cases=[('cat','bat'),
       ('bat','cat'),
       ('broom', 'ballroom'),
       ('boat','got')]

for a,b in cases:     
    print('{} => {}'.format(a,b)) 
    cnt=0
    for i,s in enumerate(difflib.ndiff(a, b)):
        if s[0]==' ': continue
        elif s[0]=='-':
            print(u'Delete "{}" from position {}'.format(s[-1],i))
        elif s[0]=='+':
            print(u'Add "{}" to position {}'.format(s[-1],i))    
        cnt+=1  
    print("total=",cnt,"\n")

输出:

cat => bat
Delete "c" from position 0
Add "b" to position 1
total= 2 

bat => cat
Delete "b" from position 0
Add "c" to position 1
total= 2 

broom => ballroom
Add "a" to position 1
Add "l" to position 2
Add "l" to position 3
total= 3 

boat => got
Delete "b" from position 0
Add "g" to position 1
Delete "a" from position 3
total= 3 

网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接