两个单词之间的最小距离(Python)

4

以下是我能想到的两个版本。当两个词都很常见时(比如 "is" 和 "the",对于这种情况,第一版的 n1 * n2 缩放将成为一个问题),并且更能抵御恶意输入(比如只有两个单词的文件),建议使用 V2。但对于更有趣的查询(比如 "big" 和 "animal"),V1 与 V2 一样快,我可以想到更现实的语义问题,V1 可以完成,而 V2 却不能。有没有方法加速它?

import timeit t1 = timeit.default_timer()

def distance(version, filename, wordOne, wordTwo):

f = open(filename, 'rU')
text = f.read()
f.close()
index = 0
distance = index
version = int(version)
print 'inputs', filename, wordOne, wordTwo
countOne = 0
countTwo = 0

print 'version', version

if version == 1:
    word_pos = {}
    for word in text.split():
        if word in [wordOne, wordTwo]:
            if word in word_pos.keys():
                word_pos[word].append(index)
            else:
                word_pos[word] = [index]

        index += 1

    countOne = len(word_pos[wordOne])
    countTwo = len(word_pos[wordTwo])

    distances = []
    low = 0
    high = index
    for posOne in word_pos[wordOne]:
        for posTwo in word_pos[wordTwo]:
            #shrink innner loop by distance?:
            #for posTwo in range(int(posOne-distance), (posOne+distance)):
            #if abs(posOne-posTwo) < distance:
            #distance = abs(posOne-posTwo)
            distances.append(abs(posOne-posTwo))
    distance = min(distances)

elif version == 2:
    switch = 0
    indexOne = 0
    indexTwo = 0
    distance = len(text)
    for word in text.split():

        if word == wordOne:
            indexOne = index
            countOne += 1
        if word == wordTwo:
            indexTwo = index
            countTwo += 1

        if indexOne != 0 and indexTwo != 0:
            if distance > abs(indexOne-indexTwo):
                distance = abs(indexOne - indexTwo)

        index += 1

t2 = timeit.default_timer()
print 'Delta t:', t2 - t1

print 'number of words in text:', index
print 'number of occurrences of',wordOne+':', countOne
print 'number of occurrences of',wordTwo+':', countTwo
if countOne < 1 or countTwo < 1:
    print 'not all words are present'
    return 1

print 'Shortest distance between \''+wordOne+'\' and \''+wordTwo+'\' is', distance, 'words'
return distance

你的第二个版本不起作用,你会在 if word == wordOne 中遇到 NameError 错误。wordOne 是如何初始化的? - Anton Zuenko
对我有用。wordOne是一个输入。它是否对你抛出了NameError? - cosmologist
1个回答

1
在v2中昂贵的部分是if indexOne != 0 ...块。它被调用的次数与文本中剩余单词的数量一样多,一旦找到wordOnewordTwo。使用开关变量(我看到您有意使用它:))可以将if块移动到if word == wordOneif word == wordTwo中。在这种情况下,该块被调用的次数少于n1 + n2次。
这是代码。请注意,我们不再需要检查索引。
elif version == 3:
    last_word_is_one = None
    indexOne = 0
    indexTwo = 0
    countOne = 0
    countTwo = 0
    distance = len(text)
    for word in text.split():

        if word == wordOne:
            indexOne = index
            countOne += 1

            if last_word_is_one == False:
                if distance > abs(indexOne-indexTwo):
                    distance = abs(indexOne - indexTwo)

            last_word_is_one = True

        if word == wordTwo:
            indexTwo = index
            countTwo += 1

            if last_word_is_one == True:
                if distance > abs(indexOne-indexTwo):
                    distance = abs(indexOne - indexTwo)

            last_word_is_one = False

        index += 1

不错,你说得对,那正是我想做的事情 :-) - cosmologist

网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接