Why is using any() in Python so slow?

3
I am trying to search through a set of strings (sentences) and check whether they contain a specific set of substrings. To do this, I am using Python's any() function.
import time

sentences = ["I am in London tonight",
             "I am in San Fran tomorrow",
             "I am in Paris next Wednesday"]

# Imagine the following lists to contain 1000's of strings 
listOfPlaces = ["london", "paris", "san fran"] 
listOfTimePhrases = ["tonight", "tomorrow", "week", "monday", "wednesday", "month"]

start = time.time()

sntceIdxofPlaces = [pos for pos, sent in enumerate(sentences) if any(x in sent for x in listOfPlaces)]
sntceIdxofTimes = [pos for pos, sent in enumerate(sentences) if any(x in sent for x in listOfTimePhrases)]

end = time.time()

print(end-start)

If you imagine that my lists are very large, the time elapsed over the two any() statements turns out to be considerable: I get roughly 2 seconds for any() queries like these. Do you have any idea why it is taking so long, and do you know of a way to make the code faster?
Thanks.

@MorganThrapp A bit confused by this. The contents of sent weren't being checked. - Moses Koledoye
6 Answers

4

Don't enumerate your sentences twice. You can do both checks in a single pass over your sentences.

sntceIdxofPlaces = []
sntceIdxofTimes = []
for pos, sent in enumerate(sentences):
    if any(x in sent for x in listOfPlaces):
        sntceIdxofPlaces.append(pos)
    if any(x in sent for x in listOfTimePhrases):
        sntceIdxofTimes.append(pos)

3
Combining this with sets for the data lists is the way to go. - Morgan Thrapp
@joeb Thanks for this. I just tested a single for loop, along with the other suggestions. Not sure if I'm being silly, but I'm not seeing much gain. - kPow989
According to my timing tests (comparing foo and foo2 in my answer below), the gain from this is negligible. There's no benefit from using sets either, because we have to iterate through them rather than test membership in them. - jez
Hi, I realise the reason I'm not seeing much gain is probably that my list of sentences is always very short (fewer than 6). So in my case the overhead of the two enumerations isn't that significant. Thanks. - kPow989

2

There is one major inefficiency here: you are not using sets. Checking membership in a set is a very efficient operation (O(1), versus O(n) for a list).

import time

sentences = ["I am in London tonight",
             "I am in San Fran tomorrow",
             "I am in Paris next Wednesday"]

# Imagine the following lists to contain 1000's of strings 
listOfPlaces = {"london", "paris", "san fran"} 
listOfTimePhrases = {"tonight", "tomorrow", "week", "monday", "wednesday", "month"}

start = time.time()

sntceIdxofPlaces = [pos for pos, sent in enumerate(sentences) if any(x in sent for x in listOfPlaces)]
sntceIdxofTimes = [pos for pos, sent in enumerate(sentences) if any(x in sent for x in listOfTimePhrases)]

end = time.time()

print(end-start)
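
As a rough illustration of where the set/list gap comes from, here is a minimal timeit sketch (the big_list/big_set names are illustrative; note it times membership tests such as x in listOfPlaces, not the substring checks x in sent):

import timeit

big_list = [str(i) for i in range(100000)]
big_set = set(big_list)

# membership in a list scans elements one by one: O(n)
print(timeit.timeit(lambda: '99999' in big_list, number=100))
# membership in a set is a hash lookup: O(1) on average
print(timeit.timeit(lambda: '99999' in big_set, number=100))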

Thanks Morgan. I just tested with sets and my code got roughly 10x faster. If I used sent.split(), wouldn't a longer string like "San Fran" get split apart, so that I might miss it in the search? Thanks again. - kPow989
Pretty much any time you have a static, unordered dataset that you need to do lookups in, it's worth using a set. A lookup on a set is an O(1) operation, versus O(n) on a list. - Morgan Thrapp
2
The sent.split() approach catches most cases, but unfortunately misses "san fran", because it gets split into two words. - jez
@jez You're right, I completely overlooked that. I've removed that part from my answer. - Morgan Thrapp

1

Here are three alternative approaches, all of which work by searching within each sentence rather than looking things up in the lists of target phrases. All of them scale poorly as the lists of target phrases grow.

sentences = [
    "I am the walrus",
    "I am in London",
    "I am in London tonight",
    "I am in San Fran tomorrow",
    "I am in Paris next Wednesday"
]
sentences *= 1000   # really we want to examine large `listOfPlaces` and `listOfTimePhrases`, but these must be unique and so are harder to generate---this is a quicker, dirtier way to test timing scalability

# Imagine the following lists to contain 1000's of strings 
listOfPlaces = {"london", "paris", "san fran"}
listOfTimePhrases = {"tonight", "tomorrow", "week", "monday", "wednesday", "month"}

# preprocess to guard against substring false positives and case-mismatch false negatives:
sentences = [ ' ' + x.lower() + ' ' for x in sentences ]
listOfPlaces = { ' ' + x.lower() + ' ' for x in listOfPlaces }
listOfTimePhrases = { ' ' + x.lower() + ' ' for x in listOfTimePhrases }

#listOfPlaces = list( listOfPlaces )
#listOfTimePhrases = list( listOfTimePhrases )

def foo():
    sntceIdxofPlaces = [pos for pos, sentence in enumerate(sentences) if any(x in sentence for x in listOfPlaces)]
    sntceIdxofTimes  = [pos for pos, sentence in enumerate(sentences) if any(x in sentence for x in listOfTimePhrases)]
    return sntceIdxofPlaces, sntceIdxofTimes

def foo2():
    sntceIdxofPlaces = []
    sntceIdxofTimes = []
    for pos, sentence in enumerate(sentences):
        if any(x in sentence for x in listOfPlaces): sntceIdxofPlaces.append(pos)
        if any(x in sentence for x in listOfTimePhrases): sntceIdxofTimes.append(pos)
    return sntceIdxofPlaces, sntceIdxofTimes

def foo3():
    sntceIdxofPlaces = []
    sntceIdxofTimes = []
    for pos, sentence in enumerate(sentences):
        for x in listOfPlaces:
            if x in sentence: sntceIdxofPlaces.append(pos); break
        for x in listOfTimePhrases:
            if x in sentence: sntceIdxofTimes.append(pos); break
    return sntceIdxofPlaces, sntceIdxofTimes

Here are the timing results:

In [171]: timeit foo()
100 loops, best of 3: 15.6 ms per loop

In [172]: timeit foo2()
100 loops, best of 3: 16 ms per loop

In [173]: timeit foo3()
100 loops, best of 3: 8.07 ms per loop

It seems that any() might be inefficient, which surprises me: it may be running its input generator all the way to the end even once a match has been found and the answer is known. I understand it isn't supposed to work that way, but I can't otherwise explain the factor-of-2 difference in runtime between foo2() and foo3(), which appear to give identical output.
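
One way to check that speculation directly is to hand any() a generator with a visible side effect (a minimal sketch; noisy() is a hypothetical helper, and per the documentation any() is supposed to short-circuit):

def noisy(values):
    for v in values:
        print("checking", v)   # side effect shows how far any() consumes the generator
        yield v

print(any(noisy([False, True, False, False])))
# prints "checking False", "checking True", then True:
# the generator is not advanced past the first truthy item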

Also: since listOfPlaces and listOfTimePhrases are iterated over, rather than tested for membership, whether they are sets or lists makes no difference to the timing.


The any function does not perform the way this answer describes... - double_j
I wouldn't have thought so either, but the timing results are striking. And they are repeatable (I ran them several times in different orders, with anaconda Python 2.7.12 on a 2013 MacBook). The three functions appear to return identical output, so if one of them has a bug, I can't see it. - jez
If I understand correctly, something similar is reported here: https://www.dotnetperls.com/any-python - kPow989

0

Use sets instead of lists for listOfPlaces and listOfTimePhrases. Sets have much faster lookup times:

listOfPlaces = set(["london", "paris", "san fran"])
listOfTimePhrases = set(["tonight", "tomorrow", "week", "monday", "wednesday", "month"])

0

Here is a solution that takes advantage of fast lookups in listOfPlaces and listOfTimePhrases as sets, while also accounting for possible multi-word phrases. It runs quickly even when listOfPlaces and listOfTimePhrases contain thousands of elements.

sentences = [
    "I am the walrus",
    "I am in London",
    "I am in London tonight",
    "I am in San Fran tomorrow",
    "I am in Paris next Wednesday"
]
sentences *= 1000

# Imagine the following lists to contain 1000's of strings 
placePhrases = {"london", "paris", "san fran"}
timePhrases = {"tonight", "tomorrow", "week", "monday", "wednesday", "month"}


# preprocess to guard against case-mismatch false negatives:
sentences = [ x.lower() for x in sentences ]
placePhrases = { x.lower() for x in placePhrases }
timePhrases = { x.lower() for x in timePhrases }

# create additional sets in which incomplete phrases can be looked up:
placePrefixes = set()
for phrase in placePhrases:
    words = phrase.split()
    if len(words) > 1:
        for i in range(1, len(words)):
            placePrefixes.add(' '.join(words[:i]))
timePrefixes = set()
for phrase in timePhrases:
    words = phrase.split()
    if len(words) > 1:
        for i in range(1, len(words)):
            timePrefixes.add(' '.join(words[:i]))

def scan(sentences):
    sntceIdxofPlaces = []
    sntceIdxofTimes = []
    for pos, sentence in enumerate(sentences):
        hasPlace = hasTime = False
        placePrefix = timePrefix = ''
        for word in sentence.split():
            if not hasPlace:
                placePhrase = placePrefix + (' ' if placePrefix else '') + word
                if placePhrase in placePrefixes:
                    placePrefix = placePhrase
                else:
                    placePrefix = ''
                    if placePhrase in placePhrases: hasPlace = True
            if not hasTime:
                timePhrase = timePrefix + (' ' if timePrefix else '') + word
                if timePhrase in timePrefixes:
                    timePrefix = timePhrase
                else:
                    timePrefix = ''
                    if timePhrase in timePhrases: hasTime = True
            if hasTime and hasPlace: break
        if hasPlace: sntceIdxofPlaces.append(pos)
        if hasTime: sntceIdxofTimes.append(pos)
    return sntceIdxofPlaces, sntceIdxofTimes
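
For reference, a quick hypothetical call on the 5000-sentence test data above; the expected leading indices follow from which of the five example sentences mention a place and/or a time phrase:

placesIdx, timesIdx = scan(sentences)
print(placesIdx[:4])   # [1, 2, 3, 4] -- sentences mentioning "london", "san fran" or "paris"
print(timesIdx[:4])    # [2, 3, 4, 7] -- sentences mentioning "tonight", "tomorrow" or "wednesday"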

0

I managed to find a much faster way of doing this, using scikit-learn's CountVectorizer.

import numpy as np
from sklearn import feature_extraction

sentences = ["I am in London tonight",
             "I am in San Fran tomorrow",
             "I am in Paris next Wednesday"]

listOfPlaces = ["london", "paris", "san fran"]

# ngram_range=(1, 2) lets the two-word phrase "san fran" be counted;
# with the default (1, 1) its vocabulary column would always be zero
cv = feature_extraction.text.CountVectorizer(vocabulary=listOfPlaces, ngram_range=(1, 2))

# We now get in the next step a vector of size len(sentences) x len(listOfPlaces)           
taggedSentences = cv.fit_transform(sentences).toarray()

aggregateTags = np.sum(taggedSentences, axis=1)

We end up with a vector of size len(sentences) x 1, where each row records how many times the words from the word list appear in the corresponding sentence.
I found the results to be very fast even with large datasets (e.g. 0.02 seconds).
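
To recover the same kind of index list as the original any()-based comprehension, one could (as a hypothetical follow-up) take the rows with a non-zero count:

sntceIdxofPlaces = np.nonzero(aggregateTags)[0].tolist()
print(sntceIdxofPlaces)   # [0, 1, 2] for the three example sentences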
