如何快速在Python字符串中查找单词

Question

如何快速在Python字符串中查找单词

3

我想确定给定字符串中的单词，这些字符串是域名。我有大约5000个域名和一个包含60000个字典单词的字典需要检查。这将导致每个域名检查60000次，总计大约300.000.000次操作，这就是疯狂。

因此，我想问是否有更聪明的方法来解决这个问题，仍然可以得到字符串中存在的单词。

我已经尝试过使用简单的循环来做到这一点，但我认为这需要一个更聪明的解决方案才能处理大量的检查。

dictionary_of_words = ["I", "Stack", "overflow", "like", etc]
AllDomains = ["stackoverflow.com", "iLikeStackoverflow.com", etc]

def in_dictionary(AllDomains):
    #Setting a new column
    AllDomains["dictionary"] = False
    AllDomains["words"] = None

    for i in range(len(AllDomains)):
        # Scan if the entire word is in the dictionary
        if AllDomains["domain"].str.strip(".nl").str.lower().iloc[i] in dictionary_of_words:
            AllDomains["dictionary"].iloc[i] = True
            print(AllDomains["domain"].iloc[i])

        # Scan which words there are in the domain
        else:
            for word in dictionary_of_words:
                print(word)
                if word in AllDomains["domain"].str.strip(".nl").str.lower().iloc[i]:
                    if AllDomains["words"].iloc[i] == None:
                        AllDomains["words"].iloc[i] = word
                    else:
                        AllDomains["words"].iloc[i] = AllDomains["words"].iloc[i] + f", {word}"

                    print(AllDomains["domain"].iloc[i])

    return AllDomains

- Sil Bouwman

看一下Python字典数据结构。 - jdigital

好的，用于模式匹配的算法是Aho-Corasick。首先将字典预处理成一棵trie树，然后再将字符串输入进去。 - conditionalMethod

4个回答

1

我并没有看到其他解决方案，但你可以通过以下方式简化你的代码：

dictionary_of_words = ["I", "Stack", "overflow", "like", 'etc']
AllDomains = ["stackoverflow.com", "iLikeStackoverflow.com", 'etc']

# this is just to be sure they have the same case
dictionary_of_words = list(map(str.lower, dictionary_of_words))
AllDomains = list(map(str.lower, AllDomains))

for word in dictionary_of_words:
    matches = [domain for domain in AllDomains if word in domain]
    print(word, matches)

我将转换为小写字母，只是为了与示例数据更多地匹配，不确定您是否需要，但域名始终是小写字母。

此外，600000个单词很多，您确定其中没有重复吗？

- mucio

1

你需要的是一种算法解决方案，而不是基于Python的解决方案。

因此，我建议使用一种名为Aho-Corasick字符串匹配算法的算法。使用此算法，您可以一次扫描它们所有。（但是准备时会有一些开销）

该算法同时搜索给定字符串的几个模式。在您的情况下，该算法搜索所有dictionary_of_words中的"stackoverflow.com"。因此，我们可以很容易地使用for循环搜索所有域。

这是我为您的要求编写的代码。算法来源


class AhoNode:
    def __init__(self):
        self.goto = {}
        self.out = []
        self.fail = None

def aho_create_forest(patterns):
    root = AhoNode()

    for path in patterns:
        node = root
        for symbol in path:
            node = node.goto.setdefault(symbol, AhoNode())
        node.out.append(path)
    return root


def aho_create_statemachine(patterns):
    root = aho_create_forest(patterns)
    queue = []
    for node in root.goto.itervalues():
        queue.append(node)
        node.fail = root

    while len(queue) > 0:
        rnode = queue.pop(0)

        for key, unode in rnode.goto.iteritems():
            queue.append(unode)
            fnode = rnode.fail
            while fnode != None and not fnode.goto.has_key(key):
                fnode = fnode.fail
            unode.fail = fnode.goto[key] if fnode else root
            unode.out += unode.fail.out

    return root


def aho_find_all(s, root, callback):
    node = root

    for i in xrange(len(s)):
        while node != None and not node.goto.has_key(s[i]):
            node = node.fail
        if node == None:
            node = root
            continue
        node = node.goto[s[i]]
        for pattern in node.out:
            callback(i - len(pattern) + 1, pattern)

############################
# Demonstration of work
def on_occurence(pos, patterns):
    print ("At pos %s found pattern: %s" % (pos, patterns))



AllDomains = ["stackoverflow.com", "iLikeStackoverflow.com"]
patterns = ["I", "Stack", "overflow", "like"]

root = aho_create_statemachine(patterns)
for i in AllDomains:
  print(i)
  aho_find_all(i, root, on_occurence)
  print ""

这将产生类似于以下结果的输出：

stackoverflow.com
At pos 5 found pattern: overflow

iLikeStackoverflow.com
At pos 5 found pattern: Stack
At pos 10 found pattern: overflow

- Tharaka Ratnayake

太棒了！谢谢，听起来很不错。我今晚会试一下，看看现在需要多长时间。 - Sil Bouwman

当然，请告诉我您的时间安排。我也很好奇。 :-) - Tharaka Ratnayake

FYI，已经存在一些使用C加速的Aho-Corasick包，例如pyahocorasick，它应该表现更好，并且需要较少的自定义代码来维护。此答案中的代码仅适用于Py2，还有其他问题。 - ShadowRanger

我在尝试时遇到了这个错误：File "Models/inDictionary.py", line 47, in aho_create_statemachine for node in root.goto.itervalues(): AttributeError: 'dict' object has no attribute 'itervalues' - Sil Bouwman

它是Python 2.7中的函数。如果您使用的是Python 3.6，请告诉我。 - Tharaka Ratnayake

0

我的解决方案是将所有域名混合成一个大的文本汤。然后，您只需要循环600,000次来查找单词。希望能有所帮助。

import re

dictionary_of_words = ["I", "Stack", "overflow", "like", "etc"]
AllDomains = ["stackoverflow.com", "iLikeStackoverflow.com", "etc"]

domain_soup = " ".join(AllDomains).lower()

word_in_domain = {word: re.findall(r"[\S]*{}[\S]*".format(word.lower()), domain_soup) for word in dictionary_of_words}

输出：

{'I': ['ilikestackoverflow.com'],
 'Stack': ['stackoverflow.com', 'ilikestackoverflow.com'],
 'overflow': ['stackoverflow.com', 'ilikestackoverflow.com'],
 'like': ['ilikestackoverflow.com'],
 'etc': ['etc']}

此外，您可以考虑使用“set”来删除重复的单词或域名。

- Tyrion

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- Matt Timmermans · Accepted Answer

将所有单词放入一个集合中（不是列表），这样您就可以非常快速地检查字符串是否与任何单词匹配。
将所有单词的长度放入一个集合中。
对于每个长度，对于每个域名，获取该长度的所有子字符串，并检查它是否在单词集合中。

由于您可能最终会得到大约10个长度，而大多数域名都少于20个字符，因此每个域名只需要进行几百次快速检查即可。