Python - Find all the subwords that can be found inside of a word

Question

7

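Ultimately, I want to find out which word in the English dictionary contains the most subwords of at least three letters. I wrote this algorithm, but it is too slow to be useful. I'm wondering how I can optimize it.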

def subWords(word):
    return set((word[0:i] for i in range(2, len(word)+1))) #returns all subWords of length 2 or greater

def checkDict(wordList, dictList):
    return set((word for word in wordList if word in dictList))

def main():
    dictList = [i.strip() for i in open('wordlist.txt').readlines()]
    allwords = list()
    maximum = (0, list())

    for dictWords in dictList:
        for i in range (len(dictWords)):
            for a in checkDict(subWords(dictWords[i: len(dictWords) + 1]), dictList):
                allwords.append(a)

        if len(allwords) > maximum[0]:
            maximum = (len(allwords), allwords)

        print maximum
        allwords = list()

    print maximum 
main()

- Parseltongue

You use sets everywhere except where it really matters: "if word in dictList" should be "if word in dictSet". - user3850

By the way, your description says "three letters", but your code only requires two. - user3850
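
To illustrate the first comment, here is a minimal sketch (Python 3, with a made-up five-word dictionary) that runs the same filter against a list and against a set. The names `dict_list`, `dict_set` and `candidates` are illustrative and not from the question. Both calls return the same result; the difference is that `in` on a list scans element by element, while `in` on a set is a hash lookup.

```python
def check_dict(word_list, dictionary):
    """Return the members of word_list that are also in dictionary."""
    return {word for word in word_list if word in dictionary}

# Hypothetical tiny dictionary, for illustration only.
dict_list = ["and", "gran", "grand", "ran", "rand"]
dict_set = set(dict_list)  # built once, reused for every lookup

candidates = ["gr", "gra", "gran", "grand"]  # prefixes of "grand"

from_list = check_dict(candidates, dict_list)  # linear scan per lookup
from_set = check_dict(candidates, dict_set)    # hash lookup per lookup
print(from_list == from_set)
```

With a real word list of tens of thousands of entries, the list version repeats that linear scan for every candidate, which is where the original code spends its time.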

5 Answers

7

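The major weakness of your algorithm is that for every subword, you need to compare it against every word in the dictionary. You don't actually need to do that: if your word starts with 'a', there is no need to check whether it matches words that start with 'b'. If the next letter is 'c', there is no point comparing it against words that start with 'd'. The question then becomes: "How do I implement this idea efficiently?"

To do this, we can build a tree that represents all of the words in the dictionary. We construct it by taking each word in the dictionary, extending the tree with it letter by letter, and coloring the final node.

When we want to test whether a subword is in this tree, we walk through the subword letter by letter and use each letter to decide where to go next in the tree (starting at the top). If we find there is nowhere to go, or if we land on an uncolored node after consuming the whole subword, then it is not a word. Otherwise, if we land on a colored node, it is a word. The effect is that we can search the entire dictionary at once, rather than one word at a time. The cost, of course, is some set-up at the start, but that is not a great price to pay if the dictionary contains a lot of words.

Well, that's all pretty fantastic! Let's try implementing it: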

class Node:
    def __init__( self, parent, valid_subword ):
        self.parent = parent
        self.valid_subword = valid_subword
        self.children = {}

    # Extend the tree with a new node, or reuse the existing child
    def extend( self, transition, makes_valid_word ):
        if transition in self.children:
            if makes_valid_word:
                # The flag lives in valid_subword (set in __init__)
                self.children[transition].valid_subword = True
        else:
            self.children[transition] = Node( self, makes_valid_word )
        return self.children[transition]

def generateTree( allwords ):
    tree = Node( None, False )
    for word in allwords:
        current_node = tree
        for i in range(len(word)):
            # Only the node for the last letter of a word gets "colored"
            current_node = current_node.extend( word[i], i == len(word) - 1 )
    return tree

def checkDict( word, tree ):
    current_node = tree
    for letter in word:
        try:
            current_node = current_node.children[letter]
        except KeyError:
            return False

    return current_node.valid_subword

Then, later on:

tree = generateTree(allWords)  # build the tree once, up front
for word in allWords:
  for subword in subWords(word):
    checkDict(subword, tree)
    #Code to keep track of the number of words found, like you already have

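This algorithm lets you check whether a word is in the dictionary in O(m) time, where m is the length of the longest word in the dictionary. Notice that this stays roughly constant no matter how many words the dictionary contains. Your original algorithm was O(n) per check, where n is the number of words in the dictionary.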
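
As a compact variant of the same idea, the sketch below (Python 3) stores the tree as nested dictionaries instead of `Node` objects. This is not the code from this answer, only an alternative representation; the `END` marker and the names `build_trie` and `in_trie` are assumptions made for illustration, and the words are assumed never to contain the `$` character. The lookup loop runs once per letter of the queried word, which is the O(m) bound described above.

```python
END = "$"  # marker key meaning "a word ends here" (arbitrary choice)

def build_trie(words):
    """Build a nested-dict trie; each level maps a letter to a sub-trie."""
    root = {}
    for word in words:
        node = root
        for letter in word:
            node = node.setdefault(letter, {})
        node[END] = True  # "color" the node where the word ends
    return root

def in_trie(trie, word):
    """Walk one letter at a time; at most len(word) dictionary lookups."""
    node = trie
    for letter in word:
        if letter not in node:
            return False  # nowhere to go: not a word
        node = node[letter]
    return END in node    # word only if we stopped on a colored node

trie = build_trie(["and", "grand", "ran", "rand"])
print(in_trie(trie, "rand"))  # a stored word
print(in_trie(trie, "gra"))   # only a prefix of a stored word
print(in_trie(trie, "rant"))  # falls off the tree
```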

- Slubb

1

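Nice algorithm, but I bet it is too low-level for Python. There are too many things that can go wrong, and a set lookup is probably faster in Python. - user3850

I don't know whether your code works, but +1 for the hand-drawn pictures. - Gerrat

Amazing pictures and explanation, +1 - Parseltongue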

3

To explore basic Python, have a look at this function (basically a faster, more polished, PEP8-compliant version of what JBernardo and Karl Knechtel suggested):

def check_dict(word, dictionary): 
  """Return all subwords of `word` that are in `dictionary`."""
  fragments = set(word[i:j] 
                  for i in xrange(len(word) - 2) 
                  for j in xrange(i + 3, len(word) + 1))
  return fragments & dictionary

dictionary = frozenset(word for word in word_list if len(word) >= 3)
print max(((word, check_dict(word, dictionary)) for word in dictionary), 
          key=lambda (word, subwords): len(subwords)) # max = the most subwords

Output is something like:

('greatgrandmothers',
set(['and', 'rand', 'great', 'her', 'mothers', 'moth', 'mother', 'others', 'grandmothers', 'grandmother', 'ran', 'other', 'greatgrandmothers', 'greatgrandmother', 'grand', 'hers', 'the', 'eat']))

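This is for the word list from http://www.mieliestronk.com/wordlist.html.

I know you're not after performance (the code above already runs in under 1s on that standard English vocabulary of 58k words).

But in case you need to run this super fast, say in some inner loop, consider the following:

Avoid creating copies of all the substrings inside check_dict on the heap; that is the main performance killer.
You can do that with pointer arithmetic, representing each substring by pointer delimiters only (rather than as a full object).
Use that substring to quickly determine whether it is part of the valid vocabulary:
- use the trie data structure, or its memory-friendly version, the PATRICIA tree
- build the static trie once from your dictionary, then do fast substring lookups
- incrementally shuffle the pointers to explore all possible substrings, incrementing a hit counter for valid words
- this way you avoid any dynamic allocations (no strings, no sets), which is blazing fast!
None of this is very relevant in Python, because such memory management is too low-level, and you would be better off not using Python for performance-critical code anyway.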
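
The steps above can still be sketched in Python 3, using integer indices in place of pointers. This only illustrates the control flow (walk the trie from every start position, bump a counter on each hit, never slice the word); it says nothing about the speed a lower-level language would reach. The helper names, the `$` end marker and the sample words are assumptions, not part of the answer.

```python
def build_trie(words):
    """Nested-dict trie; the key "$" marks the end of a stored word."""
    root = {}
    for word in words:
        node = root
        for letter in word:
            node = node.setdefault(letter, {})
        node["$"] = True
    return root

def count_subwords(word, trie, min_len=3):
    """Count substrings of word (length >= min_len) that are stored words.

    Only the indices start and end move; no substring objects are built.
    """
    hits = 0
    for start in range(len(word)):
        node = trie
        for end in range(start, len(word)):
            node = node.get(word[end])
            if node is None:
                break  # no stored word continues along this path
            if "$" in node and end - start + 1 >= min_len:
                hits += 1
    return hits

# Hypothetical miniature dictionary, assumed to contain letters only.
words = ["an", "and", "gran", "grand", "ran", "rand"]
trie = build_trie(words)
print(count_subwords("grand", trie))  # gran, grand, ran, rand, and
```

Unlike the set-based `check_dict`, this counts positions rather than distinct strings, so a subword that occurs twice inside a word is counted twice.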

- Radim

1

This program runs in just a few seconds. "sowpods.txt" contains 267627 words of 3 or more letters. If you are using Python 2.5 or 2.6, you need to use at_least_3 = set(w for w in words if len(w)>=3)

words = open("sowpods.txt").read().split()

at_least_3 = {w for w in words if len(w)>=3}

def count_subwords(word):
    counter = 0
    for i in range(len(word)-2):
        for j in range(i+3,len(word)+1):
            candidate = word[i:j]
            if candidate in at_least_3:
                counter += 1
    return counter

for row in sorted((count_subwords(w),w) for w in at_least_3):
    print row

The maximum number of subwords is 26

(26, 'CORESEARCHERS')
(26, 'FOREGONENESSES')
(26, 'METAGENETICALLY')
(26, 'PREPOSSESSIONS')
(26, 'SACRAMENTALISTS')
(26, 'WHOLESOMENESSES')

- John La Rooy

0

Is this what you're asking, or am I missing something?

>>> words = ['a', 'asd', 'asdf', 'bla']
>>> [sum(1 for i in (a for a in words if a in b)) for b in words]
[1, 2, 3, 2]

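This is the number of words contained in each word, including itself. If you don't want to count words with fewer than 3 characters, just remove them...

And of course, it is O(n²)

Edit:

The question asks for all the subwords, but the code only asks for the word with the most subwords... If you really want the first behaviour, just remove the sum(...) part and turn the genexp into a list comprehension...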
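
A small sketch of what that edit describes, using the same four sample words (Python 3). The variable names `contained` and `contained_3` are illustrative; the second expression adds the three-character filter mentioned above.

```python
words = ['a', 'asd', 'asdf', 'bla']

# For each word b, list every word a from the list that occurs inside b.
contained = [[a for a in words if a in b] for b in words]
print(contained)
# [['a'], ['a', 'asd'], ['a', 'asd', 'asdf'], ['a', 'bla']]

# Same thing, skipping candidates shorter than three characters.
contained_3 = [[a for a in words if len(a) >= 3 and a in b] for b in words]
print(contained_3)
# [[], ['asd'], ['asd', 'asdf'], ['bla']]
```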

- JBernardo

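Because I can't even figure out what that thing is supposed to do. - user3850

2

@hop Then you need to learn a little Python... sum(1 for i in ...) is very common and means len(iterable). - JBernardo

1

I am very familiar with that idiom. This answer doesn't teach anything about good coding style, doesn't clearly explain what is going on, makes no attempt to follow the OP's code, and most likely doesn't solve the problem any faster. - user3850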

- Karl Knechtel · Accepted Answer

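1) Style and organization: it makes more sense to have a single function that generates all the subwords of a word.

2) Style: the double parentheses are not needed when using set.

3) Performance (I hope): make a set from the words you're looking up, too; then you can use the built-in set intersection check.

4) Performance (almost certainly): don't loop manually to find the maximum element; use the built-in max. You can compare (length, elements) tuples directly; Python compares tuples element by element, from the first to the last, just as if each element were a letter in a string.

5) Performance (maybe): make sure there are no 1- or 2-letter words in the dictionary to begin with, since they just get in the way.

6) Performance (sad but true): don't break everything up into separate functions.

7) Style: file I/O should use a with block to ensure proper cleanup of resources, and file iterators iterate over lines by default, so we can get the list of lines implicitly instead of having to call .readlines().

End result (not properly tested, except for the 'fragments' expression):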
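
Point 4 can be seen on a few made-up (count, word) tuples (Python 3; the counts are invented for illustration). `max` compares the first elements and only looks at the second when the first ones tie. Passing `key=` instead compares on the count alone, in which case `max` returns the first maximal item it meets.

```python
# Hypothetical (subword_count, word) pairs; the numbers are invented.
results = [
    (2, "cat"),
    (5, "brand"),
    (5, "grand"),
    (3, "other"),
]

# Tuple comparison: 5 ties with 5, then "grand" > "brand" decides.
best = max(results)
print(best)  # (5, 'grand')

# Comparing on the count only: the first item with the top count wins.
best_by_count = max(results, key=lambda pair: pair[0])
print(best_by_count)  # (5, 'brand')
```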

def countedSubWords(word, dictionary): 
  fragments = set(
    word[i:j]
    for i in range(len(word)) for j in range(i+3, len(word)+1)
  )
  subWords = fragments.intersection(dictionary)
  return (len(subWords), subWords)


def main():
  with open('wordlist.txt') as words:
    dictionary = set(word.strip() for word in words if len(word.strip()) > 2)
    print max(countedSubWords(word, dictionary) for word in dictionary)
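
For readers on Python 3, here is a hedged port of the same approach. It is a sketch, not the answer's tested code: the function names, the `min_len` parameter and the in-memory `sample` lines (standing in for `wordlist.txt`) are assumptions. It also passes `key=` to `max`, because in Python 3 a tie on the length would fall through to comparing the sets, and `<` on sets tests for a proper subset rather than giving a total order.

```python
def counted_subwords(word, dictionary, min_len=3):
    """Return (count, subwords) for the dictionary words found inside word."""
    fragments = {
        word[i:j]
        for i in range(len(word))
        for j in range(i + min_len, len(word) + 1)
    }
    sub_words = fragments & dictionary
    return len(sub_words), sub_words

def best_word(lines, min_len=3):
    """Return (word, count, subwords) for the word with the most subwords."""
    dictionary = {w.strip() for w in lines if len(w.strip()) >= min_len}
    return max(
        ((word,) + counted_subwords(word, dictionary, min_len)
         for word in dictionary),
        key=lambda item: item[1],  # compare on the count only
    )

# Stand-in for iterating over an open word-list file, one word per line.
sample = ["an\n", "and\n", "gran\n", "grand\n", "ran\n", "rand\n", "other\n"]
word, count, found = best_word(sample)
print(word, count, sorted(found))
# grand 5 ['and', 'gran', 'grand', 'ran', 'rand']
```

With a real file, `best_word` could be called as `best_word(open_file)` inside a `with open(...)` block, since a file object iterates over its lines.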