Splitting a string into individual words in Python

Question

python string trie

4

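I have a large list of domain names (around six thousand), and I would like to see which words trend the highest, to get a rough overview of our portfolio.

The problem is that the list is formatted as domain names, for example: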

examplecartrading.com

examplepensions.co.uk

exampledeals.org

examplesummeroffers.com

+5996

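Just running a word count produces garbage. So I guess the simplest approach is to insert spaces between whole words and then run a word count.

For the sake of my sanity, I would prefer to script this.

I know very little Python 2.7, but I am open to any suggestions, and a code example would really help. I have been told that a simple string trie data structure would be the simplest way to achieve this, but I have no idea how to implement one in Python.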

- Christopher Long

1

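How do you determine the word boundaries? Do you have a dictionary of expected words? What if there are two different ways to split? - Tim Pietzcker

As far as I know, the portfolio consists 100% of English words, so I guess I would need to compare it against a full English dictionary? If there is a certain number of wrong results, I can review them afterwards. - Christopher Long

Possible duplicate: How to sort all possible words out of a string - Lauritz V. Thaulow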

3 Answers

1

with open('/usr/share/dict/words') as f:
  # a set makes the membership tests below much faster than a list
  words = set(w.strip() for w in f)

def guess_split(word):
  result = []
  # start at 1 so neither half is the empty string
  for n in xrange(1, len(word)):
    if word[:n] in words and word[n:] in words:
      result = [word[:n], word[n:]]
  return result


from collections import defaultdict
word_counts = defaultdict(int)
with open('blah.txt') as f:
  for line in f.readlines():
    for word in line.strip().split('.'):
      if len(word) > 3:
        # junks the com , org, stuff
        for x in guess_split(word):
          word_counts[x] += 1

for spam in word_counts.items():
  print '{word}: {count}'.format(word=spam[0],count=spam[1])

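Here is a brute-force method that only tries to split a domain into two English words. If the domain cannot be split into two English words, it is discarded. Extending this to attempt more splits should be straightforward, but unless you are clever about it, it will probably not scale well with the number of splits. Fortunately, I guess you will need 3 or 4 splits at most.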

Output:

deals: 1
example: 2
pensions: 1
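
The extension to more than two pieces mentioned above could be sketched roughly as follows. This is not part of the original answer: `split_into_words`, `max_parts`, and the tiny `WORDS` set are illustrative assumptions, and the code is written for Python 3 rather than 2.7. Caching the result for each (remainder, parts left) pair is one way to keep the work from exploding as the number of splits grows.

```python
from functools import lru_cache

# Illustrative stand-in for a real dictionary such as /usr/share/dict/words.
WORDS = frozenset(["example", "car", "trading", "summer", "offers", "deals"])


@lru_cache(maxsize=None)
def split_into_words(name, max_parts=4):
    """Return every way to cut name into 1..max_parts words from WORDS."""
    if max_parts == 0 or not name:
        return ()
    results = []
    if name in WORDS:
        results.append((name,))
    for i in range(1, len(name)):
        head = name[:i]
        if head in WORDS:
            # Only recurse when the head is a word; cached calls are reused.
            for rest in split_into_words(name[i:], max_parts - 1):
                results.append((head,) + rest)
    return tuple(results)


print(split_into_words("examplecartrading"))
print(split_into_words("examplesummeroffers", 2))
```

With `max_parts=2` the second name yields no split, because it needs three words; raising the limit to 3 or 4 would cover it.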

- wim

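Thank you for your reply, but lazyr is correct: what I am after is a count of the words that make up the domain names. - Christopher Long

OK, sorry, that is a harder problem because of the ambiguity. I will modify my code accordingly. - wim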

1

Assuming you only have a few thousand standard domains, you should be able to do all of this in memory.

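from collections import Counter

# domainfile and DictionaryFileOfEnglishLanguage are placeholders, and
# all_sub_strings is assumed to yield every substring of its argument.
domains=open(domainfile)
# strip the trailing newlines, otherwise no substring can ever match
dictionary=set(w.strip() for w in DictionaryFileOfEnglishLanguage)
found=[]
for domain in domains:
    for substring in all_sub_strings(domain.strip()):
        if substring in dictionary:
            found.append(substring)
c=Counter(found) #this is what you want

print c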
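
`all_sub_strings` is not defined in the answer. A minimal sketch of what it might look like, assuming it is meant to yield every contiguous, non-empty substring (the name comes from the answer, the body is a guess, written for Python 3):

```python
def all_sub_strings(text):
    """Yield every contiguous, non-empty substring of text (hypothetical helper)."""
    for start in range(len(text)):
        for end in range(start + 1, len(text) + 1):
            yield text[start:end]


print(list(all_sub_strings("car")))
```

Counting every dictionary substring this way would also count any word that merely sits inside a longer one (for instance "ample" inside "example", if the dictionary contains it), so the totals are only a rough guide.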

- Rusty Rob

- Lauritz V. Thaulow · Accepted Answer

We try to split the domain name (s) into any number of words (not just 2) taken from a set of known words (words). Recursion to the rescue!

def substrings_in_set(s, words):
    if s in words:
        yield [s]
    for i in range(1, len(s)):
        if s[:i] not in words:
            continue
        for rest in substrings_in_set(s[i:], words):
            yield [s[:i]] + rest

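This generator function first yields the string it was called with, if that string is in words. Then it splits the string in two in every possible way. If the first part is not in words, it moves on to the next split point. If it is, the first part is prepended to every result of calling itself on the second part (there may be no such results, as in ["example", "cart", ...]).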
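
To make that behaviour concrete, here is a small usage sketch. The function is copied from the answer; the seven-word set is an illustrative assumption standing in for a full dictionary, and the code is written for Python 3.

```python
def substrings_in_set(s, words):
    if s in words:
        yield [s]
    for i in range(1, len(s)):
        if s[:i] not in words:
            continue
        for rest in substrings_in_set(s[i:], words):
            yield [s[:i]] + rest


# Tiny illustrative word set, not a real dictionary.
words = {"example", "car", "cart", "trading", "pens", "ions", "pensions"}

for name in ("examplecartrading", "examplepensions"):
    print(name, list(substrings_in_set(name, words)))
```

For the first name, the branch that starts with "example", "cart" produces nothing because "rading" cannot be split into known words, so only "example", "car", "trading" survives. The second name yields two splits, one ending in "pensions" and one ending in "pens", "ions".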

Next we build the English dictionary:

# Assuming Linux. Word list may also be at /usr/dict/words.
# If not on Linux, grab yourself an English word list and insert here:
words = set(x.strip().lower() for x in open("/usr/share/dict/words").readlines())

# The above English dictionary for some reason lists all single letters as words.
# Remove all except "a", "i" and "u" (remember a string is an iterable, which means
# that set("abc") == set(["a", "b", "c"])).
words -= set("bcdefghjklmnopqrstvwxyz")

# If there are more words we don't like, we remove them like this:
words -= set(("ex", "rs", "ra", "frobnicate"))

# We may also add words that we do want to recognize. Now the domain name
# slartibartfast4ever.co.uk will be properly counted, for instance.
words |= set(("4", "2", "slartibartfast"))

Now we can put things together:

count = {}
no_match = []
domains = ["examplecartrading.com", "examplepensions.co.uk", 
    "exampledeals.org", "examplesummeroffers.com"]

# Assume domains is the list of domain names ["examplecartrading.com", ...]
for domain in domains:
    # Extract the part in front of the first ".", and make it lower case
    name = domain.partition(".")[0].lower()
    found = set()
    for split in substrings_in_set(name, words):
        found |= set(split)
    for word in found:
        count[word] = count.get(word, 0) + 1
    if not found:
        no_match.append(name)

print count
print "No match found for:", no_match

Result: {'ions': 1, 'pens': 1, 'summer': 1, 'car': 1, 'pensions': 1, 'deals': 1, 'offers': 1, 'trading': 1, 'example': 4}

Using a set to hold the English dictionary makes membership checks fast. -= removes items from the set, and |= adds items to it.
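
A short illustration of those set operations, using a made-up four-word set rather than the real dictionary (Python 3):

```python
words = {"a", "b", "ex", "example"}

print("example" in words)   # hash-based membership test

words -= set("bc")          # removes "b"; "c" was never present, which is fine
words -= {"ex"}             # drop an unwanted word
words |= {"4", "slartibartfast"}  # add words we do want to recognize

print(sorted(words))
```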

Using the all function together with a generator expression improves efficiency, since all returns at the first False.
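
The `substrings_in_set` shown above does not call `all`, so this remark presumably refers to an earlier revision of the code. As a hedged sketch of the short-circuit behaviour it describes (`is_valid_split` and the `lookups` list are hypothetical, added only to make the early exit visible; Python 3):

```python
def is_valid_split(parts, words, lookups):
    """Return True if every part is a known word; record each lookup made."""
    def known(part):
        lookups.append(part)
        return part in words
    # all() consumes the generator lazily and stops at the first False.
    return all(known(part) for part in parts)


words = {"example", "car", "trading"}
lookups = []
print(is_valid_split(["example", "cart", "rading"], words, lookups))
print(lookups)
```

Because "cart" is not in the set, the third part is never looked up.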

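Some substrings may be valid words both as a whole and when split, e.g. "example" / "ex" + "ample". In some cases we can solve this by excluding unwanted words, such as "ex" in the code example above. In other cases, such as "pensions" / "pens" + "ions", it may be unavoidable, and when it happens we need to keep all the other words in the string from being counted more than once (once for "pensions" and once for "pens" + "ions"). We do this by tracking the words found for each domain name in a set (sets ignore duplicates) and counting the words only after all of them have been found.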
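
The effect of that per-domain set can be shown with a small sketch. The two splits are written out by hand for illustration, and `collections.Counter` is used here in place of the answer's plain dict (Python 3):

```python
from collections import Counter

# Two valid splits of the same (illustrative) domain name.
splits = [["example", "pensions"], ["example", "pens", "ions"]]

naive = Counter()
for split in splits:
    naive.update(split)      # counts "example" once per split

found = set()
for split in splits:
    found |= set(split)      # the set keeps each word only once
per_domain = Counter(found)

print(naive["example"], per_domain["example"])
```

The naive count credits "example" twice for a single domain, while the set-based count credits it once.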

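EDIT: Restructured and added lots of comments. Forced strings to lower case to avoid misses caused by capitalization. Also added a list to keep track of domain names for which no combination of words matched.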

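NECROMANCY EDIT: Changed the substring function so that it scales better. The old version became very slow for domain names longer than 16 characters or so. Using just the four domain names above, I improved my own running time from 3.6 seconds to 0.2 seconds!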
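
The question mentions a trie. As a rough sketch only (this is not the method used in the answer), a trie built from nested dicts could replace the repeated `s[:i] in words` slice lookups by walking the name once and reporting every prefix that is a word. `build_trie`, `prefixes`, `split_with_trie`, and the `END` marker are hypothetical names, and the code is written for Python 3.

```python
END = object()  # marker key meaning "a word ends at this node"


def build_trie(words):
    """Build a nested-dict trie: one dict level per character."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def prefixes(trie, s):
    """Yield each i such that s[:i] is a word stored in the trie."""
    node = trie
    for i, ch in enumerate(s, 1):
        node = node.get(ch)
        if node is None:
            return  # no stored word continues with this character
        if END in node:
            yield i


def split_with_trie(trie, s):
    """Yield every split of s into words from the trie."""
    if not s:
        yield []
        return
    for i in prefixes(trie, s):
        for rest in split_with_trie(trie, s[i:]):
            yield [s[:i]] + rest


trie = build_trie(["example", "car", "cart", "trading", "pens", "ions", "pensions"])
print(list(split_with_trie(trie, "examplepensions")))
```

Whether this is faster than the set-based version for a few thousand short domain names is not something the thread establishes; it mainly shows what the trie suggestion in the question could look like.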