在列表中查找最少公共元素

Question

在列表中查找最少公共元素

5

我希望能生成一个有序列表，列出大量文本中最不常见的单词，最不常见的单词首先出现，并附带一个值来表示它在文本中出现的次数。

我从一些在线期刊文章中获取了文本，然后进行了简单的赋值和拆分操作。

article_one = """ large body of text """.split() 
=> ("large","body", "of", "text")

看起来正则表达式是下一步最合适的，但作为一个新手程序员，我并不熟悉它- 如果最佳答案包括正则表达式，能否有人指点我一下除了pydoc之外的好的正则表达式教程？

- Benjamin James

你是在寻找类似于“最不常见的10个单词”这样的东西，还是类似于“所有出现次数少于5次的单词”（可能在一次运行中为3，在另一次运行中为69105）的东西？我问这个问题是因为heapq.nsmallest 可能是前者的最佳选择，但堆对于后者可能不是那么好。 - abarnert

5个回答

3

在列表中查找最小公共元素。根据Collections模块中的计数器类。

c.most_common()[:-n-1:-1]       # n least common elements

所以，列表中最小公共元素的代码为：

from collections import Counter
Counter( mylist ).most_common()[:-2:-1]

两个最不常见的元素是

from collections import Counter
Counter( mylist ).most_common()[:-3:-1]

python-3.x

- dwalsh84

1

这里采用了稍微不同的方法，但似乎适合您的需求。使用了此答案中的代码。

#!/usr/bin/env python
import operator
import string

article_one = """A, a b, a b c, a b c d, a b c d efg.""".split()
wordbank = {}

for word in article_one:
    # Strip word of punctuation and capitalization
    word = word.lower().strip(string.punctuation)
    if word not in wordbank:
        # Create a new dict key if necessary
        wordbank[word] = 1
    else:
        # Otherwise, increment the existing key's value
        wordbank[word] += 1

# Sort dict by value
sortedwords = sorted(wordbank.iteritems(), key=operator.itemgetter(1))

for word in sortedwords:
    print word[1], word[0]

输出：

1 efg
2 d
3 c
4 b
5 a

适用于Python版本大于等于2.4，如果在底部括起来print语句并将iteritems更改为items，也适用于Python 3+。

- ashastral

我相信你可能想做的是 article_one.split()，现在你正在计算字母而不是单词 :) - Wolph

WoLpH：article_one 在创建时已经被 .split() 分割了。 - ashastral

我错过了那部分，只是看了输出结果。我改正我的错误 :) - Wolph

0

母舰上的现成答案。

# From the official documentation ->>
>>> # Tally occurrences of words in a list
>>> cnt = Counter()
>>> for word in ['red', 'blue', 'red', 'green', 'blue', 'blue']:
...     cnt[word] += 1
>>> cnt
Counter({'blue': 3, 'red': 2, 'green': 1})
## ^^^^--- from the standard documentation.

>>> # Find the ten most common words in Hamlet
>>> import re
>>> words = re.findall('\w+', open('hamlet.txt').read().lower())
>>> Counter(words).most_common(10)
[('the', 1143), ('and', 966), ('to', 762), ('of', 669), ('i', 631),
 ('you', 554),  ('a', 546), ('my', 514), ('hamlet', 471), ('in', 451)]

>>> def least_common(adict, n=None):
.....:       if n is None:
.....:               return sorted(adict.iteritems(), key=itemgetter(1), reverse=False)
.....:       return heapq.nsmallest(n, adict.iteritems(), key=itemgetter(1))

显然要适应:D

- sotapme

这需要Python 2.7，如果你必须使用<2.7，请参考此帖子。https://dev59.com/KXI95IYBdhLWcg3wxA8- - Brian Larsen

如果你查看most_common的source，很容易看出如何为least_common进行适应。(我相信他们使用了heapq) - mgilson

3

简化版：collections.Counter(article_one.split()) 的意思是统计文章中每个单词出现的次数。 - Wolph

我将从第一个示例中的字典中提取最少出现的内容。谢谢大家。 - Benjamin James

@mgilson 刚看到这个; c.most_common()[:-n:-1] # n个最不常见的元素 - Benjamin James

显示剩余5条评论

0

如果您需要固定数量的最不常见单词，例如最不常见的10个单词，您可能需要使用计数器dict和heapq来解决，正如sotapme的答案（包括WoLpH的建议）或WoLpH的答案所建议的那样。

wordcounter = collections.Counter(article_one)
leastcommon = word counter.nsmallest(10)

然而，如果你需要一个无限数量的它们，例如，所有出现次数少于5次的单词，这可能在一次运行中是6，在下一次运行中是69105，那么你最好只是对列表进行排序：

wordcounter = collections.Counter(article_one)
allwords = sorted(wordcounter.items(), key=operator.itemgetter(1))
leastcommon = itertools.takewhile(lambda x: x[1] < 5, allwords)

排序比堆化花费更长的时间，但使用list提取前M个元素要比使用heap快得多。从算法上来说，它们之间的差异仅表现为一些log N因子，所以常数在这里非常重要。所以最好的方法是测试。

通过使用pastebin上的代码，以及通过执行cat reut2* >reut2.sgm创建的文件来对Reuters-21578语料库进行基准测试（没有处理文本提取，因此对于严肃的工作显然不是很好，但应该适合进行基准测试，因为SGML标记中没有最不常见的内容...）：

$ python leastwords.py reut2.sgm # Apple 2.7.2 64-bit
heap: 32.5963380337
sort: 22.9287009239
$ python3 leastwords.py reut2.sgm # python.org 3.3.0 64-bit
heap: 32.47026552911848
sort: 25.855643508024514
$ pypy leastwords.py reut2.sgm # 1.9.0/2.7.2 64-bit
heap: 23.95291996
sort: 16.1843900681

我尝试了各种方法来加速它们（包括：在生成器表达式周围使用takewhile而不是在堆版本中使用yield循环，使用nsmallest弹出乐观批次并且丢弃任何多余的元素，制作一个list并就地排序，装饰排序撤消而不是关键字，使用partial代替lambda等），但是它们中没有一种改进了超过5％（有些甚至使事情显着变慢）。

无论如何，这些都比我预期的更接近，所以我可能会选择更简单和更易读的那个。但是我认为排序在这方面击败了堆，因此...

再次强调：如果您只需要最不常见的N个元素，则对于合理的N，我愿意打赌，即使没有测试，堆实现也会获胜。

- abarnert

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- Wolph · Accepted Answer

使用defaultdict可以让代码更加简洁易懂，相比之下Counter需要Python 2.7及以上的版本才可用。以下是适用于Python 2.5及以上版本的代码：

import collections

counter = collections.defaultdict(int)
article_one = """ large body of text """

for word in article_one.split():
    counter[word] += 1

print sorted(counter.iteritems(), key=lambda x: x[::-1])