How to count the words in a corpus document

Question

4

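I would like to know the best way to count words in a document. If I have set up my own "corp.txt" corpus and want to know how frequently "students, trust, ayre" occur in the file "corp.txt", what should I use?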

Would it be something like the following:

....
>>> full = nltk.Text(mycorpus.words('FullReport.txt'))
>>> fdist = FreqDist(full)
>>> fdist
<FreqDist with 34133 outcomes>
# How would I calculate how frequently the words
# "students, trust, ayre" occur in full?

Thanks, Ray.

- Ray Hmar

1

Neither of those is provided by the standard Python library. Are you sure you're not thinking of NLTK? - Chris Eberle

Seeing your name, I'll pretend you know what "students trust Ayre" means. Anyway, I would go with FreqDist. fdist = FreqDist(); for word in tokenize.whitespace(sent): fdist.inc(word.lower()). You can check the documentation here (http://nltk.googlecode.com/svn/trunk/doc/api/nltk.probability.FreqDist-class.html). - aayoubi

I have edited the answer, please double-check it for me. Thanks. - Ray Hmar

Possible duplicate of "How to optimize word counting in Python?" - alvas

4 Answers

4

Most people would just use a defaultdict (with a default value of 0). Every time you see a word, increment its value by one:

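from collections import defaultdict

# `words` is assumed to be any iterable of tokens, e.g. mycorpus.words('corp.txt')
total = 0
count = defaultdict(lambda: 0)  # words not seen yet start at 0
for word in words:
    total += 1
    count[word] += 1

# Now you can determine the frequency by dividing each count by total
for word, ct in count.items():
    print('Frequency of %s: %f%%' % (word, 100.0 * float(ct) / float(total)))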

- Chris Eberle

You mean defaultdict(int) -- defaultdict needs a callable. - kindall

@Chris What about using Counter? - alvas

4

You're almost there! You can index the FreqDist with the words you are interested in. Try the following:

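print(fdist['students'])
print(fdist['trust'])
print(fdist['ayre'])

This gives the count, that is, the number of occurrences, of each word. You said "how frequently"; frequency is not the same as the number of occurrences, and you can get it like this:

print(fdist.freq('students'))
print(fdist.freq('trust'))
print(fdist.freq('ayre'))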
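
To make the difference between a count and a relative frequency concrete, here is a minimal sketch that uses only collections.Counter instead of NLTK, so it runs without a corpus. The token list is made up for illustration, and relative_freq is a hypothetical helper that mirrors what fdist.freq() reports: the count of a word divided by the total number of tokens.

```python
from collections import Counter

# Made-up tokens standing in for the words of the corpus (an assumption).
tokens = ["students", "trust", "ayre", "students", "the", "students", "trust", "the"]

counts = Counter(tokens)
total = sum(counts.values())  # total number of tokens seen


def relative_freq(word):
    """Hypothetical helper: count of `word` divided by the total token count."""
    return counts[word] / total if total else 0.0


for word in ("students", "trust", "ayre"):
    print(word, counts[word], relative_freq(word))
```

A word that never occurs has a count of 0 and therefore a relative frequency of 0.0.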

- Spaceghost

1

You can read a file, tokenize it, and put the individual tokens into a FreqDist object in NLTK. See http://nltk.googlecode.com/svn/trunk/doc/api/nltk.probability.FreqDist-class.html.

from nltk.probability import FreqDist
from nltk import word_tokenize  # needs NLTK's Punkt tokenizer data to be installed

# Creates a test file for reading.
doc = "this is a blah blah foo bar black sheep sentence. Blah blah!"
with open('test.txt', 'w') as fout:
    fout.write(doc)

# Reads a file into a FreqDist object.
fdist = FreqDist()
with open('test.txt', 'r') as fin:
    for word in word_tokenize(fin.read()):
        fdist[word] += 1  # FreqDist.inc() was removed in NLTK 3

print("'blah' occurred", fdist['blah'], "times")

[out]:

'blah' occurred 3 times

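Alternatively, you can use the native Counter object from collections to get the same counts; see https://docs.python.org/2/library/collections.html. Note that the keys in a FreqDist or Counter object are case-sensitive, so you may also want to convert the tokens to lowercase: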

from collections import Counter
from nltk import word_tokenize

# Creates a test file for reading.
doc = "this is a blah blah foo bar black sheep sentence. Blah blah!"
with open('test.txt', 'w') as fout:
    fout.write(doc)

# Reads a file into a Counter object, lowercasing the text first.
fdist = Counter()
with open('test.txt', 'r') as fin:
    fdist.update(word_tokenize(fin.read().lower()))

print("'blah' occurred", fdist['blah'], "times")
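
If only a handful of words matter, as in the question, the counts can be read off for just those words. The following is a minimal sketch, assuming a simple regular-expression tokenizer in place of word_tokenize (it will split contractions and punctuation differently from NLTK); count_targets is a hypothetical helper name, not part of NLTK or collections.

```python
import re
from collections import Counter


def count_targets(text, targets):
    """Return {word: count} for each target word, ignoring case."""
    # \w+ is a crude stand-in for a real tokenizer (an assumption).
    counts = Counter(re.findall(r"\w+", text.lower()))
    return {word: counts[word.lower()] for word in targets}


doc = "this is a blah blah foo bar black sheep sentence. Blah blah!"
print(count_targets(doc, ["blah", "sheep", "ayre"]))
```

Because the text is lowercased before counting, "Blah" and "blah" are counted together, and a word that never appears simply gets 0.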

- alvas

- Lars GJ · Accepted Answer

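I suggest you look into collections.Counter. Especially for large amounts of text it does the job, and it is limited only by the available memory. It counted 30 billion tokens in a day and a half on a computer with 12 GB of RAM. Pseudocode (in practice the variable Words would be some reference to a file or similar):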

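from collections import Counter

# Words stands in for any iterable of tokens, e.g. tokens read from a file.
Words = "the quick brown fox jumps over the lazy dog".split()

my_counter = Counter()
for word in Words:
    my_counter[word] += 1  # update(word) would count the characters of each word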

When it is done, the words are stored in a dictionary called my_counter, which can then be written to disk or stored elsewhere (for example in sqlite).
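
As an illustration of the last step, here is a minimal sketch of storing the counts with the standard-library sqlite3 module. The table name word_counts, its column names, and the helper save_counts are assumptions made for this example, and the small Counter is sample data.

```python
import sqlite3
from collections import Counter


def save_counts(counter, conn):
    """Write (word, count) pairs into a table named word_counts."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS word_counts (word TEXT PRIMARY KEY, n INTEGER)"
    )
    # INSERT OR REPLACE overwrites the row if the word is already stored.
    conn.executemany(
        "INSERT OR REPLACE INTO word_counts (word, n) VALUES (?, ?)",
        counter.items(),
    )
    conn.commit()


my_counter = Counter("the quick brown fox jumps over the lazy dog".split())
conn = sqlite3.connect(":memory:")  # pass a file path instead to persist to disk
save_counts(my_counter, conn)
row = conn.execute("SELECT n FROM word_counts WHERE word = ?", ("the",)).fetchone()
print(row[0])
```

Once the counts are in a table, a single word can be looked up with a query instead of loading the whole dictionary back into memory.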