有没有更高效的方法来查找最常见的n-grams？

Question

有没有更高效的方法来查找最常见的n-grams？

13

我正在尝试从一个大语料库中找出前 k 个最常见的 n-grams。我看到很多地方都建议使用朴素的方法 - 简单地扫描整个语料库并保留所有 n-grams 的计数字典。有没有更好的方法来做这件事？

- bendl

可能是 http://cs.stackexchange.com/questions/8972/optimal-algorithm-for-finding-all-ngrams-from-a-pre-defined-set-in-a-text 的重复问题。 - pltrdy

你在对比什么呢？语料库有多大？我认为你可以很快地用C++来计算庞大的语料库中ngram的数量，即使在Python中也非常快速 =) - alvas

你是指字符ngram还是词ngram？ - alvas

我正在使用单词ngram，但我想字符ngram也可以泛化。至于语料库，这应该能够适应大约20GB左右的语料库，并在Hadoop集群上运行。 - bendl

1个回答

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- alvas · Accepted Answer

在Python中使用NLTK：

$ wget http://norvig.com/big.txt
$ python
>>> from collections import Counter
>>> from nltk import ngrams
>>> bigtxt = open('big.txt').read()
>>> ngram_counts = Counter(ngrams(bigtxt.split(), 2))
>>> ngram_counts.most_common(10)
[(('of', 'the'), 12422), (('in', 'the'), 5741), (('to', 'the'), 4333), (('and', 'the'), 3065), (('on', 'the'), 2214), (('at', 'the'), 1915), (('by', 'the'), 1863), (('from', 'the'), 1754), (('of', 'a'), 1700), (('with', 'the'), 1656)]

在Python中，原生支持N-gram算法（参见Python中快速/优化的N-gram实现）：

>>> import collections
>>> def ngrams(text, n=2):
...     return zip(*[text[i:] for i in range(n)])
>>> ngram_counts = collections.Counter(ngrams(bigtxt.split(), 2))
>>> ngram_counts.most_common(10)
    [(('of', 'the'), 12422), (('in', 'the'), 5741), (('to', 'the'), 4333), (('and', 'the'), 3065), (('on', 'the'), 2214), (('at', 'the'), 1915), (('by', 'the'), 1863), (('from', 'the'), 1754), (('of', 'a'), 1700), (('with', 'the'), 1656)]

在Julia中，参见使用Julia生成ngram。

import StatsBase: countmap
import Iterators: partition
bigtxt = readstring(open("big.txt"))
ngram_counts = countmap(collect(partition(split(bigtxt), 2, 1)))

大致时间：

$ time python ngram-test.py # With NLTK.

real    0m3.166s
user    0m2.274s
sys 0m0.528s

$ time python ngram-native-test.py 

real    0m1.521s
user    0m1.317s
sys 0m0.145s

$ time julia ngram-test.jl 

real    0m3.573s
user    0m3.188s
sys 0m0.306s