NLTK - Counting bigram frequencies

Question

20

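This is a beginner question about Python and NLTK.

I want to find the bigrams that occur together more than 10 times and have the highest PMI.

For this, I am working with the following code: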

def get_list_phrases(text):

    tweet_phrases = []

    for tweet in text:
        tweet_words = tweet.split()
        tweet_phrases.extend(tweet_words)


    bigram_measures = nltk.collocations.BigramAssocMeasures()
    finder = BigramCollocationFinder.from_words(tweet_phrases,window_size = 13)
    finder.apply_freq_filter(10)
    finder.nbest(bigram_measures.pmi,20)  

    for k,v in finder.ngram_fd.items():
      print(k,v)

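However, this does not restrict the results to the top 20. I see results with a frequency < 10. I am new to the Python world.

Can someone point out how to modify the code to get only the top 20 results?

Thank you.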

- jainp

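What is your input? What is "text"? - alvas

I think he means the top 20 by PMI score. Right? See my explanation below. - alvas

Hi Alvas, yes, I mean the top 20 ranked by PMI score. I want to filter them by frequency first and then find the 20 with the highest PMI. - jainp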

1

@user823743 Hi, I would like to see how to solve it. - jainp

3

@jainp Hi, did you read my answer? Does it answer your question? It filters by collocation frequency and ranks the results by PMI (Pointwise Mutual Information), as you wanted. - user823743

2 Answers

-2

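Please go through the tutorial at http://nltk.googlecode.com/svn/trunk/doc/howto/collocations.html for more on using the collocation functions in NLTK, and see https://en.wikipedia.org/wiki/Pointwise_mutual_information for the math. Your question does not specify what the input is, so I hope the following script helps.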

# This is just a fancy way to create document. 
# I assume you have your texts in a continuous string format
# where each sentence ends with a fullstop.
>>> from itertools import chain
>>> docs = ["this is a sentence", "this is a foo bar", "you are a foo bar", "yes , i am"]
>>> texts = list(chain(*[(j+" .").split() for j in [i for i in docs]]))

# This is the NLTK part
>>> from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder
>>> bigram_measures = BigramAssocMeasures()
>>> finder = BigramCollocationFinder.from_words(texts)
# This gets the top 20 bigrams according to PMI
>>> finder.nbest(bigram_measures.pmi,20)
[(',', 'i'), ('i', 'am'), ('yes', ','), ('you', 'are'), ('foo', 'bar'), ('this', 'is'), ('a', 'foo'), ('is', 'a'), ('a', 'sentence'), ('are', 'a'), ('bar', '.'), ('.', 'yes'), ('.', 'you'), ('am', '.'), ('sentence', '.'), ('.', 'this')]

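PMI measures the association between two words by computing log(p(x|y) / p(x)), so it is not only about how often a word occurs or how often a pair of words occurs together. To get a high PMI you need both:

a high p(x|y)
a low p(x)

Here are some extreme PMI examples.

Suppose the corpus has 100 words. If a word X has a frequency of 1 and occurs exactly once together with another word Y, then: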

p(x|y) = 1
p(x) = 1/100
PMI = log(1 / (1/100)) = log 100 = 2   (base-10 logarithm)

Suppose the corpus has 100 words. If a word has a frequency of 90 but never occurs together with another word Y, then the PMI is:

p(x|y) = 0
p(x) = 90/100
PMI = log(0 / (90/100)) = log 0 = -infinity

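So in that sense, the PMI between X and Y in the first scenario is much higher than in the second, even though the word in the second scenario has a very high frequency.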
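The two scenarios above can be checked with a few lines of plain Python. This is only a minimal sketch: the helper name `pmi` and the variable names are made up for illustration, it uses base-10 logarithms to match the arithmetic above, and it treats a zero conditional probability as negative infinity by convention (Python's `math.log10` would otherwise raise an error on 0).

```python
import math

def pmi(p_x_given_y, p_x):
    """PMI as log10(p(x|y) / p(x)); hypothetical helper for illustration."""
    if p_x_given_y == 0:
        # log(0) is undefined in Python, so return -infinity by convention
        return float("-inf")
    return math.log10(p_x_given_y / p_x)

# Scenario 1: X occurs once in 100 words and always together with Y
rare_pair = pmi(1.0, 1 / 100)

# Scenario 2: the word occurs 90 times in 100 words but never with Y
never_together = pmi(0.0, 90 / 100)

print(rare_pair, never_together)
```

The rare pair scores higher than the frequent word that never co-occurs with Y, which is the point of the comparison.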

- alvas

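My text consists of lines separated by periods. What I want to do is find the bigrams that occur together 10 or more times, and then filter that result based on PMI. I can do each step separately, but my question is how to tie them together. - jainp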

- user823743 · Accepted Answer

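The problem is with the way you are trying to use apply_freq_filter. We are discussing word collocations, and as you know, collocations are about dependencies between words. The BigramCollocationFinder class inherits from a class named AbstractCollocationFinder, and the function apply_freq_filter belongs to that class. apply_freq_filter is not supposed to delete word collocations outright; it provides a filtered list of collocations when other functions access the list.

Why is that? Imagine that filtering collocations simply deleted them. Many probability measures, such as the likelihood ratio or PMI itself, would no longer work properly once words had been deleted from arbitrary positions in the corpus. Deleting collocations from the given list of words would disable many potential functions and computations. Moreover, computing all of these measures before the deletion would bring a large computational overhead that the user might not need after all.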
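To make that reasoning concrete, here is a rough plain-Python sketch, not NLTK's actual implementation. The function `pmi_scores` and the toy token list are invented for illustration, and only adjacent pairs are counted. The frequency threshold hides rare pairs from the result, while the word counts and the corpus size still come from the full, unmodified token list.

```python
import math
from collections import Counter

def pmi_scores(tokens, min_freq=1):
    """Score adjacent bigrams by PMI, skipping pairs rarer than min_freq."""
    word_counts = Counter(tokens)                    # counted over the full corpus
    pair_counts = Counter(zip(tokens, tokens[1:]))   # adjacent pairs only
    total = len(tokens)
    scores = {}
    for (w1, w2), n_pair in pair_counts.items():
        if n_pair < min_freq:
            continue  # the pair is left out of the results; no word is deleted
        # PMI = log2( p(w1, w2) / (p(w1) * p(w2)) ), estimated from raw counts
        scores[(w1, w2)] = math.log2(
            n_pair * total / (word_counts[w1] * word_counts[w2])
        )
    return scores

tokens = "a b a b a c".split()
all_scores = pmi_scores(tokens)               # every adjacent pair
frequent = pmi_scores(tokens, min_freq=2)     # only pairs seen at least twice
print(frequent)
```

Because the counts are never altered, a pair that survives the threshold gets the same score with or without filtering. Deleting the rare words from `tokens` first would change both `total` and the set of adjacent pairs, and with them the scores.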

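So the question is how to use apply_freq_filter correctly. There are several ways; in what follows I show the problem and its solution.

Let's define a sample corpus and split it into a list of words, similar to what you have done: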

tweet_phrases = "I love iphone . I am so in love with iphone . iphone is great . samsung is great . iphone sucks. I really really love iphone cases. samsung can never beat iphone . samsung is better than apple"
from nltk.collocations import *
import nltk

For the purposes of this experiment I set the window size to 3:

finder = BigramCollocationFinder.from_words(tweet_phrases.split(), window_size = 3)
finder1 = BigramCollocationFinder.from_words(tweet_phrases.split(), window_size = 3)

Note that, for comparison, I apply the filter only to finder1:

finder1.apply_freq_filter(2)
bigram_measures = nltk.collocations.BigramAssocMeasures()

Now if I write:

for k,v in finder.ngram_fd.items():
  print(k,v)

The output is:

(('.', 'is'), 3)
(('iphone', '.'), 3)
(('love', 'iphone'), 3)
(('.', 'iphone'), 2)
(('.', 'samsung'), 2)
(('great', '.'), 2)
(('iphone', 'I'), 2)
(('iphone', 'samsung'), 2)
(('is', '.'), 2)
(('is', 'great'), 2)
(('samsung', 'is'), 2)
(('.', 'I'), 1)
(('.', 'am'), 1)
(('.', 'sucks.'), 1)
(('I', 'am'), 1)
(('I', 'iphone'), 1)
(('I', 'love'), 1)
(('I', 'really'), 1)
(('I', 'so'), 1)
(('am', 'in'), 1)
(('am', 'so'), 1)
(('beat', '.'), 1)
(('beat', 'iphone'), 1)
(('better', 'apple'), 1)
(('better', 'than'), 1)
(('can', 'beat'), 1)
(('can', 'never'), 1)
(('cases.', 'can'), 1)
(('cases.', 'samsung'), 1)
(('great', 'iphone'), 1)
(('great', 'samsung'), 1)
(('in', 'love'), 1)
(('in', 'with'), 1)
(('iphone', 'cases.'), 1)
(('iphone', 'great'), 1)
(('iphone', 'is'), 1)
(('iphone', 'sucks.'), 1)
(('is', 'better'), 1)
(('is', 'than'), 1)
(('love', '.'), 1)
(('love', 'cases.'), 1)
(('love', 'with'), 1)
(('never', 'beat'), 1)
(('never', 'iphone'), 1)
(('really', 'iphone'), 1)
(('really', 'love'), 1)
(('samsung', 'better'), 1)
(('samsung', 'can'), 1)
(('samsung', 'great'), 1)
(('samsung', 'never'), 1)
(('so', 'in'), 1)
(('so', 'love'), 1)
(('sucks.', 'I'), 1)
(('sucks.', 'really'), 1)
(('than', 'apple'), 1)
(('with', '.'), 1)
(('with', 'iphone'), 1)

If I write the same code for finder1 and run it, I get the same result, so at first glance the filter does not seem to work. But it does work: the trick is to use score_ngrams.

If I use score_ngrams on finder:

finder.score_ngrams(bigram_measures.pmi)

the output is:

[(('am', 'in'), 5.285402218862249), (('am', 'so'), 5.285402218862249), (('better', 'apple'), 5.285402218862249), (('better', 'than'), 5.285402218862249), (('can', 'beat'), 5.285402218862249), (('can', 'never'), 5.285402218862249), (('cases.', 'can'), 5.285402218862249), (('in', 'with'), 5.285402218862249), (('never', 'beat'), 5.285402218862249), (('so', 'in'), 5.285402218862249), (('than', 'apple'), 5.285402218862249), (('sucks.', 'really'), 4.285402218862249), (('is', 'great'), 3.7004397181410926), (('I', 'am'), 3.7004397181410926), (('I', 'so'), 3.7004397181410926), (('cases.', 'samsung'), 3.7004397181410926), (('in', 'love'), 3.7004397181410926), (('is', 'better'), 3.7004397181410926), (('is', 'than'), 3.7004397181410926), (('love', 'cases.'), 3.7004397181410926), (('love', 'with'), 3.7004397181410926), (('samsung', 'better'), 3.7004397181410926), (('samsung', 'can'), 3.7004397181410926), (('samsung', 'never'), 3.7004397181410926), (('so', 'love'), 3.7004397181410926), (('sucks.', 'I'), 3.7004397181410926), (('samsung', 'is'), 3.115477217419936), (('.', 'am'), 2.9634741239748865), (('.', 'sucks.'), 2.9634741239748865), (('beat', '.'), 2.9634741239748865), (('with', '.'), 2.9634741239748865), (('.', 'is'), 2.963474123974886), (('great', '.'), 2.963474123974886), (('love', 'iphone'), 2.7004397181410926), (('I', 'really'), 2.7004397181410926), (('beat', 'iphone'), 2.7004397181410926), (('great', 'samsung'), 2.7004397181410926), (('iphone', 'cases.'), 2.7004397181410926), (('iphone', 'sucks.'), 2.7004397181410926), (('never', 'iphone'), 2.7004397181410926), (('really', 'love'), 2.7004397181410926), (('samsung', 'great'), 2.7004397181410926), (('with', 'iphone'), 2.7004397181410926), (('.', 'samsung'), 2.37851162325373), (('is', '.'), 2.37851162325373), (('iphone', 'I'), 2.1154772174199366), (('iphone', 'samsung'), 2.1154772174199366), (('I', 'love'), 2.115477217419936), (('iphone', '.'), 1.963474123974886), (('great', 'iphone'), 1.7004397181410922), (('iphone', 
'great'), 1.7004397181410922), (('really', 'iphone'), 1.7004397181410922), (('.', 'iphone'), 1.37851162325373), (('.', 'I'), 1.37851162325373), (('love', '.'), 1.37851162325373), (('I', 'iphone'), 1.1154772174199366), (('iphone', 'is'), 1.1154772174199366)]

Now notice what happens when I compute the same for finder1, which was filtered to a minimum frequency of 2:

finder1.score_ngrams(bigram_measures.pmi)

And the output:

[(('is', 'great'), 3.7004397181410926), (('samsung', 'is'), 3.115477217419936), (('.', 'is'), 2.963474123974886), (('great', '.'), 2.963474123974886), (('love', 'iphone'), 2.7004397181410926), (('.', 'samsung'), 2.37851162325373), (('is', '.'), 2.37851162325373), (('iphone', 'I'), 2.1154772174199366), (('iphone', 'samsung'), 2.1154772174199366), (('iphone', '.'), 1.963474123974886), (('.', 'iphone'), 1.37851162325373)]

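Notice that all collocations with a frequency of less than 2 are absent from this list; this is exactly the result you were looking for. So the filter has worked. Also, the documentation gives only a minimal hint about this issue.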
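Putting the two steps together for the original question, a minimal sketch could look like the following. The wrapper `top_pmi_bigrams`, its parameter names, and the shortened sample string are my own choices for illustration; the NLTK calls are the same ones used above (from_words, apply_freq_filter, nbest). The key differences from the code in the question are that the return value of nbest is kept, and the results are read from it instead of from ngram_fd.

```python
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

def top_pmi_bigrams(words, min_freq=10, top_n=20, window_size=2):
    """Return up to top_n bigrams seen at least min_freq times, best PMI first."""
    finder = BigramCollocationFinder.from_words(words, window_size=window_size)
    finder.apply_freq_filter(min_freq)        # ignore bigrams rarer than min_freq
    bigram_measures = BigramAssocMeasures()
    # nbest returns the ranked list; it has to be assigned or returned to be used
    return finder.nbest(bigram_measures.pmi, top_n)

sample = ("I love iphone . I am so in love with iphone . iphone is great . "
          "samsung is great . samsung is better than apple")
words = sample.split()
top = top_pmi_bigrams(words, min_freq=2, top_n=5, window_size=3)
print(top)
```

For the question as asked, the call would be top_pmi_bigrams(tweet_phrases, min_freq=10, top_n=20), with window_size chosen to suit the length of the tweets.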

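I hope this answers your question. Otherwise, please let me know.

Disclaimer: if you are dealing primarily with tweets, a window size of 13 is far too big. As you may have noticed, the tweets in my sample corpus are so short that a window size of 13 would find collocations that are irrelevant.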
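The effect of the window size can be seen with a small plain-Python sketch. It assumes, as I understand the behaviour, that a window pairs each word with the window_size - 1 words that follow it; the function `windowed_pairs` and the two toy tweets are made up for illustration and do not call NLTK.

```python
from collections import Counter

def windowed_pairs(tokens, window_size):
    """Count (first, second) pairs where second follows first within the window."""
    pairs = Counter()
    for i, first in enumerate(tokens):
        # pair the word with the next window_size - 1 tokens
        for second in tokens[i + 1 : i + window_size]:
            pairs[(first, second)] += 1
    return pairs

tweets = ["iphone is great", "samsung is better than apple"]
tokens = " . ".join(tweets).split()   # tweets joined into one token stream

narrow = windowed_pairs(tokens, 2)    # adjacent words only
wide = windowed_pairs(tokens, 13)     # window longer than either tweet

print(len(narrow), len(wide))
print(("iphone", "apple") in narrow, ("iphone", "apple") in wide)
```

With the wide window, the first word of one tweet gets paired with the last word of the next, which is the kind of irrelevant collocation the disclaimer warns about.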