Problems in getting trigrams using Gensim

I want to extract bigrams and trigrams from the example sentences I have mentioned.
My code works fine for bigrams. However, it fails to capture trigrams in the data (for example, human computer interaction, which is mentioned 5 times in my sentences).
Below is my code using Phrases from Gensim, as in approach 1 described here.
from gensim.models import Phrases
documents = ["the mayor of new york was there", "human computer interaction and machine learning has now become a trending research area","human computer interaction is interesting","human computer interaction is a pretty interesting subject", "human computer interaction is a great and new subject", "machine learning can be useful sometimes","new york mayor was present", "I love machine learning because it is a new subject area", "human computer interaction helps people to get user friendly applications"]
sentence_stream = [doc.split(" ") for doc in documents]

bigram = Phrases(sentence_stream, min_count=1, threshold=1, delimiter=b' ')
trigram = Phrases(bigram[sentence_stream])

for sent in sentence_stream:
    bigrams_ = bigram[sent]
    trigrams_ = trigram[bigrams_]

    print(bigrams_)
    print(trigrams_)

Approach 2: I even tried using Phraser together with Phrases, but it did not work.

from gensim.models import Phrases
from gensim.models.phrases import Phraser
documents = ["the mayor of new york was there", "human computer interaction and machine learning has now become a trending research area","human computer interaction is interesting","human computer interaction is a pretty interesting subject", "human computer interaction is a great and new subject", "machine learning can be useful sometimes","new york mayor was present", "I love machine learning because it is a new subject area", "human computer interaction helps people to get user friendly applications"]
sentence_stream = [doc.split(" ") for doc in documents]

bigram = Phrases(sentence_stream, min_count=1, threshold=2, delimiter=b' ')
bigram_phraser = Phraser(bigram)
trigram = Phrases(bigram_phraser[sentence_stream])

for sent in sentence_stream:
    bigrams_ = bigram_phraser[sent]
    trigrams_ = trigram[bigrams_]

    print(bigrams_)
    print(trigrams_)

Please help me fix this issue with getting trigrams.
I am following the example documentation of Gensim.
1 Answer

I was able to get both bigrams and trigrams by making a few modifications to your code:

from gensim.models import Phrases
documents = ["the mayor of new york was there", "human computer interaction and machine learning has now become a trending research area","human computer interaction is interesting","human computer interaction is a pretty interesting subject", "human computer interaction is a great and new subject", "machine learning can be useful sometimes","new york mayor was present", "I love machine learning because it is a new subject area", "human computer interaction helps people to get user friendly applications"]
sentence_stream = [doc.split(" ") for doc in documents]

bigram = Phrases(sentence_stream, min_count=1, delimiter=b' ')
trigram = Phrases(bigram[sentence_stream], min_count=1, delimiter=b' ')

for sent in sentence_stream:
    bigrams_ = [b for b in bigram[sent] if b.count(' ') == 1]
    trigrams_ = [t for t in trigram[bigram[sent]] if t.count(' ') == 2]

    print(bigrams_)
    print(trigrams_)

I removed the threshold=1 parameter from the bigram Phrases, because otherwise it seemed to form weird bigrams, which in turn allowed weird trigrams to be built (note that bigram is used to build the trigram Phrases); that parameter may become useful when you have more data. For the trigrams it is also necessary to specify the min_count parameter, since it defaults to 5 when not provided.
To retrieve the bigrams and trigrams for each document, you can use this list-comprehension trick to filter the elements composed of exactly two or three words, respectively.
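The filtering trick above can be sketched in plain Python; the token list here is a hypothetical phrased sentence (not an actual output of the model), with phrases joined by ' ' as in the answer's delimiter setting:

```python
# Hypothetical result of trigram[bigram[sent]] for one sentence;
# detected phrases keep ' ' as the internal delimiter.
tokens = ['human computer interaction', 'is', 'a', 'new subject']

bigrams_ = [t for t in tokens if t.count(' ') == 1]   # exactly two words
trigrams_ = [t for t in tokens if t.count(' ') == 2]  # exactly three words

print(bigrams_)   # ['new subject']
print(trigrams_)  # ['human computer interaction']
```

Counting spaces works only because the delimiter was set to a space; with the default '_' delimiter you would count underscores instead.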
EDIT - some details about the threshold parameter:
The parameter is used by the estimator to decide that two words a and b form a phrase only if:
(count(a followed by b) - min_count) * N/(count(a) * count(b)) > threshold

where N is the total vocabulary size. By default the parameter value is 10 (see the docs). So the higher the threshold, the harder it is for two words to form a phrase.
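To make the formula concrete, here is a minimal sketch of that default scorer; the counts plugged in below are illustrative assumptions, not values measured from the corpus:

```python
def phrase_score(count_ab, count_a, count_b, min_count, vocab_size):
    """Default phrase score: (count(a followed by b) - min_count) * N / (count(a) * count(b))."""
    return (count_ab - min_count) * vocab_size / (count_a * count_b)

# Assumed counts: 'human' and 'computer' each appear 6 times, always
# together, in a vocabulary of 40 unique words, with min_count=1.
score = phrase_score(count_ab=6, count_a=6, count_b=6, min_count=1, vocab_size=40)
print(round(score, 2))  # 5.56 -> passes threshold=1, filtered out by the default threshold=10
```

This shows why lowering the threshold admits more (and weirder) phrases: the same counts can land above one threshold and below another.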

For example, in your first approach you tried threshold=1, so for the 3 sentences starting with "human computer interaction is" you got ['human computer', 'interaction is'] as bigrams; the weird second bigram is the result of the looser threshold.

Then, when you tried to get the trigrams with the default threshold=10, for those three sentences you got only ['human computer interaction is'] and nothing for the remaining two (filtered out by the threshold); since that is a 4-gram rather than a trigram, it would also be filtered out by if t.count(' ') == 2. If, for example, you lower the trigram threshold to 1, you get ['human computer interaction'] as the trigram for the two remaining sentences. It seems hard to find a good combination of parameters; there is more info here.

I'm not an expert, so take this conclusion with a grain of salt: I think it is better to first get good bigram results (without weird bigrams like "interaction is") before moving on, because weird bigrams can add confusion to the trigrams, 4-grams, and so on built on top of them.


Thank you so much for your valuable answer. Cheers! :) By the way, could you tell me what happens with the threshold value? It is not very clear to me. - user8566323
You're welcome! Yes, I edited the answer; hopefully it is a bit clearer now. - stjernaluiht
Thanks a lot! I found your answer very useful :) - user8566323
It is not obvious in gensim that delimiter=b' ' has to be in binary format. Thanks for pointing that out. - Max
How do you use this with training and test data? It doesn't have fit and transform methods like the scikit-learn vectorizers. - user_12
