NLTK - Automatically translating similar words


Big-picture goal: I am building an LDA model of product reviews in Python using NLTK and Gensim. I want to run this model on different n-grams.

Problem: Running on bigrams works great compared to unigrams, but I start getting topics with repeated information. For example, topic 1 might contain: ['good product', 'good value'], and topic 4 might contain: ['great product', 'great value']. To a human these clearly convey the same information, but obviously 'good product' and 'great product' are different bigrams. How do I algorithmically determine that 'good product' and 'great product' are similar enough that I can convert all occurrences of one into the other (perhaps into whichever appears more often in the corpus)?

What I have tried: I played around with WordNet's Synset tree without much luck. It turns out that good is an "adjective" but great is an "adjective satellite", and therefore path similarity returns None. My thought process was (a rough sketch of it follows the list):

  1. POS-tag the sentence
  2. Use those POS tags to find the right Synsets
  3. Compute the similarity of the two Synsets
  4. If they are above some threshold, count the occurrences of the two words
  5. Replace the word that occurs less often with the one that occurs more often
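A minimal sketch of those steps with NLTK (Python 3; the punkt tokenizer, the perceptron tagger and the WordNet data are assumed to be downloaded). The toy sentence, the thresholding idea and the choice of Wu-Palmer similarity are mine for illustration, and the similarity call can still return None for adjective pairs, which is exactly the problem mentioned above:

from nltk import pos_tag, word_tokenize
from nltk.corpus import wordnet as wn

def penn_to_wn(tag):
    # Map a Penn Treebank tag to a WordNet POS constant (None if no match).
    return {'J': wn.ADJ, 'N': wn.NOUN, 'V': wn.VERB, 'R': wn.ADV}.get(tag[0])

def similarity(word1, word2, wn_pos):
    # Wu-Palmer similarity of the first synsets, or None when WordNet has no path.
    synsets1 = wn.synsets(word1, wn_pos)
    synsets2 = wn.synsets(word2, wn_pos)
    if synsets1 and synsets2:
        return synsets1[0].wup_similarity(synsets2[0])
    return None

# Steps 1-3: POS-tag a sentence, map the tags to WordNet POS, compare synsets.
tagged = pos_tag(word_tokenize("This is a great product"))
for word, tag in tagged:
    print(word, similarity(word, 'good', penn_to_wn(tag)))

# Steps 4-5 would then count corpus occurrences of any pair scoring above a
# threshold and rewrite the rarer word as the more frequent one.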

However, ideally I would like an algorithm that can determine that good and great are similar in my corpus (perhaps in a co-occurrence sense), so that it extends to words that are not common English words but do show up in my corpus, and so that it extends to n-grams (maybe Oracle and terrible are synonymous in my corpus, or feature engineering and feature creation are similar).
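Since Gensim is already in the stack, one corpus-driven way to approximate "similar in my corpus" is to train a small Word2Vec model on the tokenized reviews and query it for neighbours. This is only a sketch: the inline sentences stand in for the real tokenized corpus, the hyperparameters are placeholders, and Gensim 4.x naming (vector_size) is assumed (older versions call it size):

from gensim.models import Word2Vec

# `sentences` stands in for the tokenized review corpus, e.g. [['great', 'product'], ...]
sentences = [['great', 'product'], ['good', 'value'], ['good', 'product'], ['great', 'value']]

model = Word2Vec(sentences, vector_size=50, window=5, min_count=1, workers=1)

print(model.wv.similarity('good', 'great'))   # cosine similarity of the two word vectors
print(model.wv.most_similar('good', topn=3))  # nearest neighbours within this corpus

On a real corpus the neighbours reflect domain-specific usage, so corpus-specific pairs like the ones above can surface, and Gensim's Phrases model could be applied first so that frequent bigrams such as good_product become single tokens with vectors of their own.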

Any suggestions on algorithms, or on getting the WordNet synsets to behave?


These don't convey the same information to me. "Great" is stronger than "good". Also, "good value" means the product is attractively priced for its level of quality, while "good product" says the product is of high quality. The latest Mac Pro looks like a great product, but I wouldn't call it a good value. One approach is to ask whether replacing "great" with "good", or "value" with "product", actually changes some outcome of interest. - ChrisP
@ChrisP - I see your point. But here is an example of two topics: topic 1 - ['great service', 'good product', 'good price'], topic 2 - ['good service', 'high quality product', 'reasonable price']. Nobody would label those as two distinct topics, and it isn't practical when there are other things that could be treated as topics. 'Good product' and 'great product' both describe the product in a positive way, and you can see they are more similar from an ease-of-reference standpoint. - user2979931
@user2979931, did any of these answers answer your question? - alvas
2 Answers

If you are going to use WordNet, you will run into the following problems:
Problem 1: Word sense disambiguation (WSD), i.e. how do you automatically determine which synset to use?
>>> from nltk.corpus import wordnet as wn
>>> for i in wn.synsets('good','a'):
...     print i.name, i.definition
... 
good.a.01 having desirable or positive qualities especially those suitable for a thing specified
full.s.06 having the normally expected amount
good.a.03 morally admirable
estimable.s.02 deserving of esteem and respect
beneficial.s.01 promoting or enhancing well-being
good.s.06 agreeable or pleasing
good.s.07 of moral excellence
adept.s.01 having or showing knowledge and skill and aptitude
good.s.09 thorough
dear.s.02 with or in a close or intimate relationship
dependable.s.04 financially sound
good.s.12 most suitable or right for a particular purpose
good.s.13 resulting favorably
effective.s.04 exerting force or influence
good.s.15 capable of pleasing
good.s.16 appealing to the mind
good.s.17 in excellent physical condition
good.s.18 tending to promote physical well-being; beneficial to health
good.s.19 not forged
good.s.20 not left to spoil
good.s.21 generally admired

>>> for i in wn.synsets('great','a'):
...     print i.name, i.definition
... 
great.s.01 relatively large in size or number or extent; larger than others of its kind
great.s.02 of major significance or importance
great.s.03 remarkable or out of the ordinary in degree or magnitude or effect
bang-up.s.01 very good
capital.s.03 uppercase
big.s.13 in an advanced stage of pregnancy
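
As an aside, one off-the-shelf way to attempt this automatically is a Lesk-style disambiguator; NLTK ships a simple one in nltk.wsd, and the pywsd package mentioned just below offers more variants. A minimal call, whose output is only a guess driven by the context tokens:

from nltk import word_tokenize
from nltk.wsd import lesk

context = word_tokenize("This is a good product at a good price")
print(lesk(context, 'good', 'a'))   # returns whichever Synset the Lesk overlap picks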

Let's say you somehow got the right sense, perhaps you tried something like this (https://github.com/alvations/pywsd), and let's say you do get the POS and the synsets right:

good.a.01 having desirable or positive qualities especially those suitable for a thing specified
great.s.01 relatively large in size or number or extent; larger than others of its kind

Problem 2: How do you compare the two synsets?

Let's try the similarity functions, but you will realize that they give you no useful score:

>>> from nltk.corpus import wordnet_ic
>>> semcor_ic = wordnet_ic.ic('ic-semcor.dat')  # information-content file used by res/jcn/lin below
>>> good = wn.synsets('good','a')[0]
>>> great = wn.synsets('great','a')[0]
>>> print max(wn.path_similarity(good,great), wn.path_similarity(great, good))
None
>>> print max(wn.wup_similarity(good,great), wn.wup_similarity(great, good))

>>> print max(wn.res_similarity(good,great,semcor_ic), wn.res_similarity(great, good,semcor_ic))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/wordnet.py", line 1312, in res_similarity
    return synset1.res_similarity(synset2, ic, verbose)
  File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/wordnet.py", line 738, in res_similarity
    ic1, ic2, lcs_ic = _lcs_ic(self, other, ic)
  File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/wordnet.py", line 1643, in _lcs_ic
    (synset1, synset2))
nltk.corpus.reader.wordnet.WordNetError: Computing the least common subsumer requires Synset('good.a.01') and Synset('great.s.01') to have the same part of speech.
>>> print max(wn.jcn_similarity(good,great,semcor_ic), wn.jcn_similarity(great, good,semcor_ic))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/wordnet.py", line 1316, in jcn_similarity
    return synset1.jcn_similarity(synset2, ic, verbose)
  File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/wordnet.py", line 759, in jcn_similarity
    ic1, ic2, lcs_ic = _lcs_ic(self, other, ic)
  File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/wordnet.py", line 1643, in _lcs_ic
    (synset1, synset2))
nltk.corpus.reader.wordnet.WordNetError: Computing the least common subsumer requires Synset('good.a.01') and Synset('great.s.01') to have the same part of speech.
>>> print max(wn.lin_similarity(good,great,semcor_ic), wn.lin_similarity(great, good,semcor_ic))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/wordnet.py", line 1320, in lin_similarity
    return synset1.lin_similarity(synset2, ic, verbose)
  File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/wordnet.py", line 789, in lin_similarity
    ic1, ic2, lcs_ic = _lcs_ic(self, other, ic)
  File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/wordnet.py", line 1643, in _lcs_ic
    (synset1, synset2))
nltk.corpus.reader.wordnet.WordNetError: Computing the least common subsumer requires Synset('good.a.01') and Synset('great.s.01') to have the same part of speech.
>>> print max(wn.lch_similarity(good,great), wn.lch_similarity(great, good))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/wordnet.py", line 1304, in lch_similarity
    return synset1.lch_similarity(synset2, verbose, simulate_root)
  File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/wordnet.py", line 638, in lch_similarity
    (self, other))
nltk.corpus.reader.wordnet.WordNetError: Computing the lch similarity requires Synset('good.a.01') and Synset('great.s.01') to have the same part of speech.

Let's try different synsets. Since "good" has both "satellite adjective" and plain "adjective" senses, while "great" only has "satellite" senses, let's settle for the lowest common denominator:
good.s.13 resulting favorably
great.s.01 relatively large in size or number or extent; larger than others of its kind

You realize that there is still no similarity information available for comparing "satellite adjectives":
>>> good = wn.synset('good.s.13')    # reassign to the satellite senses chosen above
>>> great = wn.synset('great.s.01')
>>> print max(wn.lin_similarity(good,great,semcor_ic), wn.lin_similarity(great, good,semcor_ic))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/wordnet.py", line 1320, in lin_similarity
    return synset1.lin_similarity(synset2, ic, verbose)
  File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/wordnet.py", line 789, in lin_similarity
    ic1, ic2, lcs_ic = _lcs_ic(self, other, ic)
  File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/wordnet.py", line 1645, in _lcs_ic
    ic1 = information_content(synset1, ic)
  File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/wordnet.py", line 1666, in information_content
    raise WordNetError(msg % synset.pos)
nltk.corpus.reader.wordnet.WordNetError: Information content file has no entries for part-of-speech: s
>>> print max(wn.path_similarity(good,great), wn.path_similarity(great, good))
None

By now it seems WordNet is creating more problems than it solves here. Let's try another paradigm: word clustering, see http://en.wikipedia.org/wiki/Word-sense_induction. I'll also stop short of answering the broad, open-ended question the OP is asking, because there is a lot in clustering that is too auto-magically complicated for mere mortals like me =)


You say (emphasis mine):

ideally, I would like an algorithm that can determine that good and great are similar in my corpus (perhaps in a co-occurrence sense)

You can measure similarity between words by measuring how often they occur in the same sentence as other words (i.e. co-occurrence). To capture more semantic relatedness, perhaps you can also look at collocations, i.e. how often words appear within a window around a given word.
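As a rough sketch of the co-occurrence idea in plain Python (counts within a +/-2-word window over toy sentences, then cosine similarity of the resulting context vectors; the window size and the sentences are arbitrary):

import math
from collections import Counter, defaultdict

sentences = [['good', 'product', 'for', 'a', 'good', 'price'],
             ['great', 'product', 'at', 'a', 'great', 'price'],
             ['terrible', 'product']]

window = 2
contexts = defaultdict(Counter)
for sent in sentences:
    for i, word in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if j != i:
                contexts[word][sent[j]] += 1   # co-occurrence count within the window

def cosine(c1, c2):
    # Cosine similarity between two sparse context vectors (Counters).
    dot = sum(c1[w] * c2[w] for w in c1)
    norm1 = math.sqrt(sum(v * v for v in c1.values()))
    norm2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

print(cosine(contexts['good'], contexts['great']))      # overlapping contexts -> higher score
print(cosine(contexts['good'], contexts['terrible']))   # fewer shared contexts -> lower score

The same counting extends to bigrams if you first merge them into single tokens.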

This paper deals with word sense disambiguation (WSD) and uses collocations and surrounding words (co-occurrences) as part of its feature space. Its results are quite good, so I suppose you can use the same features for your problem.

In Python, you can use sklearn; in particular, you may want to look at SVMs (with example code) to get you started.

Roughly, the idea is (a sketch with sklearn follows the list):

  1. Take the pair of bigrams you want to check for similarity
  2. Using your corpus, generate collocation and co-occurrence features for each bigram
  3. Train an SVM to learn the features of the first bigram
  4. Run the SVM on occurrences of the other bigram (you get some scores)
  5. Possibly use those scores to determine whether the two bigrams are similar
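A very rough sketch of those steps with sklearn, assuming you can pull a short window of surrounding words out of your corpus for every occurrence of a bigram; the bag-of-words features and the toy context strings are placeholders:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC

# Step 2 (assumed done elsewhere): surrounding-word contexts for each bigram occurrence.
contexts_bigram1 = ["really good product would buy again", "good product fast shipping"]
contexts_other   = ["delivery was slow", "packaging arrived damaged"]
contexts_bigram2 = ["great product works as described"]   # the bigram to compare against

# Bag-of-words features over the context windows.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(contexts_bigram1 + contexts_other)
y = [1] * len(contexts_bigram1) + [0] * len(contexts_other)

# Step 3: train an SVM to recognize the first bigram's contexts.
clf = SVC(kernel='linear')
clf.fit(X, y)

# Steps 4-5: score the other bigram's contexts; consistently high scores would
# suggest the two bigrams occur in similar contexts.
print(clf.decision_function(vectorizer.transform(contexts_bigram2)))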
