Python: word cloud, repeated words

23

My word cloud contains some repeated words, and I don't understand why they aren't counted as the same word and shown as a single word.

from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt
word_string = ('oh oh oh oh oh oh verse wrote book stand title book would life superman thats make feel count privilege love ideal honored know feel see everyday things things say rock baby truth rock love rock rock everything need rock baby rock wanna kiss ya feel ya please ya right wanna touch ya love ya baby night reward ya things rock love rock love rock oh oh oh verse try count ways make smile id run fingers run timeless things talk sugar keeps going make wanna keep lovin strong make wanna try best give want need give whole heart little piece minimum talking everything single wish talking every dream rock baby truth rock love rock rock everything need rock baby rock wanna kiss ya feel ya please ya right wanna touch ya love ya baby night reward ya things rock love rock wanna rock bridge theres options dont want theyre worth time cause oh thank like us fine rock sand smile cry joy pain truth lies matter know count oh oh oh oh oh oh rock baby truth rock love rock rock everything need rock baby rock wanna kiss ya feel ya please ya right wanna touch ya love ya baby night reward ya things rock love rock love rock oh oh oh oh oh oh wanna kiss ya feel ya please ya right wanna touch ya love ya baby night reward ya things rock love rock wanna rock party people people party popping sitting around see looking looking see look started lets hook little one one come give stuff let freshin ruff lets go lets hook start wont stop baby baby dont stop come give stuff lets go black culture black culture black culture black culture party people people party popping sitting around see looking looking see look started lets hook come one give stuff let freshin little one one ruff lets go lets hook start wont stop baby baby dont stop come give stuff lets go black culture black culture black culture black culture lets hook come give stuff let freshin '
               'little one one ruff lets go lets hook start wont stop baby baby dont stop come give stuff lets go black culture black culture black culture black culture black culture black culture black culture black culture')
wordcloud = WordCloud(background_color="white",
                      width=1200, height=1000,
                      stopwords=STOPWORDS
                      ).generate(word_string)
plt.imshow(wordcloud)
plt.show()  # display the figure when run as a script

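As you can see, words such as "love", "oh", "rock", "black" and "culture" appear several times, but they don't seem to be counted together. Am I doing something wrong?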



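Do you just want to remove the duplicates from the input string? Like this? - Olian04
I don't want to remove the repeated words. The point of a word cloud is to see which words occur in a text and how often. The most common words are drawn in a larger font and the less common ones in a smaller font. So you can see that the word "ya" is very frequent. What I don't understand is why the cloud shows the same word more than once. - Alina
Ah, then I can't help you. Good luck. - Olian04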
2 Answers

95

The word_cloud project has a feature called "collocations". You can turn it off by setting collocations=False, like this:

    wordcloud = WordCloud(collocations=False).generate(word_string)

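This stops the library from treating pairs of words that frequently occur together in your text as single entries. It gets rid of things you probably don't want, such as "oh oh", but it also gets rid of things you might want, such as "black culture".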
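To see which entries the setting affects, you can look at the count table before anything is drawn. The snippet below is only a sketch: `token_counts` and `split_counts` are hypothetical helper names, and it assumes that two-word entries can be recognised by the space they contain. `process_text()` is the method on `WordCloud` that turns the text into token counts.

```python
def split_counts(counts):
    """Split a {token: count} table into single words and two-word phrases."""
    # Assumption: a collocation entry is any key that contains a space.
    singles = {w: n for w, n in counts.items() if " " not in w}
    pairs = {w: n for w, n in counts.items() if " " in w}
    return singles, pairs


def token_counts(text, collocations=True):
    """Return the token counts WordCloud would use for this text."""
    # Imported here so split_counts() can be used without wordcloud installed.
    from wordcloud import WordCloud
    return WordCloud(collocations=collocations).process_text(text)
```

Calling `split_counts(token_counts(word_string))` gives you the single words and the two-word phrases separately. With `collocations=False` the second dictionary should come back empty.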


2
This is the solution and should be accepted. - Ceren
@craigching Hi. How can I print only the collocations (and not the single terms)? Thanks! - user140259

14
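If you look at wordcloud.words_, you'll see that the frequency table includes some two-word phrases such as "oh oh", "hook start", "lets go" and "lets hook".
You would need to dig into the code behind .process_text() to work out exactly why.
As a workaround, you can split word_string yourself to build a word-frequency table, then create the image with .generate_from_frequencies().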
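A minimal sketch of that workaround could look like the following. The helper names `word_frequencies` and `cloud_from_counts` are made up for illustration, and the split is a plain whitespace split, which only suits text that is already cleaned up like the string in the question.

```python
from collections import Counter


def word_frequencies(text, stopwords=frozenset()):
    """Count single words only, so no two-word phrases can appear."""
    words = text.lower().split()  # naive whitespace split (assumption)
    return Counter(w for w in words if w not in stopwords)


def cloud_from_counts(freqs, **kwargs):
    """Build a WordCloud from a ready-made {word: count} table."""
    # Imported here so word_frequencies() works without wordcloud installed.
    from wordcloud import WordCloud
    return WordCloud(**kwargs).generate_from_frequencies(freqs)
```

For example, `cloud_from_counts(word_frequencies(word_string, STOPWORDS), background_color="white")` returns an object you can pass to `plt.imshow()`. Note that stop words are filtered in `word_frequencies` here, because the counts are handed over as they are.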
