Generating a word cloud of bigrams in Python


I am using the Wordcloud package in Python to generate a word cloud directly from a text file. Here is the code, reused from StackOverflow:

import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS


def random_color_func(word=None, font_size=None, position=None, orientation=None, font_path=None, random_state=None):
    # Fixed hue and saturation; only the lightness varies randomly.
    h = int(360.0 * 45.0 / 255.0)
    s = int(100.0 * 255.0 / 255.0)
    l = int(100.0 * float(random_state.randint(60, 120)) / 255.0)

    return "hsl({}, {}%, {}%)".format(h, s, l)


file_content = open("xyz.txt").read()

wordcloud = WordCloud(font_path=r'C:\Windows\Fonts\Verdana.ttf',
                      stopwords=STOPWORDS,
                      background_color='white',
                      width=1200,
                      height=1000,
                      color_func=random_color_func
                      ).generate(file_content)

plt.imshow(wordcloud, interpolation="bilinear")
plt.axis('off')
plt.show()

What I get is a word cloud of single words. Is there a parameter in WordCloud() that I can pass to get n-grams, without having to reformat the text file?

I want a word cloud of bigrams, or of words joined by an underscore in the display. For example: machine_learning (where Machine and Learning would otherwise be 2 separate words).


Hmm... don't use file_content; use something different, which you can obtain by processing file_content. - mkrieger1
If you're using this WordCloud, it accepts a regexp which influences how the input text is split into words. - mkrieger1
How do I pass a bigram regexp? - DreamerP
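For reference, a minimal sketch of the regexp approach mentioned in the comments, with a made-up sample text: WordCloud tokenizes its input with re.findall(regexp, text), so a pattern that matches two whitespace-separated words yields two-word tokens. Note that re.findall is non-overlapping, so this only captures alternating pairs (words 1-2, 3-4, ...), not every bigram.

from wordcloud import WordCloud

# Hypothetical sample text; in the question this would be file_content.
text = "machine learning makes machine learning applications much easier"

# Pattern for two whitespace-separated words; collocations=False keeps
# the regexp tokens as-is instead of re-pairing them into 4-word tokens.
wc = WordCloud(regexp=r"\w[\w']*\s+\w[\w']*", collocations=False).generate(text)
print(wc.words_)  # two-word tokens such as 'machine learning'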
3 Answers


You can easily generate a bigram word cloud by lowering the value of the collocation_threshold parameter in WordCloud.

Edit your word cloud:

wordcloud = WordCloud(font_path=r'C:\Windows\Fonts\Verdana.ttf',
                      stopwords=STOPWORDS,
                      background_color='white',
                      width=1200,
                      height=1000,
                      color_func=random_color_func,
                      collocation_threshold=3  # added to your question's code; try values between 1 and 50
                      ).generate(file_content)

For more information, see:

collocation_threshold: int, default=30. Bigrams must have a Dunning likelihood collocation score greater than this parameter to be counted as bigrams. The default of 30 is arbitrary.

You can also find the source code of wordcloud.WordCloud here: https://amueller.github.io/word_cloud/_modules/wordcloud/wordcloud.html
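As a quick check, here is a self-contained sketch (with a hypothetical sample text) of the effect of lowering the threshold; in recent versions of the library, a lower collocation_threshold admits word pairs with weaker Dunning likelihood scores as bigrams:

import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Hypothetical sample text with a repeated word pair.
sample = ("machine learning is fun and machine learning is powerful; "
          "deep learning extends machine learning")

# collocations=True is the default; lowering collocation_threshold lets
# pairs with weaker collocation scores through as bigrams.
wc = WordCloud(background_color='white', collocation_threshold=3).generate(sample)
print(wc.words_)  # with a low threshold, pairs like 'machine learning' can appear as single entries

plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()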



Thanks to Diego for his answer. Here is a Python code continuation of Diego's answer.

import re
from string import digits

import matplotlib.pyplot as plt
import nltk
from sklearn.feature_extraction.text import CountVectorizer
from wordcloud import WordCloud, STOPWORDS

# Requires the NLTK 'punkt' and 'wordnet' data packages:
# nltk.download('punkt'); nltk.download('wordnet')
WNL = nltk.WordNetLemmatizer()
text = 'your input text goes here'
# Lowercase and tokenize
text = text.lower()
# Remove single quotes early since they cause problems with the tokenizer.
text = text.replace("'", "")
# Remove numbers from the text
remove_digits = str.maketrans('', '', digits)
text = text.translate(remove_digits)
tokens = nltk.word_tokenize(text)
text1 = nltk.Text(tokens)

# Remove extra characters and remove stop words.
text_content = [''.join(re.split(r"[ .,;:!?‘’``''@#$%^_&*()<>{}~\n\t\\-]", word)) for word in text1]

# Set the stopwords list
stopwords_wc = set(STOPWORDS)
customized_words = ['xxx', 'yyy']  # any particular words to remove from the text that do not contribute much meaning

new_stopwords = stopwords_wc.union(customized_words)
text_content = [word for word in text_content if word not in new_stopwords]

# Removing the punctuation above still leaves empty entries in the list.
text_content = [s for s in text_content if len(s) != 0]

# Best to get the lemma of each word, to reduce the number of similar words.
text_content = [WNL.lemmatize(t) for t in text_content]

bigrams_list = list(nltk.bigrams(text_content))
print(bigrams_list)
dictionary2 = [' '.join(tup) for tup in bigrams_list]
print(dictionary2)

# Use CountVectorizer to view the frequency of the bigrams
vectorizer = CountVectorizer(ngram_range=(2, 2))
bag_of_words = vectorizer.fit_transform(dictionary2)
sum_words = bag_of_words.sum(axis=0)
words_freq = [(word, sum_words[0, idx]) for word, idx in vectorizer.vocabulary_.items()]
words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
print(words_freq[:100])

# Generate the word cloud and save it as a jpg image
words_dict = dict(words_freq)
WC_height = 1000
WC_width = 1500
WC_max_words = 200
wordCloud = WordCloud(max_words=WC_max_words, height=WC_height, width=WC_width, stopwords=new_stopwords)
wordCloud.generate_from_frequencies(words_dict)
plt.title('Most frequently occurring bigrams connected by same colour and font size')
plt.imshow(wordCloud, interpolation='bilinear')
plt.axis("off")
plt.show()
wordCloud.to_file('wordcloud_bigram.jpg')

Thank you for this code. It needed a couple of corrections - the customized words variable was spelled once with a 'z' and once with an 's'. You also need to add `from string import digits`. - saujosai


You should use vectorizer = CountVectorizer(ngram_range=(2, 2)) to get the frequencies, and then use the .generate_from_frequencies method of WordCloud.
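For completeness, a minimal sketch of this approach (the input documents here are hypothetical):

import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer
from wordcloud import WordCloud

# Hypothetical input documents.
docs = ["machine learning is fun", "machine learning is powerful"]

# Count bigram frequencies with CountVectorizer ...
vectorizer = CountVectorizer(ngram_range=(2, 2))
bag = vectorizer.fit_transform(docs)
counts = bag.sum(axis=0)
freqs = {term: int(counts[0, idx]) for term, idx in vectorizer.vocabulary_.items()}

# ... then feed the frequency dict straight into WordCloud.
wc = WordCloud(background_color='white').generate_from_frequencies(freqs)
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()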

