How do I create a word cloud from a corpus in Python?

Question


python nltk corpus gensim word-cloud

48

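In "Creating a subset of words from a corpus in R", the answerer can easily convert a term-document matrix into a word cloud.

Is there a similar Python library function that turns a raw text file of words, an NLTK corpus, or a Gensim MmCorpus into a word cloud?

The result would look something like this: [image]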

- alvas

1

After some frantic reimplementation, here is a less "sklearn" solution that uses Andreas Mueller's code: https://github.com/alvations/translation-cloud - alvas

5 Answers

12

An example using amueller's code

In the command line / terminal:

sudo pip install wordcloud

Then run the Python script:

## Simple WordCloud
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS

text = 'all your base are belong to us all of your base base base'

def generate_wordcloud(text):  # optionally pass stopwords=STOPWORDS instead of the set below
    wordcloud = WordCloud(font_path='/Library/Fonts/Verdana.ttf',  # macOS path; adjust or omit elsewhere
                          width=800, height=400,
                          relative_scaling=1.0,  # size words by frequency rather than rank
                          stopwords={'to', 'of'}  # a set of words to exclude
                          ).generate(text)

    plt.figure(1, figsize=(8, 4))
    plt.imshow(wordcloud)
    plt.axis('off')
    ## Pick one:
    # plt.show()
    plt.savefig("WordCloud.png")

generate_wordcloud(text)

- MyopicVisage

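Actually, this is a rather deceptive word cloud. Because it is normalized by pixels and word length, "us" ends up bigger than "base" even though the counts are the same. - alvas

Please check the documentation. You can change the stopwords and the relative scaling of the plot (the balance between frequency and rank when sizing words). By default, relative scaling is 0 (rank); I think you are looking for relative_scaling = 1.0 (frequency). - MyopicVisage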

1

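Could you put that into the answer, and also generate a different word cloud using 1.0? Thanks! That would help future readers =) - alvas

I'd like to make a small correction to the stopwords argument: change it to stopwords = {'to', 'of'}. - StatguyUser

How can I save the image at high resolution? - Sigur

Read lines 164-168 of amueller's code. Source: https://github.com/amueller/word_cloud/blob/master/wordcloud/wordcloud.py If you intend to save the image, you need to add arguments for the canvas width and height, and add a line for the figure size. - MyopicVisage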
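On the high-resolution question above, one way to sketch that advice is to size the word cloud canvas to the pixel dimensions of the target figure (figure size in inches times DPI) and write the canvas straight to disk. The helper names `canvas_size` and `save_wordcloud_hires`, the default file name, and the default DPI are illustrative assumptions; `WordCloud(width=..., height=...)`, `generate()` and `to_file()` are calls provided by the wordcloud package.

```python
def canvas_size(fig_width_in, fig_height_in, dpi):
    """Return (width, height) in pixels for a figure of the given size and DPI."""
    return int(round(fig_width_in * dpi)), int(round(fig_height_in * dpi))


def save_wordcloud_hires(text, path="WordCloud_hires.png", figsize=(8, 4), dpi=300):
    """Render `text` on a canvas of figsize * dpi pixels and write it to `path`."""
    # Imported lazily so canvas_size() stays usable without wordcloud installed.
    from wordcloud import WordCloud

    width, height = canvas_size(figsize[0], figsize[1], dpi)
    wordcloud = WordCloud(width=width, height=height).generate(text)
    # to_file() writes the rendered canvas directly, without going through matplotlib.
    wordcloud.to_file(path)
    return width, height
```

For example, `save_wordcloud_hires(text)` with the defaults would render on a 2400 x 1200 pixel canvas. If you prefer to keep `plt.savefig`, passing a `dpi` argument there is the matplotlib-side equivalent.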

10

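If you need to show these word clouds on a website or in a web app, you can convert your data to JSON or CSV and load it into a JavaScript visualization library such as d3. Word Clouds on d3

If not, Marcin's answer is a good way to do what you describe.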
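As a rough sketch of the JSON route, the snippet below counts words with the standard library only and serializes them as a list of objects. The function names and the `{"text": ..., "size": ...}` shape are assumptions chosen for illustration; check the shape your JavaScript word-cloud layout expects before relying on it.

```python
import json
import re
from collections import Counter


def word_frequencies(text, stopwords=frozenset(), max_words=200):
    """Count lowercase word tokens in `text`, skipping anything in `stopwords`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(t for t in tokens if t not in stopwords)
    return counts.most_common(max_words)


def to_d3_json(frequencies):
    """Serialize (word, count) pairs as a JSON list of {"text", "size"} objects."""
    return json.dumps([{"text": word, "size": count} for word, count in frequencies])


text = 'all your base are belong to us all of your base base base'
freqs = word_frequencies(text, stopwords={'to', 'of'})
payload = to_d3_json(freqs)  # a JSON string that can be written to a file or served
```

The same `freqs` pairs could be written out with the `csv` module instead if the front end loads CSV.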

- valentinos

3

Here is a short version of the code:

# make word cloud

from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt

stopwords = set(STOPWORDS)

def show_wordcloud(data, title=None):
    wordcloud = WordCloud(
        background_color='white',
        stopwords=stopwords,
        max_words=200,
        max_font_size=40,
        scale=3,
        random_state=1  # chosen at random by flipping a coin; it was heads
    ).generate(str(data))

    fig = plt.figure(1, figsize=(12, 12))
    plt.axis('off')
    if title:
        fig.suptitle(title, fontsize=20)
        fig.subplots_adjust(top=2.3)

    plt.imshow(wordcloud)
    plt.show()


if __name__ == '__main__':
    # text_str is not defined in the original answer; this placeholder string is an assumption.
    text_str = 'all your base are belong to us all of your base base base'
    show_wordcloud(text_str)

- Ujjawal107

0

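import matplotlib.pyplot as plt
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from wordcloud import WordCloud

# Assumes DF is a DataFrame with a text column "W" and a label column "T".
cv = CountVectorizer()
cvData = cv.fit_transform(DF["W"]).toarray()
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
cvDF = pd.DataFrame(data=cvData, columns=cv.get_feature_names_out())
cvDF["target"] = DF["T"].values

def w_count(tar):
    """Collect (word, count) pairs with count > 4 from documents labelled `tar`."""
    MO = cvDF[cvDF["target"] == tar].drop("target", axis=1)
    x = []
    y = []
    for i in range(MO.shape[0]):
        for j in MO.columns:
            if MO.iloc[i][j] > 4:
                x.append(j)
                y.append(MO.iloc[i][j])
    return x, y

# One bar chart per distinct label (the original looped over every row's label)
for i in cvDF["target"].unique():
    x, y = w_count(i)
    plt.figure(figsize=(10, 6))
    plt.title(i)
    plt.bar(x, y)
    plt.xticks(rotation="vertical")
    plt.show()

# One word cloud per document, built from words that occur more than once
word_columns = cvDF.drop("target", axis=1)
for c in range(len(DF)):
    row = word_columns.iloc[c]
    data = {word: int(count) for word, count in row.items() if count > 1}
    if not data:
        continue  # generate_from_frequencies() raises on an empty dict
    wc = WordCloud(width=800, height=400, max_words=200).generate_from_frequencies(data)
    plt.figure(figsize=(10, 10))
    plt.imshow(wc, interpolation='bilinear')
    plt.axis('off')
    plt.title(DF['T'].iloc[c])
    plt.show()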
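The `generate_from_frequencies` call above only needs a word-to-count mapping, so the same idea may carry over to a bag-of-words corpus such as the Gensim MmCorpus mentioned in the question. The sketch below assumes each document is a list of `(token_id, count)` pairs and that `id2word` maps ids back to words; the toy data and the function name `bow_to_frequencies` are hypothetical stand-ins, not Gensim objects.

```python
from collections import Counter


def bow_to_frequencies(bow_corpus, id2word):
    """Sum per-document (token_id, count) pairs into one {word: total} dict."""
    totals = Counter()
    for document in bow_corpus:
        for token_id, count in document:
            totals[id2word[token_id]] += count
    return dict(totals)


# Toy stand-ins for a bag-of-words corpus and its id-to-word mapping (hypothetical data).
id2word = {0: "base", 1: "your", 2: "all"}
bow_corpus = [[(0, 1), (1, 1)], [(0, 3), (2, 2)]]
frequencies = bow_to_frequencies(bow_corpus, id2word)
```

The resulting dict can then be passed to `WordCloud(...).generate_from_frequencies(frequencies)` in the same way as `data` in the loop above.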

- Mike

pr = agc.fit_predict(features.toarray()) plt.figure(figsize=(10,10)) plt.scatter(pca_feat[:,0], pca_feat[:,1], c = brc_pred) - Mike


- HeadAndTail · Accepted Answer

from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt

stopwords = set(STOPWORDS)

def show_wordcloud(data, title=None):
    # str() on a pandas Series yields a truncated summary, so join the items instead
    text = data if isinstance(data, str) else " ".join(map(str, data))
    wordcloud = WordCloud(
        background_color='white',
        stopwords=stopwords,
        max_words=200,
        max_font_size=40,
        scale=3,
        random_state=1  # chosen at random by flipping a coin; it was heads
    ).generate(text)

    fig = plt.figure(1, figsize=(12, 12))
    plt.axis('off')
    if title:
        fig.suptitle(title, fontsize=20)
        fig.subplots_adjust(top=2.3)

    plt.imshow(wordcloud)
    plt.show()

# Assumes these are DataFrames that each have a text column named 'Reviews'.
show_wordcloud(Samsung_Reviews_Negative['Reviews'])
show_wordcloud(Samsung_Reviews_positive['Reviews'])