Creating a vocabulary dictionary for text mining

Question

3

I have the following code:

train_set = ("The sky is blue.", "The sun is bright.")
test_set = ("The sun in the sky is bright.",
    "We can see the shining sun, the bright sun.")

Now I am trying to count the word frequencies like this:

    from sklearn.feature_extraction.text import CountVectorizer
    vectorizer = CountVectorizer()

Next I want to print the vocabulary, so I do the following:

vectorizer.fit_transform(train_set)
print vectorizer.vocabulary

Currently the output I get is "None". However, I expected something like this:

{'blue': 0, 'sun': 1, 'bright': 2, 'sky': 3}

Any idea what is going wrong?

- Frits Verstraten

1

Possible duplicate of "CountVectorizer does not print vocabulary". - José Sánchez

2 Answers

4

CountVectorizer doesn't support what you are looking for.

You can use the Counter class:

from collections import Counter

train_set = ("The sky is blue.", "The sun is bright.")
word_counter = Counter()
for s in train_set:
    word_counter.update(s.split())

print(word_counter)

which gives

Counter({'is': 2, 'The': 2, 'blue.': 1, 'bright.': 1, 'sky': 1, 'sun': 1})

Or you can use FreqDist from nltk:

from nltk import FreqDist

train_set = ("The sky is blue.", "The sun is bright.")
word_dist = FreqDist()
for s in train_set:
    word_dist.update(s.split())

print(dict(word_dist))

which gives

{'blue.': 1, 'bright.': 1, 'is': 2, 'sky': 1, 'sun': 1, 'The': 2}
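Note that both outputs above keep the trailing periods ('blue.', 'bright.') and the capitalised 'The', because str.split() only splits on whitespace. If that is not wanted, a rough sketch like the following lowercases the text and extracts word characters before counting. The regular expression here is my own assumption about what should count as a word, not something from the question, so adjust it to your data:

```python
import re
from collections import Counter

train_set = ("The sky is blue.", "The sun is bright.")

word_counter = Counter()
for sentence in train_set:
    # Lowercase first, then keep runs of letters, digits and apostrophes.
    # This pattern is an illustrative assumption, not a standard tokenizer.
    tokens = re.findall(r"[a-z0-9']+", sentence.lower())
    word_counter.update(tokens)

print(dict(word_counter))
```

The same tokens could be passed to FreqDist.update() instead, since it accepts any iterable of tokens just like Counter does.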

- Aris F.

- José Sánchez · Accepted Answer

I think you can try this:

print(vectorizer.vocabulary_)
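To spell that out: vocabulary (no underscore) is the constructor parameter, which stays None unless you pass your own mapping, while vocabulary_ (trailing underscore) is the attribute filled in by fit or fit_transform. Here is a minimal end-to-end sketch using the train_set from the question. The stop_words="english" variant and the term_counts dictionary are my own additions for illustration: the first is only a guess at how the four-word vocabulary the question expects could come about, and the second shows one way to get per-word totals, since vocabulary_ holds column indices rather than frequencies.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

train_set = ("The sky is blue.", "The sun is bright.")

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(train_set)  # learns the vocabulary and counts terms

# Learned mapping: term -> column index in X (indices, not frequencies).
print(vectorizer.vocabulary_)

# Assumption: the expected output drops "the" and "is", which looks like
# English stop-word filtering, so try the built-in stop-word list.
filtered = CountVectorizer(stop_words="english")
filtered.fit(train_set)
print(filtered.vocabulary_)

# Total count per term: sum each column of the document-term matrix.
totals = np.asarray(X.sum(axis=0)).ravel()
term_counts = {term: int(totals[idx]) for term, idx in vectorizer.vocabulary_.items()}
print(term_counts)
```

The exact indices depend on the alphabetical order of the learned terms, so they will not necessarily match the numbering shown in the question.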