How does a vectorizer's fit_transform work in sklearn?

Question

11

I'm trying to understand the following code:

    from sklearn.feature_extraction.text import CountVectorizer

    vectorizer = CountVectorizer()
    corpus = ['This is the first document.',
              'This is the second second document.',
              'And the third one.',
              'Is this the first document?']
    X = vectorizer.fit_transform(corpus)

When I print X to see what is returned, I get the following:

    (0, 1)  1
    (0, 2)  1
    (0, 6)  1
    (0, 3)  1
    (0, 8)  1
    (1, 5)  2
    (1, 1)  1
    (1, 6)  1
    (1, 3)  1
    (1, 8)  1
    (2, 4)  1
    (2, 7)  1
    (2, 0)  1
    (2, 6)  1
    (3, 1)  1
    (3, 2)  1
    (3, 6)  1
    (3, 3)  1
    (3, 8)  1

However, I don't understand what this result means.

- Leo

3

This is a sparse matrix. Use X.toarray() to convert it to a dense matrix, then print that. - Vivek Kumar

1

But what do the numbers here mean? For example, "(3, 6) 1". Could you explain in detail? - Leo

1

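In a sparse matrix, most entries are zero, so they are not stored, which saves memory. The numbers in parentheses are the (row, column) index of a value in the matrix, and 1 is the value itself (the number of times the term appears in the document represented by that row). - Vivek Kumar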

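If "1" is the number of times the term appears in the document, then why, in the first document, where "the" appears 2 times, do all the positions (from (0, 1) to (0, 8)) have the same value of 1? - Leo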

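Maybe "the" is a stop word and was not included in the learned vocabulary. Print vectorizer.get_feature_names() (get_feature_names_out() in scikit-learn 1.0 and later) to check which vocabulary words the indices actually refer to. - Vivek Kumar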

I see, it shows the list of words with their indices, and indeed "second" appears twice. Thank you very much, you really saved my day :) - Leo
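
The comment thread above can be checked with a minimal sketch that prints the dense form of X next to the learned vocabulary. It assumes a default CountVectorizer, and it derives the column order from vocabulary_ instead of calling get_feature_names(), so it should not depend on which scikit-learn release is installed. The names feature_names and dense are only illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

# vocabulary_ maps word -> column index; sorting the words by that index
# gives the column order of the matrix.
feature_names = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

# One row per document, one column per vocabulary word, zeros included.
dense = X.toarray()

print(feature_names)
print(dense)
```

With the default settings, "the" is part of the vocabulary (column 6) and occurs once in each document. The only count of 2 belongs to "second" (column 5) in the second document.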

3 Answers

3

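You can read it as "(sentence_index, feature_index) count". There are four sentences, so the sentence index starts at 0 and ends at 3. The feature index is the word index, which you can look up in vectorizer.vocabulary_. vocabulary_ is a dictionary {word: feature_index, ...}. So, for the entry (0, 1) 1:

-> 0 : row (the sentence index)

-> 1 : feature index; the word is the key in vectorizer.vocabulary_ whose value is 1 (here, "document")

-> 1 : count/tfidf (since you used a count vectorizer, it gives you the count)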
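
The lookup described above can be sketched as follows. Because vocabulary_ maps words to indices, the sketch inverts it first; index_to_word and decoded are illustrative names, not part of scikit-learn, and a default CountVectorizer is assumed.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

# Invert vocabulary_ (word -> index) into index -> word.
index_to_word = {index: word for word, index in vectorizer.vocabulary_.items()}

# The COO format exposes parallel arrays of row indices, column indices and values.
coo = X.tocoo()
decoded = [
    (int(row), index_to_word[int(col)], int(count))
    for row, col, count in zip(coo.row, coo.col, coo.data)
]

for row, word, count in decoded:
    print(f"document {row}: {word!r} appears {count} time(s)")
```

Each printed line corresponds to one "(row, column) value" entry of the sparse output, with the column index replaced by its word.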

If you use a tfidf vectorizer instead of a count vectorizer, you will get tfidf values. I hope that makes it clear.

- Himanshu Kriplani

-2

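It converts text to numbers. So, with other functions, you will be able to count how many times each word occurs in a given dataset. I'm new to programming, so there may be other areas where it can be used.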

- Kaan

- Anjani Dhrangadhariya · Accepted Answer

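As @Himanshu wrote, each entry reads as "(sentence_index, feature_index) count". Here, the "count" part is the number of times a word appears in a document. For example: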

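    (0, 1) 1
    (0, 2) 1
    (0, 6) 1
    (0, 3) 1
    (0, 8) 1
    (1, 5) 2   <- in this example only, the count "2" means the word "second" appears twice in this document/sentence
    (1, 1) 1
    (1, 6) 1
    (1, 3) 1
    (1, 8) 1
    (2, 4) 1
    (2, 7) 1
    (2, 0) 1
    (2, 6) 1
    (3, 1) 1
    (3, 2) 1
    (3, 6) 1
    (3, 3) 1
    (3, 8) 1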

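Let's change the corpus in your code. Basically, I added the word "second" two more times to the second sentence of the corpus list, so it now appears four times.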

    from sklearn.feature_extraction.text import CountVectorizer

    vectorizer = CountVectorizer()
    corpus = ['This is the first document.',
              'This is the second second second second document.',
              'And the third one.',
              'Is this the first document?']
    X = vectorizer.fit_transform(corpus)

    (0, 1) 1
    (0, 2) 1
    (0, 6) 1
    (0, 3) 1
    (0, 8) 1
    (1, 5) 4   <- for the modified corpus, the count "4" means the word "second" appears four times in this document/sentence
    (1, 1) 1
    (1, 6) 1
    (1, 3) 1
    (1, 8) 1
    (2, 4) 1
    (2, 7) 1
    (2, 0) 1
    (2, 6) 1
    (3, 1) 1
    (3, 2) 1
    (3, 6) 1
    (3, 3) 1
    (3, 8) 1
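
To see where these numbers come from, here is a rough pure-Python sketch of what fit_transform does with default settings: lowercase the text, keep tokens of two or more word characters, number the distinct words in sorted order, then count them per document. This is a simplified illustration, not scikit-learn's actual implementation, and fit_transform_sketch is a hypothetical helper name.

```python
import re
from collections import Counter


def fit_transform_sketch(corpus):
    """Return (vocabulary, entries), approximating a default CountVectorizer."""
    # Lowercase each document and keep tokens of two or more word characters.
    token_pattern = re.compile(r"\b\w\w+\b")
    tokenized = [token_pattern.findall(doc.lower()) for doc in corpus]

    # "fit": collect the distinct tokens and number them in sorted order.
    words = sorted({token for tokens in tokenized for token in tokens})
    vocabulary = {word: index for index, word in enumerate(words)}

    # "transform": count tokens per document and keep only the non-zero counts.
    entries = {}
    for row, tokens in enumerate(tokenized):
        for word, count in Counter(tokens).items():
            entries[(row, vocabulary[word])] = count
    return vocabulary, entries


corpus = [
    'This is the first document.',
    'This is the second second second second document.',
    'And the third one.',
    'Is this the first document?',
]

vocabulary, entries = fit_transform_sketch(corpus)
for (row, col), count in sorted(entries.items()):
    print((row, col), count)
```

The entries dictionary holds the same (row, column) to count pairs as the sparse output above, only printed in sorted order.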