Creating a sparse word matrix (bag-of-words model) in Python

Question


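I have a list of text files in a directory.

I want to create a matrix that contains the frequency of each word of the entire corpus in each file. (The corpus is the set of unique words across all files in the directory.)

Example: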

File 1 - "aaa", "xyz", "cccc", "dddd", "aaa"  
File 2 - "abc", "aaa"
Corpus - "aaa", "abc", "cccc", "dddd", "xyz"

Output matrix:

[[2, 0, 1, 1, 1],
 [1, 1, 0, 0, 0]]

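My solution is to use collections.Counter to count the words in each file, which gives one dictionary per file, and then to initialize a list of lists of size n × m (n is the number of files, m is the number of unique words in the corpus). I then iterate over the files again, look up the frequency of each word in the Counter object, and fill it into the corresponding list.

Is there a better way to solve this? Perhaps in a single pass using collections.Counter?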
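
For reference, the two-pass approach described above might look roughly like the following. This is a minimal sketch, assuming each file has already been read and split into a list of words; the function name count_matrix and the variable names are illustrative, not from the original post.

```python
from collections import Counter


def count_matrix(documents):
    """Return (corpus, matrix) for an iterable of tokenized documents."""
    # Pass 1: count the words of every document.
    counts = [Counter(words) for words in documents]

    # The corpus is every unique word, sorted so the column order is stable.
    corpus = sorted(set().union(*counts))
    column = {word: j for j, word in enumerate(corpus)}

    # Pass 2 (over the counters, not the files): fill an n x m list of lists.
    matrix = [[0] * len(corpus) for _ in counts]
    for i, counter in enumerate(counts):
        for word, n in counter.items():
            matrix[i][column[word]] = n
    return corpus, matrix


file_1 = ["aaa", "xyz", "cccc", "dddd", "aaa"]
file_2 = ["abc", "aaa"]
corpus, matrix = count_matrix([file_1, file_2])
print(corpus)  # ['aaa', 'abc', 'cccc', 'dddd', 'xyz']
print(matrix)  # [[2, 0, 1, 1, 1], [1, 1, 0, 0, 0]]
```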

- ihmpall

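Are you using any libraries from the SciPy stack? - Igor Raush

Do you really need the full matrix, or would a sparse representation be enough? A list of dictionaries (yes, collections.Counter works very well) might do the job. - Prune

@IgorRaush No, nothing else, only collections.Counter. - ihmpall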

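@ihmpall, I can't think of a way to solve your problem in a single pass over the files; what you have now is probably your best option. Think about it: if you don't know your corpus ahead of time, you can't initialize the file vectors to the right dimension, so even if you build the corpus and collect the non-zero indices in a single pass, you still need another pass to convert those non-zero indices into vectors. - Igor Raush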

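A small optimization to your existing approach is to use set() instead of Counter() in the first pass. If you only want to build the corpus, you don't need a Counter. In any case, I strongly recommend considering a sparse matrix implementation such as scipy.sparse.csr_matrix. - Igor Raush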
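
The idea from these comments (collect the non-zero entries while reading, then convert them once the corpus size is known) could be sketched as below. This is only an illustration under stated assumptions: the documents are already tokenized, the helper name build_sparse_counts is made up here, and columns are assigned in first-seen order rather than alphabetically. The files are read once, but the conversion step the comment mentions still happens when csr_matrix is constructed.

```python
from collections import Counter

from scipy.sparse import csr_matrix


def build_sparse_counts(documents):
    """Return (matrix, vocabulary) for an iterable of tokenized documents."""
    vocabulary = {}  # word -> column index, grown as new words appear
    rows, cols, values = [], [], []
    n_docs = 0
    for row, words in enumerate(documents):
        n_docs += 1
        for word, count in Counter(words).items():
            # A new word gets the next free column index.
            col = vocabulary.setdefault(word, len(vocabulary))
            rows.append(row)
            cols.append(col)
            values.append(count)
    # The final shape is only known here, after every document has been seen.
    matrix = csr_matrix(
        (values, (rows, cols)), shape=(n_docs, len(vocabulary)), dtype=int
    )
    return matrix, vocabulary


file_1 = ["aaa", "xyz", "cccc", "dddd", "aaa"]
file_2 = ["abc", "aaa"]
X, vocab = build_sparse_counts([file_1, file_2])
print(vocab)        # {'aaa': 0, 'xyz': 1, 'cccc': 2, 'dddd': 3, 'abc': 4}
print(X.toarray())
```

Only the non-zero counts are stored, so memory grows with the number of distinct (file, word) pairs rather than with n × m.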

1 Answer

- Igor Raush · Accepted Answer

Here is a fairly simple solution that uses sklearn.feature_extraction.DictVectorizer.

from sklearn.feature_extraction import DictVectorizer
from collections import Counter, OrderedDict

File_1 = ('aaa', 'xyz', 'cccc', 'dddd', 'aaa')
File_2 = ('abc', 'aaa')

v = DictVectorizer()

# discover corpus and vectorize file word frequencies in a single pass
X = v.fit_transform(Counter(f) for f in (File_1, File_2))

# or, if you have a pre-defined corpus and/or would like to restrict the words you consider
# in your matrix, you can do

# Corpus = ('aaa', 'abc', 'cccc', 'dddd', 'xyz')
# v.fit([OrderedDict.fromkeys(Corpus, 1)])
# X = v.transform(Counter(f) for f in (File_1, File_2))

# X is a SciPy sparse matrix; call toarray() to get a dense numpy.ndarray
# representation (the older X.A shortcut may not be available in newer SciPy releases)
print(repr(X))
print(repr(X.toarray()))

<2x5 sparse matrix of type '<type 'numpy.float64'>'
        with 6 stored elements in Compressed Sparse Row format>
array([[ 2.,  0.,  1.,  1.,  1.],
       [ 1.,  1.,  0.,  0.,  0.]])

The word-to-index mapping can be accessed via v.vocabulary_.

{'aaa': 0, 'abc': 1, 'cccc': 2, 'dddd': 3, 'xyz': 4}