What is the Python equivalent of R's removeSparseTerms function?

Question

python r machine-learning scikit-learn tm

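We are working on a data mining project and have been using the removeSparseTerms function from R's tm package to reduce the number of features in our document-term matrix.

However, we would like to port the code to Python. Is there similar functionality in sklearn, nltk, or some other package?

Thanks!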

- AnirudhJ

1 Answer

- omerbp · Accepted Answer

If your data is plain text, you can use CountVectorizer to get this job done.

For example:

from sklearn.feature_extraction.text import CountVectorizer

# min_df=2 keeps only the terms that occur in at least two documents
vectorizer = CountVectorizer(min_df=2)
corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]
vectorizer = vectorizer.fit(corpus)
print(vectorizer.vocabulary_)
# prints a mapping such as (key order may differ):
# {'this': 4, 'is': 2, 'the': 3, 'document': 0, 'first': 1}
X = vectorizer.transform(corpus)

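Now X is the document-term matrix. (If you are into information retrieval, also consider Tf-idf term weighting.)

This gets you the document-term matrix in just a few lines of code.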
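As a rough illustration of the Tf-idf remark, the same filtering can be combined with Tf-idf weighting through scikit-learn's TfidfVectorizer, which accepts the same min_df parameter. This is a minimal sketch reusing the corpus from the example; the variable names tfidf and X_tfidf are chosen here for illustration only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

# min_df=2 drops terms that occur in fewer than two documents,
# the same rule CountVectorizer applies.
tfidf = TfidfVectorizer(min_df=2)
X_tfidf = tfidf.fit_transform(corpus)  # sparse matrix of Tf-idf weights

print(X_tfidf.shape)              # (number of documents, number of kept terms)
print(sorted(tfidf.vocabulary_))  # the terms that survived the filter
```

The result is still a sparse matrix, so the column-masking approach further down applies to it as well.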

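Regarding sparsity, you can control the following parameters:

- min_df: the minimum document frequency a term must have to be kept in the document-term matrix.
- max_features: the maximum number of features allowed in the document-term matrix.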
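To tie these parameters back to removeSparseTerms, here is a hedged sketch. It assumes that R's sparse argument is the largest fraction of documents a term may be missing from, so that 1 - sparse can be passed as a proportional min_df (scikit-learn reads a float min_df as a fraction of documents). The two libraries may treat terms sitting exactly on the threshold differently, so take this as an approximation rather than an exact port. The value of sparse below is made up for the example.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

# Assumption: `sparse` mirrors the argument of R's removeSparseTerms.
# The value 0.5 is hypothetical, chosen only for illustration.
sparse = 0.5

# A float min_df is a proportion: keep terms found in at least 50% of documents.
vectorizer = CountVectorizer(min_df=1 - sparse)
X = vectorizer.fit_transform(corpus)
kept_terms = sorted(vectorizer.vocabulary_)
print(kept_terms)
print(X.shape)

# max_features caps the vocabulary size instead of thresholding frequency.
top3 = CountVectorizer(max_features=3).fit(corpus)
print(len(top3.vocabulary_))
```

With four documents, a proportion of 0.5 amounts to the same cut-off as min_df=2 in the example above.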

Alternatively, if you already have a document-term matrix or a Tf-idf matrix and you know what counts as sparse, define MIN_VAL_ALLOWED and then do the following:

import numpy as np
from scipy.sparse import csr_matrix

MIN_VAL_ALLOWED = 2

X = csr_matrix([[7, 8, 0],
                [2, 1, 1],
                [5, 5, 0]])

# z is a boolean mask of the non-sparse terms:
# columns whose total exceeds MIN_VAL_ALLOWED
z = np.squeeze(np.asarray(X.sum(axis=0) > MIN_VAL_ALLOWED))

print(X[:, z].toarray())
# prints X without the third term (as it is sparse):
# [[7 8]
#  [2 1]
#  [5 5]]

If you want to threshold on document frequency instead, binarize the matrix first and then use it in the same way:

(Use X = X[:, z] so that X remains a csr_matrix.)

import numpy as np
from scipy.sparse import csr_matrix

MIN_DF_ALLOWED = 2

X = csr_matrix([[7, 1.3, 0.9, 0],
                [2, 1.2, 0.8, 1],
                [5, 1.5, 0, 0]])

# Create a copy of the data and binarize it,
# so that column sums become document frequencies
B = csr_matrix(X, copy=True)
B[B > 0] = 1
# Threshold on the binarized copy B, not on X itself
z = np.squeeze(np.asarray(B.sum(axis=0) > MIN_DF_ALLOWED))
print(X[:, z].toarray())
# prints:
# [[7.  1.3]
#  [2.  1.2]
#  [5.  1.5]]

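In this example, the third and fourth terms (columns) are gone, because they appear in only two documents and one document (rows) respectively. Use MIN_DF_ALLOWED to set the threshold; since the comparison is strict, a term is kept only if it appears in more than MIN_DF_ALLOWED documents.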
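The document-frequency filter can also be wrapped in a small reusable helper. The sketch below is one possible way to do it, not part of scikit-learn or SciPy: the function name remove_sparse_terms and its min_df argument are hypothetical, and unlike the strict comparison above it keeps terms whose document frequency is at least min_df.

```python
import numpy as np
from scipy.sparse import csr_matrix


def remove_sparse_terms(X, min_df):
    """Hypothetical helper: keep columns that are non-zero in at least min_df rows."""
    X = csr_matrix(X)
    # (X != 0) is a boolean sparse matrix; summing over the rows gives
    # the document frequency of every column.
    doc_freq = np.asarray((X != 0).sum(axis=0)).ravel()
    keep = doc_freq >= min_df
    # Boolean column indexing keeps the result in CSR format.
    return X[:, keep], keep


X = csr_matrix([[7, 1.3, 0.9, 0],
                [2, 1.2, 0.8, 1],
                [5, 1.5, 0, 0]])

X_reduced, keep = remove_sparse_terms(X, min_df=3)
print(keep)                 # boolean mask over the original columns
print(X_reduced.toarray())  # only the first two columns remain
```

Returning the mask together with the reduced matrix makes it possible to apply the same column selection to a vocabulary list or to a held-out matrix.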