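I am trying to reuse scikit-learn vectorizer objects with gensim topic models. The reasons are simple: first, I already have a great deal of vectorized data; second, I prefer the interface and flexibility of scikit-learn vectorizers; third, even though topic modelling with gensim is very fast, computing its dictionaries (Dictionary()) is relatively slow in my experience.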
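Similar questions have been asked before, especially here and here, and the bridging solution is gensim's Sparse2Corpus() function, which converts a Scipy sparse matrix into a gensim corpus object.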
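To make the target format concrete: a gensim corpus is an iterable of documents, each given as a list of (feature_id, value) pairs. The sketch below mimics that shape in plain Python for a tiny dense count matrix. The helper rows_to_bow and the counts are made up for illustration and are not part of gensim; real code would call Sparse2Corpus on the scipy matrix instead.

```python
def rows_to_bow(rows):
    """Turn a documents-by-features count matrix (a list of lists) into
    gensim-style bag-of-words documents: lists of (feature_id, count)."""
    return [
        [(feature_id, count) for feature_id, count in enumerate(row) if count]
        for row in rows
    ]


# Two toy documents over three features (made-up counts).
counts = [
    [1, 0, 2],
    [0, 3, 0],
]

bow_corpus = rows_to_bow(counts)
print(bow_corpus)
```

Only integer feature ids appear in this representation; the words themselves are not carried along.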
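However, this conversion does not make use of the vocabulary_ attribute of sklearn vectorizers, which holds the mapping between words and feature ids. That mapping is needed to print the discriminant words for each topic (id2word in gensim topic models, described as "a mapping from word ids (integers) to words (strings)").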
I am aware that gensim's Dictionary objects are much more complex (and slower to compute) than scikit's vect.vocabulary_ (a simple Python dict)...
Any ideas on how to use vect.vocabulary_ as id2word in gensim models?
Some example code:
# our data
documents = [u'Human machine interface for lab abc computer applications',
u'A survey of user opinion of computer system response time',
u'The EPS user interface management system',
u'System and human system engineering testing of EPS',
u'Relation of user perceived response time to error measurement',
u'The generation of random binary unordered trees',
u'The intersection graph of paths in trees',
u'Graph minors IV Widths of trees and well quasi ordering',
u'Graph minors A survey']
from sklearn.feature_extraction.text import CountVectorizer
# compute vector space with sklearn
vect = CountVectorizer(min_df=1, ngram_range=(1, 1), max_features=25000)
corpus_vect = vect.fit_transform(documents)
# corpus_vect is a scipy sparse matrix with one row per document
print(vect.vocabulary_)
#{u'and': 1, u'minors': 20, u'generation': 9, u'testing': 32, u'iv': 15, u'engineering': 5, u'computer': 4, u'relation': 28, u'human': 11, u'measurement': 19, u'unordered': 37, u'binary': 3, u'abc': 0, u'for': 8, u'ordering': 23, u'graph': 10, u'system': 31, u'machine': 17, u'to': 35, u'quasi': 26, u'time': 34, u'random': 27, u'paths': 24, u'of': 21, u'trees': 36, u'applications': 2, u'management': 18, u'lab': 16, u'interface': 13, u'intersection': 14, u'response': 29, u'perceived': 25, u'in': 12, u'widths': 40, u'well': 39, u'eps': 6, u'survey': 30, u'error': 7, u'opinion': 22, u'the': 33, u'user': 38}
import gensim
# transform sparse matrix into gensim corpus
corpus_vect_gensim = gensim.matutils.Sparse2Corpus(corpus_vect, documents_columns=False)
lsi = gensim.models.LsiModel(corpus_vect_gensim, num_topics=4)
# I instead would like something like this line below
# lsi = gensim.models.LsiModel(corpus_vect_gensim, id2word=vect.vocabulary_, num_topics=2)
print(lsi.print_topics(2))
#['0.622*"21" + 0.359*"31" + 0.256*"38" + 0.206*"29" + 0.206*"34" + 0.197*"36" + 0.170*"33" + 0.168*"1" + 0.158*"10" + 0.147*"4"', '0.399*"36" + 0.364*"10" + -0.295*"31" + 0.245*"20" + -0.226*"38" + 0.194*"26" + 0.194*"15" + 0.194*"39" + 0.194*"23" + 0.194*"40"']
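The direction mismatch described in the question can be sketched in plain Python: vect.vocabulary_ maps word to feature id, while id2word is described as mapping id to word. The snippet below uses a few entries copied from the vocabulary output above and swaps keys and values. It assumes id2word accepts a plain dict keyed by integer id, which has not been checked here against the LsiModel call.

```python
# A few entries copied from the vect.vocabulary_ output (word -> feature id).
vocabulary = {'abc': 0, 'and': 1, 'applications': 2, 'binary': 3, 'computer': 4}

# id2word needs the opposite direction (feature id -> word),
# so swap keys and values.
id2word = {feature_id: word for word, feature_id in vocabulary.items()}

print(id2word[4])
```

If that assumption holds, inverting the full vect.vocabulary_ in the same way would give a candidate for the id2word argument in the commented-out LsiModel line.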
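dict. Shame on me... Regarding the time/performance comment: creating the dictionary in gensim for 1k documents takes about 0.9 seconds, plus a whole second to convert it to BoW and Tfidf. In comparison, scikit-learn's TfidfVectorizer does the whole job in 1.2 seconds. - emiguevara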