LDA gensim implementation: distance between two different documents

Question

5

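Edit: I found an interesting issue here. This link shows that gensim uses randomness in both the training and the inference steps, so it suggests setting a fixed seed to get the same results every time. But why do I get the same probability for every topic?

What I want to do is find the topics for each Twitter user and compute the similarity between Twitter users based on topic similarity. Is it possible in gensim to compute the same set of topics for every user, or do I have to compute a dictionary of topics and cluster each user's topics?

In general, what is the best way to compare two Twitter users based on topics extracted with a gensim model? My code is the following: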

    from collections import Counter

    from gensim import corpora, models
    from gensim.models import ldamodel


    def preprocess(user_id):
        """Return the user's word list (all tweets merged into one list)."""
        path = 'user_' + str(user_id) + '.txt'
        user_corpus(user_id, path)  # helper defined elsewhere (not shown)
        with open(path) as f:
            documents = [line for line in f]
        # Remove stop words
        with open('stoplist.txt') as f:
            stoplist = set(line.rstrip() for line in f)
        texts = [[word for word in document.lower().split() if word not in stoplist]
                 for document in documents]
        # Remove words that appear fewer than 3 times
        counts = Counter(word for text in texts for word in text)
        texts = [[word for word in text if counts[word] >= 3]
                 for text in texts]
        # Flatten the per-tweet token lists into a single word list
        return [word for text in texts for word in text]


    words1 = preprocess(14937173)
    words2 = preprocess(15386966)
    # Load the trained model and the trained dictionary
    lda = ldamodel.LdaModel.load('tmp/fashion1.lda')
    dictionary = corpora.Dictionary.load('tmp/fashion1.dict')

    # First user: a corpus holding one document with all of the user's words
    corpus = [dictionary.doc2bow(words1)]
    tfidf = models.TfidfModel(corpus)
    corpus_tfidf = tfidf[corpus]
    corpus_lda = lda[corpus_tfidf]
    list1 = list(corpus_lda)

    print(lda.show_topic(0))

    # Second user, processed the same way
    corpus2 = [dictionary.doc2bow(words2)]
    tfidf2 = models.TfidfModel(corpus2)
    corpus_tfidf2 = tfidf2[corpus2]
    corpus_lda2 = lda[corpus_tfidf2]
    list2 = list(corpus_lda2)

    # Topic distribution inferred for each user
    print(list1[0])
    print(list2[0])

When a single user's word list is used as the corpus, the topic probabilities returned for that user corpus are:

 [(0, 0.10000000000000002), (1, 0.10000000000000002), (2, 0.10000000000000002),
  (3, 0.10000000000000002), (4, 0.10000000000000002), (5, 0.10000000000000002),
  (6, 0.10000000000000002), (7, 0.10000000000000002), (8, 0.10000000000000002),
  (9, 0.10000000000000002)]

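When I use the user's list of tweets instead, I get topics computed for each tweet.

Question 2: Is the following reasonable: train the LDA model on several Twitter users, then compute the topics of each user by applying that previously trained LDA model to each user's corpus?

In the example provided, list[0] returns a topic distribution with equal probabilities of 0.1. Basically, each line of text corresponds to a different tweet. If I compute the corpus with corpus = [dictionary.doc2bow(text) for text in texts], I get probabilities for each tweet separately. On the other hand, if I use corpus = [dictionary.doc2bow(words)] as in the example, I get a single corpus made of all the user's words. In this second case, gensim returns the same probability for every topic, so for both users I get the same topic distribution.

Should a user's text corpus be a list of words or a list of sentences (a list of tweets)?

Regarding the implementation by Qi He and Jianshu Weng in the TwitterRank approach, page 264 says: "we aggregate the tweets published by individual twitterer into a big document. Thus, each document corresponds to a twitterer." OK, now I am confused: if a document is all of a user's tweets, what should the corpus contain?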
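
The aggregation described in that quote can be sketched roughly as follows: each user's tweets are merged into one token list, and the corpus becomes a list with one bag-of-words per user. This is a plain-Python illustration only. The tweets and user names are made up, and `aggregate` and `bag_of_words` are hypothetical helpers, with `bag_of_words` standing in for what `dictionary.doc2bow` would do in gensim.

```python
from collections import Counter

# Made-up tweets for two hypothetical users.
tweets_by_user = {
    "user_a": ["new summer dress", "love this dress"],
    "user_b": ["running shoes sale", "new shoes"],
}


def aggregate(tweets):
    """Merge all tweets of one user into a single token list (one document)."""
    return [word for tweet in tweets for word in tweet.lower().split()]


# One document per user.
documents = {user: aggregate(tweets) for user, tweets in tweets_by_user.items()}

# Shared vocabulary built from all users' documents.
vocab = sorted({word for doc in documents.values() for word in doc})
token2id = {word: i for i, word in enumerate(vocab)}


def bag_of_words(doc):
    """Return sorted (token_id, count) pairs for one document."""
    counts = Counter(token2id[word] for word in doc)
    return sorted(counts.items())


# The corpus has as many entries as there are users.
corpus = [bag_of_words(doc) for doc in documents.values()]
print(len(corpus))
for user, bow in zip(documents, corpus):
    print(user, bow)
```

Under this reading, the corpus used for training would hold one such entry for every user, rather than one entry per tweet or a single entry for one user.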

- Jose Ramon

2 Answers

1

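According to the official documentation, Latent Dirichlet Allocation (LDA) is a transformation from bag-of-words counts into a topic space of lower dimensionality.

You can use LSI on top of TFIDF, but not LDA. If you use TFIDF with LDA, it will generate topics that are all nearly the same; you can print them and check.

Also see https://radimrehurek.com/gensim/tut2.html.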
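
One likely reason the question's code returns exactly 0.1 for every topic is that each `TfidfModel` there is fitted on a corpus containing a single document. Assuming gensim's default weighting, where idf is log2(N / df), every term then has N = df = 1 and an idf of 0. The TF-IDF vector passed to LDA carries no weight, so the model can only return its prior, which is uniform over the 10 topics if the model uses a symmetric prior. The sketch below reproduces that arithmetic in plain Python; `idf` and `tfidf_vector` are illustrative helpers, not gensim functions, and the word lists are made up.

```python
import math
from collections import Counter


def idf(num_docs, doc_freq):
    """Inverse document frequency, log2(N / df)."""
    return math.log2(num_docs / doc_freq)


def tfidf_vector(doc, corpus):
    """Return {term: tf * idf} for doc, dropping zero-weight terms.

    doc is expected to be one of the documents in corpus, so every
    term has a document frequency of at least 1.
    """
    num_docs = len(corpus)
    doc_freq = Counter(term for d in corpus for term in set(d))
    tf = Counter(doc)
    weights = {term: tf[term] * idf(num_docs, doc_freq[term]) for term in tf}
    return {term: w for term, w in weights.items() if w != 0.0}


user_words = ["dress", "shoes", "dress", "style"]

# Corpus with a single document, as in the question: every idf is 0.
single = tfidf_vector(user_words, [user_words])

# Corpus with several documents: terms missing from some documents keep weight.
several = tfidf_vector(user_words, [user_words, ["shoes", "sale"], ["style", "week"]])

print(single)
print(several)
```

If this is what happens, passing the plain `doc2bow` counts to the model, without the TF-IDF step, would avoid the empty vector.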

- Hao Fu

- christosh · Accepted Answer

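Fere Res, please check the suggestion here. First you need to compute the LDA model from all the users, and then use the extracted vector of the unseen document, which is computed here: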

vec_bow = dictionary.doc2bow(doc.lower().split()) 
vec_lda = lda[vec_bow]

If you print it with print(vec_lda), you will get the distribution of the unseen document over the topics of the LDA model.
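
To turn two such vectors into a distance between users, one option (among several) is the Hellinger distance between the two topic distributions. The sketch below is plain Python under stated assumptions: the three sparse vectors are made-up values in the `(topic_id, probability)` format that `lda[vec_bow]` returns, `num_topics = 10` is taken from the output in the question, and `to_dense` and `hellinger` are illustrative helpers. gensim's `matutils` module offers comparable functions, such as `hellinger` and `cossim`, that could be used instead.

```python
import math


def to_dense(sparse_vec, num_topics):
    """Expand [(topic_id, prob), ...] into a dense list of length num_topics."""
    dense = [0.0] * num_topics
    for topic_id, prob in sparse_vec:
        dense[topic_id] = prob
    return dense


def hellinger(p, q):
    """Hellinger distance between two discrete distributions, in [0, 1]."""
    return math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2
                               for a, b in zip(p, q)))


num_topics = 10  # assumed from the 10-topic output in the question

# Made-up topic distributions for three hypothetical users.
vec_lda_user1 = [(0, 0.7), (3, 0.3)]
vec_lda_user2 = [(0, 0.6), (3, 0.3), (7, 0.1)]
vec_lda_user3 = [(5, 1.0)]

p1 = to_dense(vec_lda_user1, num_topics)
p2 = to_dense(vec_lda_user2, num_topics)
p3 = to_dense(vec_lda_user3, num_topics)

print(hellinger(p1, p2))  # similar topic mix: small distance
print(hellinger(p1, p3))  # no shared topics: close to the maximum of 1
```

A smaller value means the two users put their probability mass on more similar topics; converting to dense vectors first matters because gensim omits topics below its minimum probability from the sparse output.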