How to generate a term matrix for topic modeling with Guided LDA?


Did you find any other solution? I am looking for the same thing. - Neeraz Lakkapragada
1 Answer


Excerpt of the TDM class from the textmining package:

import re
import csv
import os

''' import your stemmer here '''

You can save the code below as a separate Python file and import it into your own code as a regular module, e.g. create_tdm.py:

import create_tdm

X = create_tdm.TermDocumentMatrix()
X.add_doc("your text")  # the constructor takes a tokenizer, not the document text

''' For the vocabulary '''

word2id = dict((v, idx) for idx, v in enumerate(tokens))  # enumerate a list of tokens; enumerating a raw string would index single characters

'''
Make sure the guidance words are actually present in your text, otherwise
you may hit a KeyError. You can check with:

import pandas as pd
c = pd.DataFrame(list(word2id))
'''
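To avoid the KeyError the note above warns about, here is a small sketch of building word2id from a token list and filtering out absent guidance words up front (seed_list is a hypothetical example, not part of the original answer):

```python
# Build a vocabulary id map from unique tokens (insertion order preserved)
tokens = "economy market trade growth market".split()
word2id = {w: idx for idx, w in enumerate(dict.fromkeys(tokens))}

# seed_list is a hypothetical set of guidance words
seed_list = ["market", "inflation"]
missing = [w for w in seed_list if w not in word2id]
# Drop absent seeds now rather than hitting a KeyError during fitting
seed_list = [w for w in seed_list if w in word2id]
```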

class TermDocumentMatrix(object):

"""
Class to efficiently create a term-document matrix.

The only initialization parameter is a tokenizer function, which should
take in a single string representing a document and return a list of
strings representing the tokens in the document. If the tokenizer
parameter is omitted it defaults to using textmining.simple_tokenize

Use the add_doc method to add a document (document is a string). Use the
write_csv method to output the current term-document matrix to a csv
file. You can use the rows method to return the rows of the matrix if
you wish to access the individual elements without writing directly to a
file.

"""

def __init__(self, tokenizer=simple_tokenize):
    """Initialize with tokenizer to split documents into words."""
    # Set tokenizer to use for tokenizing new documents
    self.tokenize = tokenizer
    # The term document matrix is a sparse matrix represented as a
    # list of dictionaries. Each dictionary contains the word
    # counts for a document.
    self.sparse = []
    # Keep track of the number of documents containing the word.
    self.doc_count = {}

def add_doc(self, document):
    """Add document to the term-document matrix."""
    # Split document up into list of strings
    words = self.tokenize(document)
    # Count word frequencies in this document
    word_counts = {}
    for word in words:
        word_counts[word] = word_counts.get(word, 0) + 1
    # Add word counts as new row to sparse matrix
    self.sparse.append(word_counts)
    # Add to total document count for each word
    for word in word_counts:
        self.doc_count[word] = self.doc_count.get(word, 0) + 1

def rows(self, cutoff=2):
    """Helper function that returns rows of term-document matrix."""
    # Get master list of words that meet or exceed the cutoff frequency
    words = [word for word in self.doc_count \
      if self.doc_count[word] >= cutoff]
    # Return header
    yield words
    # Loop over rows
    for row in self.sparse:
        # Get word counts for all words in master list. If a word does
        # not appear in this document it gets a count of 0.
        data = [row.get(word, 0) for word in words]
        yield data

def write_csv(self, filename, cutoff=2):
    """
    Write term-document matrix to a CSV file.

    filename is the name of the output file (e.g. 'mymatrix.csv').
    cutoff is an integer that specifies only words which appear in
    'cutoff' or more documents should be written out as columns in
    the matrix.

    """
    f = csv.writer(open(filename, 'w', newline=''))  # 'w' with newline='' for Python 3; the original 'wb' mode is Python 2
    for row in self.rows(cutoff=cutoff):
        f.writerow(row)
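A minimal end-to-end sketch of how the rows() output becomes the documents-by-terms count matrix that GuidedLDA expects. The whitespace tokenizer below is a stand-in assumption for textmining.simple_tokenize, and the inline counting mirrors what add_doc and rows() do above:

```python
import re

def simple_tokenize(document):
    # Stand-in for textmining.simple_tokenize: lowercase runs of letters
    return re.findall(r"[a-z']+", document.lower())

docs = ["the cat sat on the mat",
        "the dog sat on the log",
        "cats and dogs run"]

# Mirror add_doc: one word-count dict per document, plus document frequencies
sparse, doc_count = [], {}
for doc in docs:
    counts = {}
    for w in simple_tokenize(doc):
        counts[w] = counts.get(w, 0) + 1
    sparse.append(counts)
    for w in counts:
        doc_count[w] = doc_count.get(w, 0) + 1

# Mirror rows(cutoff=2): keep only words appearing in at least 2 documents
words = [w for w in doc_count if doc_count[w] >= 2]
X = [[row.get(w, 0) for w in words] for row in sparse]
# X is the term matrix (one row per document, one column per kept word)
```

X, converted to a NumPy integer array, is the shape of input that guided/seeded LDA implementations take alongside the word2id vocabulary.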

How do we do this for a pandas column containing text? - ds_user
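For a pandas column, as asked above, one hedged sketch is to run the same counting over the Series; the column name text and the plain whitespace tokenizer are assumptions for illustration:

```python
import pandas as pd

df = pd.DataFrame({"text": ["guided lda topic",
                            "topic model guided",
                            "seed words"]})

# Same counting as add_doc, applied to each cell of the column
rows, doc_count = [], {}
for doc in df["text"]:
    counts = {}
    for w in str(doc).lower().split():
        counts[w] = counts.get(w, 0) + 1
    rows.append(counts)
    for w in counts:
        doc_count[w] = doc_count.get(w, 0) + 1

# Cutoff of 2 documents, as in rows(); sorted for a stable column order
vocab = sorted(w for w in doc_count if doc_count[w] >= 2)
X = pd.DataFrame([[c.get(w, 0) for w in vocab] for c in rows], columns=vocab)
```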
