自动标注LDA生成的主题

Question

自动标注LDA生成的主题

6

我正在尝试对客户反馈进行分类，并在Python中运行了一个LDA模型，得到了以下10个主题的输出：

(0, u'0.559*"delivery" + 0.124*"area" + 0.018*"mile" + 0.016*"option" + 0.012*"partner" + 0.011*"traffic" + 0.011*"hub" + 0.011*"thanks" + 0.010*"city" + 0.009*"way"')
(1, u'0.397*"package" + 0.073*"address" + 0.055*"time" + 0.047*"customer" + 0.045*"apartment" + 0.037*"delivery" + 0.031*"number" + 0.026*"item" + 0.021*"support" + 0.018*"door"')
(2, u'0.190*"time" + 0.127*"order" + 0.113*"minute" + 0.075*"pickup" + 0.074*"restaurant" + 0.031*"food" + 0.027*"support" + 0.027*"delivery" + 0.026*"pick" + 0.018*"min"')
(3, u'0.072*"code" + 0.067*"gps" + 0.053*"map" + 0.050*"street" + 0.047*"building" + 0.043*"address" + 0.042*"navigation" + 0.039*"access" + 0.035*"point" + 0.028*"gate"')
(4, u'0.434*"hour" + 0.068*"time" + 0.034*"min" + 0.032*"amount" + 0.024*"pay" + 0.019*"gas" + 0.018*"road" + 0.017*"today" + 0.016*"traffic" + 0.014*"load"')
(5, u'0.245*"route" + 0.154*"warehouse" + 0.043*"minute" + 0.039*"need" + 0.039*"today" + 0.026*"box" + 0.025*"facility" + 0.025*"bag" + 0.022*"end" + 0.020*"manager"')
(6, u'0.371*"location" + 0.110*"pick" + 0.097*"system" + 0.040*"im" + 0.038*"employee" + 0.022*"evening" + 0.018*"issue" + 0.015*"request" + 0.014*"while" + 0.013*"delivers"')
(7, u'0.182*"schedule" + 0.181*"please" + 0.059*"morning" + 0.050*"application" + 0.040*"payment" + 0.026*"change" + 0.025*"advance" + 0.025*"slot" + 0.020*"date" + 0.020*"tomorrow"')
(8, u'0.138*"stop" + 0.110*"work" + 0.062*"name" + 0.055*"account" + 0.046*"home" + 0.043*"guy" + 0.030*"address" + 0.026*"city" + 0.025*"everything" + 0.025*"feature"')

有没有一种自动标记它们的方法？我有一个手动标记反馈的csv文件，但是我不想自己提供这些标签。我希望模型创建标签。这可能吗？

- Arman

请参考此处的类似问题（http://stackoverflow.com/questions/33921808/topic-modelling-assign-human-readable-labels-to-topic/33943104#33943104）。 - Ettore Rizza

可能是主题建模-为主题分配可读标签的重复问题。 - Ettore Rizza

1个回答

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- Sam H. · Accepted Answer

这里的评论链接到另一个SO答案，该答案链接到a paper。假设您希望尽最少的努力使其工作。这是一种MVP样式的解决方案，适用于我：搜索Google以查找相关术语，然后在响应中查找关键字。

以下是一些可行但有点取巧的代码：

pip install cssselect

那么

from urllib.parse import urlencode, urlparse, parse_qs
from lxml.html import fromstring
from requests import get
from collections import Counter


def get_srp_text(search_term):
    raw = get(f"https://www.google.com/search?q={topic_terms}").text
    page = fromstring(raw)


    blob = ""

    for result in page.cssselect("a"):
        for res in result.findall("div"):
            blob += ' '
            blob += res.text if res.text else " "
            blob += ' '
    return blob


def blob_cleaner(blob):
    clean_blob = blob.replace(r'[\/,\(,\),\:,_,-,\-]', ' ')
    return ''.join(e for e in blob if e.isalnum() or e.isspace())


def get_name_from_srp_blob(clean_blob):
    blob_tokens = list(filter(bool, map(lambda x: x if len(x) > 2 else '', clean_blob.split(' '))))
    c = Counter(blob_tokens)
    most_common = c.most_common(10)

    name = f"{most_common[0][0]}-{most_common[1][0]}"
    return name

pipeline = lambda x: get_name_from_srp_blob(blob_cleaner(get_srp_text(x)))

然后您可以从模型中获取主题词，例如：

topic_terms = "delivery area mile option partner traffic hub thanks city way"

name = pipeline(topic_terms)
print(name)

>>> City-Transportation

并且

topic_terms = "package address time customer apartment delivery number item support door"

name = pipeline(topic_terms)
print(name)

>>> Parcel-Package

你可以大大改进这个。例如，你可以使用词性标记仅查找最常见的名词，然后将它们用作名称。或者找到最常见的形容词和名词，将名称设为“形容词+名词”。更好的方法是，你可以从链接的网站获取文本，然后运行YAKE提取关键字。

无论如何，这演示了一种简单的自动命名聚类的方法，而不是直接使用机器学习（尽管Google肯定正在使用它来生成搜索结果，所以你从中受益）。