Clustering text documents using scikit-learn kmeans in Python

26

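I need to use scikit-learn's kMeans to cluster text documents. The example code works fine as is, but it takes some 20newsgroups data as input. I want to use the same code to cluster a list of documents like the one below: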

documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]

What changes do I need to make to the kMeans example code so that it accepts this list as input? (Simply setting "dataset = documents" does not work.)


The link you provided is not accessible. - Rocketq
1 Answer

77

Here is a simpler example:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]

Vectorize the text, i.e. convert the strings to numeric features:

vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(documents)

Cluster the documents:

true_k = 2
model = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)
model.fit(X)

Print the top terms per cluster:

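print("Top terms per cluster:")
# Sort each centroid's term weights in descending order
order_centroids = model.cluster_centers_.argsort()[:, ::-1]
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
terms = vectorizer.get_feature_names_out()
for i in range(true_k):
    print("Cluster %d:" % i, end="")
    for ind in order_centroids[i, :10]:
        print(" %s" % terms[ind], end="")
    print()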
If you want a more visual idea of what this looks like, see this answer.
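To see which cluster each document landed in, and to assign a new text to one of the clusters, something along these lines should work. This is a sketch rather than part of the original answer: `random_state=0` is added only so repeated runs give the same split, and the new sentence is made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(documents)

# random_state is an assumption added here for repeatable results
model = KMeans(n_clusters=2, init="k-means++", max_iter=100,
               n_init=1, random_state=0)
model.fit(X)

# labels_ holds one cluster index per input document, in input order
for label, doc in zip(model.labels_, documents):
    print(label, doc)

# A new text must go through the already fitted vectorizer
# (transform, not fit_transform) so it shares the same vocabulary
new_doc = ["Graph theory and trees"]  # made-up example sentence
new_label = model.predict(vectorizer.transform(new_doc))[0]
print("New document assigned to cluster", new_label)
```

Which index (0 or 1) a given topic receives is arbitrary and can change with the random seed.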

Thanks, but it gives me a syntax error on the last print commands, ='' and print() ... how do I get it to work? :s - Nabila Shahid
1
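Oh, that's because I'm on Python 3; I've updated my answer. - elyase
@elyase: How can this code be modified to get the central sentence of each cluster? - Crista23
@Crista23, that isn't possible with a direct implementation. The sentences are first converted to numeric vectors (a bag-of-words representation) and then clustered, but that conversion does not preserve word order (among other problems), so there is no way to get back from a centroid vector to an original sentence. You would need to get creative to obtain "something" from the cluster centers. - elyase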
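One possible way to get "something" from the centers, offered here as an assumption and not as elyase's method, is to report the existing document whose TF-IDF vector lies nearest to each centroid. The sketch below uses `KMeans.transform`, which returns each document's distance to every cluster center; `random_state=0` is again added only for repeatability.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(documents)

model = KMeans(n_clusters=2, init="k-means++", max_iter=100,
               n_init=1, random_state=0)
model.fit(X)

# distances[d, c] is the Euclidean distance from document d to centroid c
distances = model.transform(X)

# For each centroid (column), pick the row index of the nearest document
closest = distances.argmin(axis=0)

for cluster, doc_index in enumerate(closest):
    print("Cluster %d: %s" % (cluster, documents[doc_index]))
```

The result is an existing sentence that sits near the center, not a sentence reconstructed from the centroid itself.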
In that case it is not clear how sentences can be clustered. Word clustering works well in this example, but sentence clustering does not. - Timur Nurlygayanov
@elyase, how can I store the results? mydict = {} for k in range(2,10): kmeans = KMeans(n_clusters=k, max_iter=300).fit(x) labels = kmeans.labels_ label_df = pd.DataFrame(labels.tolist(), columns=['class']) new_df = pd.concat((harsh, label_df), axis=1) #new_df.to_csv("result{}.csv".format(k)) order_centroids = kmeans.cluster_centers_.argsort()[:, ::-1] for i in range(k): for ind in order_centroids[i,:12]: mydict.update({i:(terms[ind])}) - krits
