How can I plot KMeans text clustering results with matplotlib?

5
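I have the following code that clusters some sample text with scikit-learn.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

train = ["is this good?", "this is bad", "some other text here", "i am hero", "blue jeans", "red carpet", "red dog", "blue sweater", "red hat", "kitty blue"]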

vect = TfidfVectorizer()
X = vect.fit_transform(train)
clf = KMeans(n_clusters=3)
clf.fit(X)
centroids = clf.cluster_centers_

plt.scatter(centroids[:, 0], centroids[:, 1], marker='x', s=80, linewidths=5)
plt.show()

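What I can't figure out is how to plot the clustering results. X is a csr_matrix. What I want is an (x, y) coordinate for each result so that I can plot it. Thanks.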
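To make the problem concrete, here is a small editorial sketch (not part of the original question) that inspects the shapes involved. The names terms and first_two are illustrative, and n_init=10 and random_state=0 are assumed settings. It suggests that centroids[:, 0] and centroids[:, 1] are simply the tf-idf weights of the first two vocabulary terms rather than plot coordinates.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

train = ["is this good?", "this is bad", "some other text here", "i am hero",
         "blue jeans", "red carpet", "red dog", "blue sweater", "red hat",
         "kitty blue"]

vect = TfidfVectorizer()
X = vect.fit_transform(train)  # sparse matrix: one row per document
clf = KMeans(n_clusters=3, random_state=0, n_init=10).fit(X)
centroids = clf.cluster_centers_  # dense array: one row per cluster

# vocabulary_ maps each term to its column index; order terms by that index
terms = sorted(vect.vocabulary_, key=vect.vocabulary_.get)
first_two = terms[:2]

print(X.shape)          # (number of documents, number of vocabulary terms)
print(centroids.shape)  # (number of clusters, number of vocabulary terms)
print(first_two)        # the terms behind columns 0 and 1
```

Each centroid has one value per vocabulary term, so some projection down to two dimensions is needed before plotting.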
2 Answers

8

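Your centroids come out of the tf-idf space as a 3 x 17 array or so (one row per cluster, one column per vocabulary term; the exact column count depends on the tokenizer settings), so you need some kind of projection or dimensionality reduction to get centroids in two dimensions. You have a few options; here is an example using t-SNE: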

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import TSNE

train = ["is this good?", "this is bad", "some other text here", "i am hero", "blue jeans", "red carpet", "red dog",
     "blue sweater", "red hat", "kitty blue"]

vect = TfidfVectorizer()  
X = vect.fit_transform(train)
random_state = 1
clf = KMeans(n_clusters=3, random_state = random_state)
data = clf.fit(X)
centroids = clf.cluster_centers_

tsne_init = 'pca'  # could also be 'random'
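# recent scikit-learn releases require perplexity < n_samples; only the
# 3 centroids are embedded here, so the original value of 20.0 is lowered
tsne_perplexity = 2.0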
tsne_early_exaggeration = 4.0
tsne_learning_rate = 1000
model = TSNE(n_components=2, random_state=random_state, init=tsne_init, perplexity=tsne_perplexity,
         early_exaggeration=tsne_early_exaggeration, learning_rate=tsne_learning_rate)

transformed_centroids = model.fit_transform(centroids)
print(transformed_centroids)
plt.scatter(transformed_centroids[:, 0], transformed_centroids[:, 1], marker='x')
plt.show()

In your case, if you initialize t-SNE with PCA you get widely spaced centroids; if you use random initialization you get tiny centroids and an uninteresting picture.
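As an editorial sketch of one of the other options (not from the original answer), a linear projection such as TruncatedSVD can stand in for t-SNE. It accepts the sparse tf-idf matrix directly and has a separate transform method, so it can be fitted on the documents and then applied to the centroids. The names docs_2d, centroids_2d and plot_projection are illustrative, and n_init=10 is an assumed setting.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

train = ["is this good?", "this is bad", "some other text here", "i am hero",
         "blue jeans", "red carpet", "red dog", "blue sweater", "red hat",
         "kitty blue"]

X = TfidfVectorizer().fit_transform(train)
clf = KMeans(n_clusters=3, random_state=1, n_init=10).fit(X)

# Fit the projection on the documents, then reuse it for the centroids
svd = TruncatedSVD(n_components=2, random_state=1)
docs_2d = svd.fit_transform(X)                      # one (x, y) row per document
centroids_2d = svd.transform(clf.cluster_centers_)  # one (x, y) row per centroid


def plot_projection(docs_2d, centroids_2d, labels):
    """Scatter the documents colored by cluster label, centroids as crosses."""
    import matplotlib.pyplot as plt
    plt.scatter(docs_2d[:, 0], docs_2d[:, 1], c=labels, marker='o')
    plt.scatter(centroids_2d[:, 0], centroids_2d[:, 1], c='black',
                marker='x', s=80)
    plt.show()
```

Calling plot_projection(docs_2d, centroids_2d, clf.labels_) draws the picture. A linear projection will generally look different from a t-SNE embedding, so treat this as an alternative view rather than a drop-in replacement.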


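I am wondering whether it is also possible to show, on the same scatter plot, the documents in the corpus that lie close to each centroid? @Mike Delong - Andrea Moro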
1
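@AndreaMoro Yes, it looks like that is possible. The t-SNE model does not have separate fit and transform methods; if it did, we could fit on the original data X, transform both the original data and the centroids, and then just plot them together with different markers. So we need to take the centroids from the k-means model, concatenate them with X, pass the whole thing through the t-SNE model together, and then plot the X rows and the centroid rows separately. It looks like the t-SNE model preserves row order, so this is simple. - Mike DeLong
Thanks. I am not sure I understand the concatenation part. Won't doing that mix up the information? Perhaps you could amend your answer to cover this? - Andrea Moro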
1
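@AndreaMoro See below. - Mike DeLong
The random state has always been set in my tests, just to avoid that problem. Could the issue be with how I am using the sklearn package? If you have time, I have tried this and posted another question with some code snippets; see here: https://datascience.stackexchange.com/questions/79628/how-to-plot-centroids-and-clusters-resulting-from-a-kmean-model-based-on-a-text - Andrea Moro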

4

Here is a longer, better answer with more data:

import matplotlib.pyplot as plt
from numpy import concatenate
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import TSNE

train = [
    'In 1917 a German Navy flight crashed at/near Off western Denmark with 18 aboard',
    # 'There were 18 passenger/crew fatalities',
    'In 1942 a Deutsche Lufthansa flight crashed at an unknown location with 4 aboard',
    # 'There were 4 passenger/crew fatalities',
    'In 1946 Trans Luxury Airlines flight 878 crashed at/near Moline, Illinois with 25 aboard',
    # 'There were 2 passenger/crew fatalities',
    'In 1947 a Slick Airways flight crashed at/near Hanksville, Utah with 3 aboard',
    'There were 3 passenger/crew fatalities',
    'In 1949 a Royal Canadian Air Force flight crashed at/near Near Bigstone Lake, Manitoba with 21 aboard',
    'There were 21 passenger/crew fatalities',
    'In 1952 a Airwork flight crashed at/near Off Trapani, Italy with 57 aboard',
    'There were 7 passenger/crew fatalities',
    'In 1963 a Aeroflot flight crashed at/near Near Leningrad, Russia with 52 aboard',
    'In 1966 a Alaska Coastal Airlines flight crashed at/near Near Juneau, Alaska with 9 aboard',
    'There were 9 passenger/crew fatalities',
    'In 1986 a Air Taxi flight crashed at/near Frenchglen, Oregon with 6 aboard',
    'There were 3 passenger/crew fatalities',
    'In 1989 a Air Taxi flight crashed at/near Gold Beach, Oregon with 3 aboard',
    'There were 18 passenger/crew fatalities',
    'In 1990 a Republic of China Air Force flight crashed at/near Yunlin, Taiwan with 18 aboard',
    'There were 10 passenger/crew fatalities',
    'In 1992 a Servicios Aereos Santa Ana flight crashed at/near Colorado, Bolivia with 10 aboard',
    'There were 44 passenger/crew fatalities',
    'In 1994 Royal Air Maroc flight 630 crashed at/near Near Agadir, Morocco with 44 aboard',
    'There were 10 passenger/crew fatalities',
    'In 1995 Atlantic Southeast Airlines flight 529 crashed at/near Near Carrollton, GA with 29 aboard',
    'There were 44 passenger/crew fatalities',
    'In 1998 a Lumbini Airways flight crashed at/near Near Ghorepani, Nepal with 18 aboard',
    'There were 18 passenger/crew fatalities',
    'In 2004 a Venezuelan Air Force flight crashed at/near Near Maracay, Venezuela with 25 aboard',
    'There were 25 passenger/crew fatalities',
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(train)
n_clusters = 2
random_state = 1
clf = KMeans(n_clusters=n_clusters, random_state=random_state)
data = clf.fit(X)
centroids = clf.cluster_centers_
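# we want to transform the rows and the centroids together;
# toarray() yields a plain ndarray (todense() returns np.matrix, which
# newer scikit-learn releases reject)
everything = concatenate((X.toarray(), centroids))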

tsne_init = 'pca'  # could also be 'random'
tsne_perplexity = 20.0
tsne_early_exaggeration = 4.0
tsne_learning_rate = 10
model = TSNE(n_components=2, random_state=random_state, init=tsne_init,
    perplexity=tsne_perplexity,
    early_exaggeration=tsne_early_exaggeration, learning_rate=tsne_learning_rate)

transformed_everything = model.fit_transform(everything)
print(transformed_everything)
plt.scatter(transformed_everything[:-n_clusters, 0], transformed_everything[:-n_clusters, 1], marker='x')
plt.scatter(transformed_everything[-n_clusters:, 0], transformed_everything[-n_clusters:, 1], marker='o')

plt.show()

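There are two obvious clusters in the data: one of crash descriptions and one of fatality summaries. It is easy to comment out lines and tweak the cluster sizes up and down. As written, the code should show two blue clusters, one larger and one smaller, with two orange centroids. There are more data items than markers: some rows of data are transformed to the same point in space. Finally, a smaller t-SNE learning rate seems to produce tighter clusters.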
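One plausible reason for the overlapping markers, offered here as an editorial sketch rather than as part of the original answer: with default settings, TfidfVectorizer keeps only tokens of two or more characters, so single-digit counts such as "3" and "7" are dropped and several fatality sentences collapse onto the same tf-idf row. Repeated sentences are identical rows as well. The small docs list below is a hand-picked subset of the data above, and the names dense and unique_rows are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    'There were 3 passenger/crew fatalities',
    'There were 7 passenger/crew fatalities',   # differs only by a 1-char token
    'There were 18 passenger/crew fatalities',
    'There were 18 passenger/crew fatalities',  # exact repeat
    'In 1947 a Slick Airways flight crashed at/near Hanksville, Utah with 3 aboard',
]

dense = TfidfVectorizer().fit_transform(docs).toarray()

# Round before comparing so tiny floating-point noise cannot split equal rows
unique_rows = np.unique(np.round(dense, 10), axis=0)
n_docs, n_unique = dense.shape[0], unique_rows.shape[0]

print(n_docs, "documents,", n_unique, "distinct tf-idf rows")
```

Rows that are identical going into the embedding have no features to tell them apart, which would be consistent with fewer visible markers than documents.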

I have 18 clusters and would like to make use of the example above. Will this work for my use case, or would you suggest any changes? @Mike DeLong - vishal singh
