I am using the gensim package in Python to load the pre-trained Google word2vec dataset. I then want to run k-means on the word vectors to find meaningful clusters, and to find a representative word for each cluster. My idea was to use the word whose vector is closest to the cluster centroid as that cluster's representative, but the results of my experiment were not ideal, and I am not sure whether this is a good approach.
My example code is as follows:
import gensim
import numpy as np
import pandas as pd
from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics import pairwise_distances_argmin_min
model = gensim.models.KeyedVectors.load_word2vec_format('/home/Desktop/GoogleNews-vectors-negative300.bin', binary=True)
K=3
words = ["ship", "car", "truck", "bus", "vehicle", "bike", "tractor", "boat",
         "apple", "banana", "fruit", "pear", "orange", "pineapple", "watermelon",
         "dog", "pig", "animal", "cat", "monkey", "snake", "tiger", "rat", "duck", "rabbit", "fox"]
NumOfWords = len(words)
# construct the n-dimensional array of input data; each row is a word vector
x = np.zeros((NumOfWords, model.vector_size))
for i in range(NumOfWords):
    x[i, :] = model[words[i]]
# train the k-means model
classifier = MiniBatchKMeans(n_clusters=K, random_state=1, max_iter=100)
classifier.fit(x)
# check whether the words are clustered correctly
print(classifier.predict(x))
# find the index and the distance of the closest points from x to each class centroid
close = pairwise_distances_argmin_min(classifier.cluster_centers_, x, metric='euclidean')
index_closest_points = close[0]
distance_closest_points = close[1]
for i in range(K):
    print("The closest word to the centroid of class {0} is {1}, the distance is {2}".format(
        i, words[index_closest_points[i]], distance_closest_points[i]))
Here is the output:
[2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0]
The closest word to the centroid of class 0 is rabbit, the distance is 1.578625818679259
The closest word to the centroid of class 1 is fruit, the distance is 1.8351978219013796
The closest word to the centroid of class 2 is car, the distance is 1.6586030662247868
In the code I have three categories of words: vehicles, fruits, and animals. The output shows that k-means clusters all three categories correctly, but the representative words derived with the centroid method are not great: for class 0 I would like to see "animal" but it gives "rabbit", and for class 2 I would like to see "vehicle" but it returns "car". Any help or suggestions for finding a good representative word for each cluster would be greatly appreciated.
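To make the "closest word to the centroid" idea concrete without needing the GoogleNews file, here is a self-contained toy version of what I am doing: the vocabulary and 3-d vectors below are made up (not real word2vec embeddings), and it uses cosine similarity to the centroid rather than the euclidean distance in the code above.

```python
import numpy as np

# Toy stand-ins for word2vec embeddings (made-up vocabulary and vectors).
vocab = ["vehicle", "car", "truck", "bus"]
vecs = np.array([
    [1.0, 0.1, 0.0],
    [0.9, 0.5, 0.1],
    [0.8, 0.0, 0.6],
    [0.7, 0.4, 0.4],
])

# Centroid of the cluster (here simply the mean of all members,
# which is what k-means converges to for a cluster).
centroid = vecs.mean(axis=0)

# Cosine similarity of every vocabulary vector to the centroid;
# the arg-max is taken as the "representative" word.
sims = vecs @ centroid / (np.linalg.norm(vecs, axis=1) * np.linalg.norm(centroid))
representative = vocab[int(np.argmax(sims))]
print(representative)
```

As in my real experiment, the word picked this way is just the most "typical" member of the cluster, not necessarily the category word I was hoping for.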