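Having learned so much from Stack Overflow, I finally have a chance to give something back! An approach different from those offered so far is to relabel the clusterings to maximize alignment, after which comparison becomes easy. For example, if one algorithm assigns labels to a set of six items as L1=[0,0,1,1,2,2] and another assigns L2=[2,2,0,0,1,1], you would want these two labelings to count as equivalent, since L1 and L2 split the items into clusters in essentially the same way. This approach relabels L2 to maximize alignment, and in the example above it results in L2==L1.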
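To make the idea of "alignment" concrete, here is a minimal sketch for the six-item example, using only the standard library. It counts how often each pair of labels co-occurs and renames each L2 label to the L1 label it overlaps with most. This is only an illustration and not the method from the thesis cited in this answer: the names `overlap`, `relabel` and `aligned_L2` are made up for the example, and this simple rule does not guarantee a one-to-one mapping when the clusterings genuinely disagree.

```python
from collections import Counter

L1 = [0, 0, 1, 1, 2, 2]
L2 = [2, 2, 0, 0, 1, 1]

# overlap[(a, b)] = number of items labeled a in L1 and b in L2
overlap = Counter(zip(L1, L2))

# For each label b used in L2, pick the L1 label a with the largest overlap.
# (Illustration only: nothing here forces two L2 labels to map to different
# L1 labels, which a real alignment has to guarantee.)
relabel = {}
for b in set(L2):
    relabel[b] = max(set(L1), key=lambda a: overlap[(a, b)])

aligned_L2 = [relabel[b] for b in L2]
print(aligned_L2)  # [0, 0, 1, 1, 2, 2]
```

Here every L2 cluster overlaps with exactly one L1 cluster, so the renaming is unambiguous and the relabeled L2 is identical to L1.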
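I found a solution to this problem in "Menéndez, Héctor D. A genetic approach to the graph and spectral clustering problem. MS thesis. 2012.", and below is an implementation in Python using numpy. I am relatively new to Python, so there may be better implementations, but I think this gets the job done: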

    import numpy as np

    def alignClusters(clstr1, clstr2):
        """Given 2 cluster assignments, this function will rename the second to
        maximize alignment of elements within each cluster. This method is
        described in Menéndez, Héctor D. A genetic approach to the graph and
        spectral clustering problem. MS thesis. 2012. (Assumes cluster labels
        are consecutive integers starting with zero)

        INPUTS:
        clstr1 - The first clustering assignment
        clstr2 - The second clustering assignment

        OUTPUTS:
        clstr2_temp - The second clustering assignment with clusters renumbered to
        maximize alignment with the first clustering assignment
        """
        # Accept lists as well as arrays; elementwise == needs numpy arrays
        clstr1 = np.asarray(clstr1)
        clstr2 = np.asarray(clstr2)
        K = int(max(clstr1.max(), clstr2.max())) + 1

        # simdist[i, j] averages the share of cluster i (from clstr1) found in
        # cluster j (from clstr2) and the share of cluster j found in cluster i
        simdist = np.zeros((K, K))
        for i in range(K):
            for j in range(K):
                dcix = clstr1 == i
                dcjx = clstr2 == j
                dd = float(np.dot(dcix.astype(int), dcjx.astype(int)))
                if dd > 0:  # skipping zero overlap also avoids dividing by zero
                    simdist[i, j] = (dd / np.sum(dcix) + dd / np.sum(dcjx)) / 2

        # Greedily match the most similar remaining pair (x, y), then retire
        # row x and column y so that each label is used exactly once
        new_label = np.zeros(K, dtype=int)  # new_label[y] replaces label y in clstr2
        for _ in range(K):
            x, y = np.unravel_index(np.argmax(simdist), simdist.shape)
            new_label[y] = x
            simdist[x, :] = -1  # -1, not 0, so a retired entry never wins argmax
            simdist[:, y] = -1

        clstr2_temp = new_label[clstr2]
        return clstr2_temp

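Once the second assignment has been relabeled, the "comparison becomes easy" part can be as simple as counting matching positions. The sketch below is only an illustration: `agreement` is a hypothetical helper name, and `L2_aligned` is written out by hand as the result the relabeling is meant to produce for the example above, so that the snippet runs on its own.

```python
def agreement(labels_a, labels_b):
    """Fraction of items that carry the same label in both assignments."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both assignments must cover the same items")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)

L1 = [0, 0, 1, 1, 2, 2]
L2 = [2, 2, 0, 0, 1, 1]          # same partition, different numbering
L2_aligned = [0, 0, 1, 1, 2, 2]  # what the relabeling is meant to produce for L2

print(agreement(L1, L2))          # 0.0
print(agreement(L1, L2_aligned))  # 1.0
```

Comparing the raw labels reports no agreement at all even though the two partitions are identical, which is exactly the problem the relabeling step removes.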
diff, but I am concerned that it lacks the flexibility needed to ignore the cluster numbering. - Jen