Spectral clustering a graph in Python

34

I would like to cluster a graph in Python using spectral clustering.

Spectral clustering is a more general technique that can be applied not only to graphs but also to images, or any sort of data; however, it is considered an exceptional graph clustering technique. Unfortunately, I can't find examples of spectral clustering of graphs in Python online.

I would appreciate some direction on how to tackle this. If someone can help me figure it out, I can add the documentation to Scikit Learn.

Comments:


Looking at this source code shows that SpectralClustering is an object-oriented wrapper that, among other things, calls the spectral_clustering function. https://stackoverflow.com/a/55720891/6509615 (a short sketch of calling that function directly follows these comments) - rlchqrd
Is there a way to do this on weighted graphs? - Daniel Pop
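
(Not part of the original question, just a hedged illustration of the two comments above: the spectral_clustering function can be called directly instead of the SpectralClustering estimator, and a weighted adjacency matrix works as the precomputed affinity. The small graph below is made up for illustration.)

import numpy as np
import networkx as nx
from sklearn.cluster import spectral_clustering  # the function wrapped by SpectralClustering

# Made-up weighted graph: two tight triangles joined by one weak edge
G = nx.Graph()
G.add_weighted_edges_from([(0, 1, 5.0), (1, 2, 4.0), (0, 2, 3.0),
                           (3, 4, 5.0), (4, 5, 4.0), (3, 5, 3.0),
                           (2, 3, 0.1)])

# Edge weights become the entries of the affinity (similarity) matrix
affinity = nx.to_numpy_array(G)

# Call the function directly instead of the estimator
labels = spectral_clustering(affinity, n_clusters=2, random_state=0)
print(labels)  # cluster ids may be permuted, e.g. [0 0 0 1 1 1]
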
3 Answers

39

Without much experience with spectral clustering, and just going by the docs (skip to the end for the results!):

Code:

import numpy as np
import networkx as nx
from sklearn.cluster import SpectralClustering
from sklearn import metrics
np.random.seed(1)

# Get your mentioned graph
G = nx.karate_club_graph()

# Get ground-truth: club-labels -> transform to 0/1 np-array
#     (possible overcomplicated networkx usage here)
gt_dict = nx.get_node_attributes(G, 'club')
gt = [gt_dict[i] for i in G.nodes()]
gt = np.array([0 if i == 'Mr. Hi' else 1 for i in gt])

# Get adjacency-matrix as numpy-array
adj_mat = nx.to_numpy_array(G)  # nx.to_numpy_matrix() was removed in networkx 3.0

print('ground truth')
print(gt)

# Cluster
sc = SpectralClustering(2, affinity='precomputed', n_init=100)
sc.fit(adj_mat)

# Compare ground-truth and clustering-results
print('spectral clustering')
print(sc.labels_)
print('just for better-visualization: invert clusters (permutation)')
print(np.abs(sc.labels_ - 1))

# Calculate some clustering metrics
print(metrics.adjusted_rand_score(gt, sc.labels_))
print(metrics.adjusted_mutual_info_score(gt, sc.labels_))

Output:

ground truth
[0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 0 0 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1]
spectral clustering
[1 1 0 1 1 1 1 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
just for better-visualization: invert clusters (permutation)
[0 0 1 0 0 0 0 1 1 1 0 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
0.204094758281
0.271689477828

General idea:

Starting from the data and task introduced here:

The nodes in the graph represent the 34 members of a university karate club. (Zachary was a sociologist, and one of the members.) An edge between two nodes indicates that the two members spent significant time together outside normal club meetings. The dataset is interesting because, while Zachary was collecting his data, a dispute arose in the karate club and it split into two factions: one led by "Mr. Hi" and one led by "John A". It turns out that using only the connectivity information (the edges), it is possible to recover the two factions.

Using sklearn & spectral clustering to tackle this:

If the affinity matrix is the adjacency matrix of a graph, this method can be used to find normalized graph cuts.

Normalized graph cuts are described there as:

Find two disjoint partitions A and B of the vertices V of the graph, so that A ∪ B = V and A ∩ B = ∅

Given a similarity measurement w(i, j) between two vertices (e.g. their connectivity), the cut value (and its normalized version) is defined as: cut(A, B) = Σ_{u ∈ A, v ∈ B} w(u, v)

...

We seek to minimize the disassociation between the groups A and B while maximizing the association within each group.
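
(A small sketch, not part of the original answer, just to make the definition concrete: the cut value between two node sets can be computed directly from the adjacency matrix. The partition used here is the ground-truth split of the karate-club graph from the code above, and the normalization follows the standard Shi-Malik normalized-cut definition.)

import numpy as np
import networkx as nx

G = nx.karate_club_graph()
W = nx.to_numpy_array(G)  # adjacency matrix = similarity w(u, v)

# Ground-truth split into the two factions, as in the code above
club = nx.get_node_attributes(G, 'club')
A = [n for n in G.nodes() if club[n] == 'Mr. Hi']
B = [n for n in G.nodes() if club[n] != 'Mr. Hi']

# cut(A, B) = sum of w(u, v) over u in A, v in B
cut_value = W[np.ix_(A, B)].sum()

# Normalized cut: cut(A,B)/assoc(A,V) + cut(A,B)/assoc(B,V)
assoc_A = W[A, :].sum()
assoc_B = W[B, :].sum()
ncut = cut_value / assoc_A + cut_value / assoc_B

print(cut_value, ncut)
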

Sounds good. So we create the adjacency matrix (nx.to_numpy_array(G)) and set the parameter affinity to precomputed (because our adjacency matrix is our precomputed similarity measure).

Alternatively, using precomputed, a user-provided affinity matrix can be used.

Edit: Being unfamiliar with this, I looked for parameters to tune and found assign_labels:

The strategy to use to assign labels in the embedding space. There are two ways to assign labels after the Laplacian embedding. k-means can be applied and is a popular choice. But it can also be sensitive to initialization. Discretization is another approach which is less sensitive to random initialization.

So let's try the less-sensitive approach:

sc = SpectralClustering(2, affinity='precomputed', n_init=100, assign_labels='discretize')

Output:

ground truth
[0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 0 0 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1]
spectral clustering
[0 0 1 0 0 0 0 0 1 1 0 0 0 0 1 1 0 0 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1]
just for better-visualization: invert clusters (permutation)
[1 1 0 1 1 1 1 1 0 0 1 1 1 1 0 0 1 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0]
0.771725032425
0.722546051351

Which is a pretty much perfect fit to the ground truth!
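
(For completeness, a small follow-up sketch that is not in the original answer: since the Adjusted Rand Index is invariant to label permutations, both assign_labels strategies can be compared directly, reusing gt and adj_mat from the code above.)

# Reuses gt, adj_mat, SpectralClustering and metrics from the code above
for strategy in ('kmeans', 'discretize'):
    sc = SpectralClustering(2, affinity='precomputed', n_init=100,
                            assign_labels=strategy)
    sc.fit(adj_mat)
    print(strategy, metrics.adjusted_rand_score(gt, sc.labels_))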


Thanks! Seeing the final result gives me some confidence that this might be the right path to take. - sascha
1
Could you print adj_mat? Then I can reproduce it without having to install networkx. - sinapan

9

Here is a dummy example, just to see what it does to a simple similarity matrix, inspired by sascha's answer.

Code

import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn import metrics
np.random.seed(0)

adj_mat = [[3,2,2,0,0,0,0,0,0],
           [2,3,2,0,0,0,0,0,0],
           [2,2,3,1,0,0,0,0,0],
           [0,0,1,3,3,3,0,0,0],
           [0,0,0,3,3,3,0,0,0],
           [0,0,0,3,3,3,1,0,0],
           [0,0,0,0,0,1,3,1,1],
           [0,0,0,0,0,0,1,3,1],
           [0,0,0,0,0,0,1,1,3]]

adj_mat = np.array(adj_mat)

sc = SpectralClustering(3, affinity='precomputed', n_init=100)
sc.fit(adj_mat)

print('spectral clustering')
print(sc.labels_)

Output

spectral clustering
[0 0 0 1 1 1 2 2 2]
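
(A possible follow-up, not in the original answer: drawing the dummy graph with nodes colored by the labels found above makes the three blocks visible. This assumes networkx is installed; adj_mat and sc come from the code above.)

import networkx as nx

# Build a graph from the similarity matrix; the nonzero diagonal only adds
# self-loops, which are harmless for this visualization
G = nx.from_numpy_array(adj_mat)
nx.draw(G, pos=nx.spring_layout(G, seed=0),
        with_labels=True, node_color=sc.labels_)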

1

Let's first cluster the graph G into K=2 clusters and then generalize to all K.

  • We can use the function linalg.algebraicconnectivity.fiedler_vector() from networkx to compute the Fiedler vector (the eigenvector corresponding to the second smallest eigenvalue of the graph Laplacian matrix) of the graph, under the assumption that the graph is a connected undirected graph.

    Then we can threshold the values of the eigenvector to compute the cluster index each node corresponds to, as shown in the next code block:

    import networkx as nx
    import numpy as np
    
    A = np.zeros((11,11))
    A[0,1] = A[0,2] = A[0,3] = A[0,4] = 1
    A[5,6] = A[5,7] = A[5,8] = A[5,9] = A[5,10] = 1
    A[0,5] = 5
    
    G = nx.from_numpy_array(A)  # nx.from_numpy_matrix() in networkx < 3.0
    ev = nx.linalg.algebraicconnectivity.fiedler_vector(G)
    labels = [0 if v < 0 else 1 for v in ev] # using threshold 0
    labels
    # [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
    
    nx.draw(G, pos=nx.drawing.layout.spring_layout(G), 
               with_labels=True, node_color=labels)  
    

[figure: the graph drawn with nodes colored by the two cluster labels]

  • We can obtain the same clustering with an eigenanalysis of the graph Laplacian as well, by choosing the eigenvector corresponding to the 2nd smallest eigenvalue:

    L = nx.laplacian_matrix(G)
    e, v = np.linalg.eig(L.todense()) 
    idx = np.argsort(e)
    e = e[idx]
    v = v[:,idx]
    labels = [0 if x < 0 else 1 for x in v[:,1]] # using threshold 0
    labels
    # [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
    

    drawing the graph again with the clusters labeled:

[figure: the same 2-clustering, obtained from the Laplacian eigenvector]

  • With SpectralClustering from sklearn.cluster we can get the exact same result:

    sc = SpectralClustering(2, affinity='precomputed', n_init=100)
    sc.fit(A)
    sc.labels_
    # [0 0 0 0 0 1 1 1 1 1 1]
    

[figure: the same clustering obtained with SpectralClustering]

  • We can generalize the above for K > 2 clusters as follows (using k-means clustering to partition the Fiedler vector instead of thresholding):


    The following code demonstrates how k-means clustering can be used to partition the Fiedler vector and obtain a 3-clustering of a graph defined by the following adjacency matrix:

    A = np.array([[3,2,2,0,0,0,0,0,0],
                  [2,3,2,0,0,0,0,0,0],
                  [2,2,3,1,0,0,0,0,0],
                  [0,0,1,3,3,3,0,0,0],
                  [0,0,0,3,3,3,0,0,0],
                  [0,0,0,3,3,3,1,0,0],
                  [0,0,0,0,0,1,3,1,1],
                  [0,0,0,0,0,0,1,3,1],
                  [0,0,0,0,0,0,1,1,3]])
    
    K = 3 # K clusters
    G = nx.from_numpy_array(A)  # nx.from_numpy_matrix() in networkx < 3.0
    ev = nx.linalg.algebraicconnectivity.fiedler_vector(G)
    from sklearn.cluster import KMeans
    kmeans = KMeans(n_clusters=K, random_state=0).fit(ev.reshape(-1,1))
    kmeans.labels_
    # array([2, 2, 2, 0, 0, 0, 1, 1, 1])
    

    Now draw the clustered graph, labeling the nodes with the clusters obtained above:

    [figure: the graph drawn with nodes colored by the three k-means cluster labels]
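
    (A minimal sketch of how such a drawing could be produced, assuming G and kmeans from the code above; the exact layout used for the original figure is unknown.)

    nx.draw(G, pos=nx.spring_layout(G, seed=0),
            with_labels=True, node_color=kmeans.labels_)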

