Spectral clustering a graph in Python

34

I would like to cluster a graph in Python using spectral clustering.

Spectral clustering is a more general technique that can be applied not only to graphs but also to images, or any sort of data; however, it is considered an exceptional graph clustering technique. Unfortunately, I can't find examples of spectral clustering of graphs in Python online.

I would appreciate some direction on how to tackle this. If someone can help me figure it out, I can add the documentation to Scikit Learn.

Comments:


Looking at this source code shows that SpectralClustering is an object-oriented wrapper that, among other things, calls the spectral_clustering function. https://stackoverflow.com/a/55720891/6509615 (a short sketch of calling that function directly follows these comments) - rlchqrd
Is there a way to do this on weighted graphs? - Daniel Pop
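
(Not part of the original question, just a hedged illustration of the two comments above: the spectral_clustering function can be called directly instead of the SpectralClustering estimator, and a weighted adjacency matrix works as the precomputed affinity. The small graph below is made up for illustration.)

import numpy as np
import networkx as nx
from sklearn.cluster import spectral_clustering  # the function wrapped by SpectralClustering

# Made-up weighted graph: two tight triangles joined by one weak edge
G = nx.Graph()
G.add_weighted_edges_from([(0, 1, 5.0), (1, 2, 4.0), (0, 2, 3.0),
                           (3, 4, 5.0), (4, 5, 4.0), (3, 5, 3.0),
                           (2, 3, 0.1)])

# Edge weights become the entries of the affinity (similarity) matrix
affinity = nx.to_numpy_array(G)

# Call the function directly instead of the estimator
labels = spectral_clustering(affinity, n_clusters=2, random_state=0)
print(labels)  # cluster ids may be permuted, e.g. [0 0 0 1 1 1]
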
3 Answers

39

Without much experience with spectral clustering, and just going by the docs (skip to the end for the results!):

Code:

import numpy as np
import networkx as nx
from sklearn.cluster import SpectralClustering
from sklearn import metrics
np.random.seed(1)

# Get your mentioned graph
G = nx.karate_club_graph()

# Get ground-truth: club-labels -> transform to 0/1 np-array
#     (possible overcomplicated networkx usage here)
gt_dict = nx.get_node_attributes(G, 'club')
gt = [gt_dict[i] for i in G.nodes()]
gt = np.array([0 if i == 'Mr. Hi' else 1 for i in gt])

# Get adjacency-matrix as numpy-array
adj_mat = nx.to_numpy_array(G)  # nx.to_numpy_matrix() was removed in networkx 3.0

print('ground truth')
print(gt)

# Cluster
sc = SpectralClustering(2, affinity='precomputed', n_init=100)
sc.fit(adj_mat)

# Compare ground-truth and clustering-results
print('spectral clustering')
print(sc.labels_)
print('just for better-visualization: invert clusters (permutation)')
print(np.abs(sc.labels_ - 1))

# Calculate some clustering metrics
print(metrics.adjusted_rand_score(gt, sc.labels_))
print(metrics.adjusted_mutual_info_score(gt, sc.labels_))

Output:

ground truth
[0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 0 0 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1]
spectral clustering
[1 1 0 1 1 1 1 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
just for better-visualization: invert clusters (permutation)
[0 0 1 0 0 0 0 1 1 1 0 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
0.204094758281
0.271689477828

General idea:

Starting from the data and task introduced here:

The nodes in the graph represent the 34 members of a university karate club. (Zachary was a sociologist, and one of the members.) An edge between two nodes indicates that the two members spent significant time together outside normal club meetings. The dataset is interesting because, while Zachary was collecting his data, a dispute arose in the karate club and it split into two factions: one led by "Mr. Hi" and one led by "John A". It turns out that using only the connectivity information (the edges), it is possible to recover the two factions.

Using sklearn & spectral clustering to tackle this:

If the affinity matrix is the adjacency matrix of a graph, this method can be used to find normalized graph cuts.

Normalized graph cuts are described there as:

Find two disjoint partitions A and B of the vertices V of the graph, so that A ∪ B = V and A ∩ B = ∅

Given a similarity measurement w(i, j) between two vertices (e.g. their connectivity), the cut value (and its normalized version) is defined as: cut(A, B) = Σ_{u ∈ A, v ∈ B} w(u, v)

...

We seek to minimize the disassociation between the groups A and B while maximizing the association within each group.
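
(A small sketch, not part of the original answer, just to make the definition concrete: the cut value between two node sets can be computed directly from the adjacency matrix. The partition used here is the ground-truth split of the karate-club graph from the code above, and the normalization follows the standard Shi-Malik normalized-cut definition.)

import numpy as np
import networkx as nx

G = nx.karate_club_graph()
W = nx.to_numpy_array(G)  # adjacency matrix = similarity w(u, v)

# Ground-truth split into the two factions, as in the code above
club = nx.get_node_attributes(G, 'club')
A = [n for n in G.nodes() if club[n] == 'Mr. Hi']
B = [n for n in G.nodes() if club[n] != 'Mr. Hi']

# cut(A, B) = sum of w(u, v) over u in A, v in B
cut_value = W[np.ix_(A, B)].sum()

# Normalized cut: cut(A,B)/assoc(A,V) + cut(A,B)/assoc(B,V)
assoc_A = W[A, :].sum()
assoc_B = W[B, :].sum()
ncut = cut_value / assoc_A + cut_value / assoc_B

print(cut_value, ncut)
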

Sounds good. So we create the adjacency matrix (nx.to_numpy_array(G)) and set the parameter affinity to precomputed (because our adjacency matrix is our precomputed similarity measure).

Alternatively, using precomputed, a user-provided affinity matrix can be used.

Edit: Being unfamiliar with this, I looked for parameters to tune and found assign_labels:

The strategy to use to assign labels in the embedding space. There are two ways to assign labels after the Laplacian embedding. k-means can be applied and is a popular choice. But it can also be sensitive to initialization. Discretization is another approach which is less sensitive to random initialization.

So let's try the less-sensitive approach:

sc = SpectralClustering(2, affinity='precomputed', n_init=100, assign_labels='discretize')

Output:

ground truth
[0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 0 0 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1]
spectral clustering
[0 0 1 0 0 0 0 0 1 1 0 0 0 0 1 1 0 0 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1]
just for better-visualization: invert clusters (permutation)
[1 1 0 1 1 1 1 1 0 0 1 1 1 1 0 0 1 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0]
0.771725032425
0.722546051351

Which is a pretty much perfect fit to the ground truth!
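
(For completeness, a small follow-up sketch that is not in the original answer: since the Adjusted Rand Index is invariant to label permutations, both assign_labels strategies can be compared directly, reusing gt and adj_mat from the code above.)

# Reuses gt, adj_mat, SpectralClustering and metrics from the code above
for strategy in ('kmeans', 'discretize'):
    sc = SpectralClustering(2, affinity='precomputed', n_init=100,
                            assign_labels=strategy)
    sc.fit(adj_mat)
    print(strategy, metrics.adjusted_rand_score(gt, sc.labels_))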


Thanks! Seeing the final result gives me some confidence that this might be the right path to take. - sascha
1
Could you print adj_mat? Then I can reproduce it without having to install networkx. - sinapan

9

Here is a dummy example, just to see what it does to a simple similarity matrix, inspired by sascha's answer.

Code

import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn import metrics
np.random.seed(0)

adj_mat = [[3,2,2,0,0,0,0,0,0],
           [2,3,2,0,0,0,0,0,0],
           [2,2,3,1,0,0,0,0,0],
           [0,0,1,3,3,3,0,0,0],
           [0,0,0,3,3,3,0,0,0],
           [0,0,0,3,3,3,1,0,0],
           [0,0,0,0,0,1,3,1,1],
           [0,0,0,0,0,0,1,3,1],
           [0,0,0,0,0,0,1,1,3]]

adj_mat = np.array(adj_mat)

sc = SpectralClustering(3, affinity='precomputed', n_init=100)
sc.fit(adj_mat)

print('spectral clustering')
print(sc.labels_)

Output

spectral clustering
[0 0 0 1 1 1 2 2 2]
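
(A possible follow-up, not in the original answer: drawing the dummy graph with nodes colored by the labels found above makes the three blocks visible. This assumes networkx is installed; adj_mat and sc come from the code above.)

import networkx as nx

# Build a graph from the similarity matrix; the nonzero diagonal only adds
# self-loops, which are harmless for this visualization
G = nx.from_numpy_array(adj_mat)
nx.draw(G, pos=nx.spring_layout(G, seed=0),
        with_labels=True, node_color=sc.labels_)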

1

Let's first cluster the graph G into K=2 clusters and then generalize to all K.

  • We can use the function linalg.algebraicconnectivity.fiedler_vector() from networkx to compute the Fiedler vector (the eigenvector corresponding to the second smallest eigenvalue of the graph Laplacian matrix) of the graph, under the assumption that the graph is a connected undirected graph.

    Then we can threshold the values of the eigenvector to compute the cluster index each node corresponds to, as shown in the next code block:

    import networkx as nx
    import numpy as np
    
    A = np.zeros((11,11))
    A[0,1] = A[0,2] = A[0,3] = A[0,4] = 1
    A[5,6] = A[5,7] = A[5,8] = A[5,9] = A[5,10] = 1
    A[0,5] = 5
    
    G = nx.from_numpy_array(A)  # nx.from_numpy_matrix() in networkx < 3.0
    ev = nx.linalg.algebraicconnectivity.fiedler_vector(G)
    labels = [0 if v < 0 else 1 for v in ev] # using threshold 0
    labels
    # [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
    
    nx.draw(G, pos=nx.drawing.layout.spring_layout(G), 
               with_labels=True, node_color=labels)  
    

[figure: the graph drawn with nodes colored by the two cluster labels]

  • We can obtain the same clustering with an eigenanalysis of the graph Laplacian as well, by choosing the eigenvector corresponding to the 2nd smallest eigenvalue:

    L = nx.laplacian_matrix(G)
    e, v = np.linalg.eig(L.todense()) 
    idx = np.argsort(e)
    e = e[idx]
    v = v[:,idx]
    labels = [0 if x < 0 else 1 for x in v[:,1]] # using threshold 0
    labels
    # [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
    

    drawing the graph again with the clusters labeled:

[figure: the same 2-clustering, obtained from the Laplacian eigenvector]

  • With SpectralClustering from sklearn.cluster we can get the exact same result:

    sc = SpectralClustering(2, affinity='precomputed', n_init=100)
    sc.fit(A)
    sc.labels_
    # [0 0 0 0 0 1 1 1 1 1 1]
    

[figure: the same clustering obtained with SpectralClustering]

  • We can generalize the above for K > 2 clusters as follows (using k-means clustering to partition the Fiedler vector instead of thresholding):


    The following code demonstrates how k-means clustering can be used to partition the Fiedler vector and obtain a 3-clustering of a graph defined by the following adjacency matrix:

    A = np.array([[3,2,2,0,0,0,0,0,0],
                  [2,3,2,0,0,0,0,0,0],
                  [2,2,3,1,0,0,0,0,0],
                  [0,0,1,3,3,3,0,0,0],
                  [0,0,0,3,3,3,0,0,0],
                  [0,0,0,3,3,3,1,0,0],
                  [0,0,0,0,0,1,3,1,1],
                  [0,0,0,0,0,0,1,3,1],
                  [0,0,0,0,0,0,1,1,3]])
    
    K = 3 # K clusters
    G = nx.from_numpy_array(A)  # nx.from_numpy_matrix() in networkx < 3.0
    ev = nx.linalg.algebraicconnectivity.fiedler_vector(G)
    from sklearn.cluster import KMeans
    kmeans = KMeans(n_clusters=K, random_state=0).fit(ev.reshape(-1,1))
    kmeans.labels_
    # array([2, 2, 2, 0, 0, 0, 1, 1, 1])
    

    Now draw the clustered graph, labeling the nodes with the clusters obtained above:

    [figure: the graph drawn with nodes colored by the three k-means cluster labels]
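
    (A minimal sketch of how such a drawing could be produced, assuming G and kmeans from the code above; the exact layout used for the original figure is unknown.)

    nx.draw(G, pos=nx.spring_layout(G, seed=0),
            with_labels=True, node_color=kmeans.labels_)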

