SKLearn KMeans 收敛警告

Question

SKLearn KMeans 收敛警告

5

我正在使用SKLearn的KMeans聚类算法处理一个一维数据集。遇到的问题是，当我运行代码时，会出现ConvergenceWarning警告：

ConvergenceWarning: Number of distinct clusters (<some integer n>) found smaller than n_clusters (<some integer bigger than n>). Possibly due to duplicate points in X.
  return_n_iter=True)

除了源代码外，我找不到任何关于此事的信息，源代码也没有指出具体出了什么问题。我认为我的错误可能是因为我有一个一维数据结构，或者因为我在如何使用SKLearn中的一维数组时出了问题。以下是有问题的代码：

def cluster_data(data_arr):
    """clusters the uas for a specific site"""
    d = 1.0
    k = 1
    inertia_prev = 1.0
    while k <= MAX and d > DELTA: 
        #max is the size of the input array, delta is .05
        kmean = KMeans(n_clusters=k)
        prediction = kmean.fit_predict(data_arr.reshape(-1, 1))
        #bug could be in the reshape!
        inertia_curr = kmean.inertia_
        d = abs(1 - (inertia_curr / inertia_prev))
        inertia_prev = inertia_curr
        k += 1

一些演示IO：样例输入：

[(11.,) ( 0.,) ( 0.,) ( 0.,) ( 0.,) ( 0.,) ( 7.,) ( 0.,) ( 4.,) ( 7.,)
( 7.,) (13.,) ( 2.,) ( 0.,) ( 0.,) ( 0.,) ( 0.,) ( 0.,) ( 0.,) ( 0.,)
 ( 7.,) ( 2.,) ( 0.,) ( 0.,) (11.,) ( 7.,) ( 7.,) ( 0.,) ( 2.,) ( 1.,)
 ( 0.,) ( 0.,) ( 0.,) ( 7.,) ( 5.,) ( 0.,) ( 0.,) ( 4.,) ( 0.,) ( 0.,)
 ( 0.,) ( 0.,) ( 8.,) ( 0.,) ( 4.,) (10.,) ( 0.,) (11.,) (13.,) (11.,)
 (11.,) ( 0.,) ( 0.,) ( 0.,) ( 0.,) ( 0.,) ( 0.,) ( 0.,) ( 7.,) ( 7.,)
 ( 5.,) ( 5.,) ( 5.,) ( 5.,) ( 5.,) (10.,) (16.,) (15.,) (13.,) ( 2.,)
 ( 5.,) ( 5.,) ( 5.,) ( 5.,) ( 5.,) ( 5.,) ( 5.,) ( 5.,) ( 5.,) ( 5.,)
 ( 5.,) (15.,) (14.,) (14.,) (15.,) (14.,) (15.,) (15.,) ( 5.,) (14.,)
 (15.,) (15.,) (15.,) ( 5.,) (15.,) ( 7.,) ( 5.,) ( 5.,) ( 5.,) (11.,)
 ( 5.,) ( 5.,) ( 5.,) ( 2.,) ( 0.,) ( 0.,) ( 0.,) ( 0.,) ( 0.,) ( 0.,)
 ( 5.,) ( 5.,) ( 5.,) ( 5.,) ( 5.,) ( 5.,) ( 0.,) ( 0.,) ( 5.,) ( 0.,)
 ( 0.,) ( 0.,) ( 0.,) ( 0.,) ( 0.,) ( 0.,) ( 0.,) ( 0.,) ( 0.,) ( 0.,)
 ( 0.,) ( 0.,) ( 2.,) ( 0.,) ( 0.,) ( 0.,) ( 0.,) ( 1.,) ( 0.,) ( 7.,)
 ( 0.,) (11.,) ( 0.,) ( 0.,) (11.,) ( 5.,) ( 0.,) (15.,) ( 2.,) ( 2.,)
 ( 5.,) ( 5.,) (11.,) ( 0.,) ( 0.,) ( 0.,) (13.,) ( 2.,) ( 5.,) (13.,)
 ( 0.,) ( 8.,) ( 8.,) ( 2.,) ( 2.,) ( 0.,) ( 5.,) ( 5.,) ( 0.,) ( 0.,)
 ( 0.,) (11.,) ( 5.,) ( 5.,) ( 5.,) ( 0.,) ( 0.,) (11.,) ( 8.,) ( 5.,)
 ( 0.,) ( 7.,) ( 5.,) ( 0.,) (11.,) ( 0.,) ( 0.,) ( 2.,) ( 0.,) (11.,)
 (11.,) ( 7.,) ( 0.,) (13.,) (15.,) ( 0.,) ( 5.,) ( 7.,) ( 0.,) ( 5.,)
 ( 5.,) ( 2.,) ( 5.,) ( 0.,) ( 0.,) ( 5.,) ( 0.,) ( 0.,) ( 0.,) ( 7.,)
 ( 0.,) ( 0.,) (11.,) ( 0.,) ( 5.,) ( 5.,) ( 0.,) ( 5.,) (11.,) ( 5.,)
 ( 5.,) ( 5.,) ( 5.,) ( 5.,) ( 5.,) ( 5.,) ( 0.,) ( 5.,) ( 5.,) ( 0.,)
 ( 7.,) ( 0.,) ( 0.,) ( 0.,) ( 0.,) ( 0.,) ( 0.,) ( 0.,) ( 0.,) ( 0.,)
 (11.,) ( 0.,) ( 0.,) (11.,) (11.,) (11.,) ( 1.,) ( 1.,) ( 5.,) ( 5.,)
 ( 5.,) ( 0.,) ( 0.,) ( 0.,) ( 2.,) ( 0.,) ( 2.,) ( 0.,) ( 0.,) ( 0.,)
 ( 0.,) ( 0.,) ( 0.,) ( 0.,) ( 0.,) ( 0.,) ( 0.,) ( 0.,) ( 0.,) ( 0.,)
 ( 0.,) ( 5.,) ( 0.,) ( 0.,) ( 0.,) ( 5.,) ( 0.,) ( 0.,) ( 5.,) ( 0.,)
 ( 0.,) ( 0.,) ( 0.,) ( 0.,) ( 0.,) ( 1.,) ( 1.,) ( 0.,) ( 0.,) ( 0.,)
 ( 0.,) ( 5.,) ( 0.,) ( 2.,) ( 0.,) ( 0.,) ( 0.,) ( 0.,) ( 0.,) ( 0.,)
 ( 0.,) ( 0.,) ( 5.,) ( 0.,) (11.,) (11.,) ( 7.,) (11.,) (11.,) ( 2.,)
 ( 0.,) ( 2.,) ( 1.,) ( 0.,) ( 0.,) (11.,) ( 0.,) (11.,) ( 0.,) ( 7.,)
 ( 0.,) ( 0.,) (11.,) ( 5.,) ( 0.,) ( 0.,) ( 0.,) ( 0.,) ( 0.,) ( 5.,)
 ( 0.,) ( 4.,) ( 5.,) ( 5.,) ( 0.,) ( 0.,) ( 8.,) ( 7.,) ( 0.,) ( 0.,)
 ( 0.,) ( 0.,) ( 8.,) ( 0.,) ( 4.,) ( 0.,) ( 8.,) ( 8.,) ( 2.,) (10.,)
 ( 0.,) ( 0.,) ( 0.,) ( 0.,) ( 0.,) ( 8.,) ( 0.,) ( 0.,) ( 5.,) (15.,)
 (15.,) ( 0.,) ( 5.,) (15.,) (15.,) ( 2.,) (15.,) ( 5.,) ( 2.,) ( 2.,)
 ( 2.,) (15.,) (13.,) ( 0.,) ( 2.,) ( 0.,) ( 2.,) ( 2.,) ( 2.,) ( 2.,)
 ( 2.,) ( 0.,) (13.,) ( 5.,) ( 0.,) ( 0.,) ( 0.,) ( 0.,) ( 0.,) ( 2.,)
 ( 0.,) ( 0.,) ( 0.,) (13.,) ( 0.,) ( 5.,) ( 5.,) ( 5.,) ( 5.,) ( 5.,)
 ( 7.,) ( 5.,) ( 5.,) ( 0.,) ( 0.,) ( 0.,) ( 0.,) ( 0.,) ( 0.,) ( 8.,)
 ( 0.,) ( 0.,) ( 0.,) ( 0.,) ( 0.,) ( 0.,) ( 0.,) ( 2.,) (11.,) (10.,)
 ( 2.,) ( 7.,) ( 0.,) ( 2.,) ( 0.,) ( 0.,) ( 5.,) ( 2.,) ( 5.,) ( 2.,)
 ( 5.,) ( 0.,) ( 0.,) ( 2.,) ( 2.,) ( 0.,) ( 0.,) ( 0.,) ( 0.,) ( 2.,)
 ( 0.,) ( 2.,) ( 0.,) ( 2.,) ( 5.,) ( 5.,) ( 1.,) ( 0.,) ( 0.,) ( 0.,)
 ( 0.,) (11.,) ( 5.,) ( 2.,) ( 0.,) ( 0.,) (13.,) ( 0.,) ( 5.,) (15.,)
 ( 7.,) ( 5.,) (11.,) (11.,) (16.,) (15.,) ( 7.,) (16.,) (11.,) (15.,)
 (16.,) (11.,) (17.,) (15.,) (17.,) (15.,) (11.,) ( 7.,) (11.,) ( 7.,)
 ( 7.,) (15.,) (15.,) (15.,) (16.,) (16.,) (16.,) (16.,) (16.,) (16.,)
 (17.,) (16.,) (15.,) (13.,) (14.,) (15.,) (15.,) ( 7.,) (16.,) (15.,)
 (11.,) (15.,) (17.,) (11.,) (11.,) ( 7.,) (15.,) (15.,) (11.,) (11.,)
 (15.,) (15.,) (15.,) (16.,) (11.,) ( 7.,) (16.,) (11.,) (11.,) (15.,)
 (11.,) (15.,) ( 5.,) (16.,) (11.,) (11.,) ( 7.,) (15.,) (15.,) (15.,)]

示例输出：

ConvergenceWarning: Number of distinct clusters (14) found smaller than n_clusters (15). Possibly due to duplicate points in X. return_n_iter=True)

期望的输出结果：

no warning

您可能会注意到输入中有很多重复的值。这是正常现象，我想知道如何更好地聚类这些数据，以便避免出现具有重复质心的重复聚类。

- K. Dackow

1个回答

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- ZaydH · Accepted Answer

理想情况下，指定的簇数不应超过唯一数据点的数量。如果您能相应地调整质心计数，那么将不会出现警告。

Sklearn使用warning模块来发出警告。我们可以按照下面所示的方法抑制这些警告。

with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    cluster_data(data_arr)

在with块内部，所有警告都被抑制，因此应该谨慎使用此功能。