Is there a variant of this algorithm, or another algorithm, that makes all clusters contain equal numbers of members?
Added January 2012:
Rather than post-processing, a better approach is to keep the cluster sizes roughly equal as they grow.
For example, for each X find the 3 nearest centres, then add X to the one with the best distance − λ·clustersize.
diffsizecentres = kmeans( X, centres, ... )  # ordinary k-means step
X_centre_distances = scipy.spatial.distance.cdist( X, diffsizecentres )
    # or just the nearest few centres
xtoc = samesizecluster( X_centre_distances )
samesizecentres = [X[xtoc == c].mean(axis=0) for c in range(k)]
...
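The distance − λ·clustersize rule above might be sketched like this. This is one interpretation, not the answer's code: "best" is read as minimising distance plus a size penalty, λ and the 3-nearest restriction are tunable, and the processing order (most confident points first) is my own choice:

```python
import numpy as np
from scipy.spatial.distance import cdist

def assign_balanced(X, centres, lam=0.5, nnear=3):
    """Assign each point to one of its nnear nearest centres,
    penalising centres that are already large (lam * size)."""
    D = cdist(X, centres)                     # Npt x C distances
    sizes = np.zeros(len(centres), int)       # current cluster sizes
    xtoc = np.empty(len(X), int)
    for x in np.argsort(D.min(axis=1)):       # most confident points first
        near = np.argsort(D[x])[:nnear]       # the nnear nearest centres
        c = near[np.argmin(D[x, near] + lam * sizes[near])]
        xtoc[x] = c
        sizes[c] += 1
    return xtoc
```

With λ = 0 this is plain nearest-centre assignment; larger λ trades distance for balance.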
I'd be surprised if iterating changed the centres much, but it will depend™.
Roughly how big are your Npt (number of points), C (number of clusters), and dim (number of dimensions)?
#!/usr/bin/env python
from operator import itemgetter
import numpy as np

__date__ = "2011-03-28 Mar denis"

def samesizecluster( D ):
    """ in: point-to-cluster-centre distances D, Npt x C
            e.g. from scipy.spatial.distance.cdist
        out: xtoc, X -> C, equal-size clusters
        method: sort all D, greedy
    """
    # could take only the nearest few x-to-C distances
    # add constraints to real assignment algorithm ?
    Npt, C = D.shape
    clustersize = (Npt + C - 1) // C
    xcd = list( np.ndenumerate(D) )  # ((0,0), d00), ((0,1), d01) ...
    xcd.sort( key=itemgetter(1) )
    xtoc = np.ones( Npt, int ) * -1
    nincluster = np.zeros( C, int )
    nall = 0
    for (x,c), d in xcd:
        if xtoc[x] < 0 and nincluster[c] < clustersize:
            xtoc[x] = c
            nincluster[c] += 1
            nall += 1
            if nall >= Npt: break
    return xtoc
#...............................................................................
if __name__ == "__main__":
    import random
    import sys
    from scipy.spatial import distance
    # http://docs.scipy.org/doc/scipy/reference/spatial.distance.html
    Npt = 100
    C = 3
    dim = 3
    seed = 1
    exec( "\n".join( sys.argv[1:] ))  # run this.py N= ...
    np.set_printoptions( 2, threshold=200, edgeitems=5, suppress=True )  # .2f
    random.seed(seed)
    np.random.seed(seed)
    X = np.random.uniform( size=(Npt,dim) )
    centres = X[random.sample( range(Npt), C )]  # C distinct data points as initial centres
    D = distance.cdist( X, centres )
    xtoc = samesizecluster( D )
    print( "samesizecluster sizes:", np.bincount(xtoc) )
    # Npt=100 C=3 -> 32 34 34
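The "real assignment algorithm" mentioned in the comments can also be used directly: duplicate each centre clustersize times and solve the resulting rectangular assignment problem with `scipy.optimize.linear_sum_assignment`. A sketch under that assumption (the function name `samesize_assignment` is mine, not from the answer):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def samesize_assignment(X, centres):
    """Optimal equal-size assignment: each cluster gets clustersize
    'slots', and points are matched to slots at minimum total distance."""
    Npt, C = len(X), len(centres)
    clustersize = (Npt + C - 1) // C
    D = cdist(X, centres)
    bigD = np.repeat(D, clustersize, axis=1)   # one column per cluster slot
    rows, cols = linear_sum_assignment(bigD)   # rectangular: Npt <= C*clustersize
    xtoc = cols[np.argsort(rows)] // clustersize
    return xtoc
```

This is globally optimal for the balanced assignment, at the cost of solving an Npt × (C·clustersize) problem instead of the greedy sort.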
I also struggled with this problem. However, I realised I had been searching with the wrong keyword: if you want the resulting memberships to be the same size, you are doing grouping, not clustering any more. I eventually solved the problem with a simple Python script and PostGIS queries.
For example, say I have a table called tb_points containing 4000 coordinate points, and you want to split them into 10 equal-size groups of 400 points each. Here is an example of the table structure:
CREATE TABLE tb_points (
    id SERIAL PRIMARY KEY,
    outlet_id INTEGER,
    longitude FLOAT,
    latitude FLOAT,
    group_id INTEGER
);
Then what you need to do is: repeatedly pick a seed from the still-unassigned points, take its 400 nearest unassigned neighbours, and stamp them all with the same group_id, ten times over.
Here is the Python implementation:
import psycopg2

dbhost = ''
dbuser = ''
dbpass = ''
dbname = ''
dbport = 5432

conn = psycopg2.connect(host=dbhost, user=dbuser, password=dbpass,
                        database=dbname, port=dbport)
def fetch(sql):
    cursor = conn.cursor()
    rs = None
    try:
        cursor.execute(sql)
        rs = cursor.fetchall()
    except psycopg2.Error as e:
        print(e.pgerror)
        rs = 'error'
    cursor.close()
    return rs

def execScalar(sql):
    cursor = conn.cursor()
    try:
        cursor.execute(sql)
        conn.commit()
        rowsaffected = cursor.rowcount
    except psycopg2.Error as e:
        print(e.pgerror)
        rowsaffected = -1
        conn.rollback()
    cursor.close()
    return rowsaffected
def select_first_cluster_id():
    # seed query: the farthest-apart pair of still-unassigned points
    sql = """ SELECT a.outlet_id as ori_id, a.longitude as ori_lon,
        a.latitude as ori_lat, b.outlet_id as dest_id, b.longitude as
        dest_lon, b.latitude as dest_lat,
        ST_Distance(CAST(ST_SetSRID(ST_Point(a.longitude,a.latitude),4326)
        AS geography),
        CAST(ST_SetSRID(ST_Point(b.longitude,b.latitude),4326) AS geography))
        AS air_distance FROM tb_points a CROSS JOIN tb_points b WHERE
        a.outlet_id != b.outlet_id and a.group_id is NULL and b.group_id is
        null order by air_distance desc limit 1 """
    return sql
def update_group_id(group_id, ori_id, limit_constraint):
    # claim the limit_constraint nearest unassigned points around the seed
    sql = """ UPDATE tb_points
        set group_id = %s
        where outlet_id in
        (select b.outlet_id
         from tb_points a,
              tb_points b
         where a.outlet_id = '%s'
         and a.group_id is null
         and b.group_id is null
         order by ST_Distance(CAST(ST_SetSRID(ST_Point(a.longitude,a.latitude),4326) AS geography),
                              CAST(ST_SetSRID(ST_Point(b.longitude,b.latitude),4326) AS geography)) asc
         limit %s)
        """ % (group_id, ori_id, limit_constraint)
    return sql
def clustering():
    data_constraint = [400]   # group size: 4000 points / 10 groups
    n = 1
    while n <= 10:
        sql = select_first_cluster_id()
        res = fetch(sql)
        ori_id = res[0][0]    # seed: one end of the farthest unassigned pair
        sql = update_group_id(n, ori_id, data_constraint[0])
        print(sql)
        execScalar(sql)
        n += 1

clustering()
Hope it helps.
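For reference, the same grouping logic can be sketched without a database, using plain Euclidean distance in numpy in place of PostGIS's geographic ST_Distance (the function name `group_points` is mine):

```python
import numpy as np
from scipy.spatial.distance import cdist

def group_points(points, n_groups, group_size):
    """Repeatedly seed from the farthest-apart unassigned pair, then
    claim the seed's group_size nearest unassigned neighbours."""
    group_id = np.full(len(points), -1)
    for g in range(n_groups):
        free = np.where(group_id < 0)[0]
        D = cdist(points[free], points[free])
        seed = free[np.unravel_index(D.argmax(), D.shape)[0]]
        d_seed = np.linalg.norm(points[free] - points[seed], axis=1)
        group_id[free[np.argsort(d_seed)[:group_size]]] = g
    return group_id
```

Each pass assigns exactly group_size points (the seed's own distance is 0, so it claims itself), mirroring the UPDATE ... LIMIT query above.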
Also have a look at the K-d tree, which partitions the data until each partition has fewer members than a BUCKET_SIZE given as input to the algorithm.
This doesn't force the buckets/partitions to be exactly the same size, but they will all be smaller than BUCKET_SIZE.
http://www.ideal.ece.utexas.edu/papers/arindam04tnn.pdf
There is also an online/streaming version.
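A toy median-split version of that bucketing idea, to illustrate the BUCKET_SIZE guarantee (this is not the paper's algorithm, just a minimal sketch):

```python
import numpy as np

def kd_buckets(idx, X, bucket_size):
    """Recursively split the widest dimension at its median until
    every leaf (bucket) holds at most bucket_size point indices."""
    if len(idx) <= bucket_size:
        return [idx]
    dim = np.argmax(X[idx].max(axis=0) - X[idx].min(axis=0))  # widest axis
    order = idx[np.argsort(X[idx, dim])]
    mid = len(order) // 2
    return (kd_buckets(order[:mid], X, bucket_size)
            + kd_buckets(order[mid:], X, bucket_size))

X = np.random.RandomState(0).rand(100, 2)
buckets = kd_buckets(np.arange(len(X)), X, bucket_size=16)
```

Because each split halves the set, sibling buckets stay within one point of each other in size, though the guarantee is only the upper bound.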
I added the algorithms from most of the answers here to the repository https://github.com/brand17/clusters_equal_size.
The most efficient is the most-upvoted answer.
I developed a couple of other algorithms (the most-upvoted one is still the best):
I modified the proposed iterating-swaps approach so that it identifies and eliminates whole cycles of moves rather than single swaps (this improves efficiency a bit).
I also developed the following algorithm: pairs of nearest centroids exchange points by efficiently moving the Voronoi-diagram boundary between them, so that the numbers of points differ by at most one.
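A much-simplified sketch of that balancing idea (not the repository's code; it rebalances the largest against the smallest cluster rather than nearest centroid pairs): repeatedly move, from the currently largest cluster, the point closest to the smallest cluster's centroid, until all sizes differ by at most one.

```python
import numpy as np

def rebalance(X, labels, centres):
    """Shift boundary points from the largest to the smallest cluster
    until cluster sizes differ by at most one point."""
    labels = labels.copy()
    while True:
        sizes = np.bincount(labels, minlength=len(centres))
        big, small = np.argmax(sizes), np.argmin(sizes)
        if sizes[big] - sizes[small] <= 1:
            return labels
        members = np.where(labels == big)[0]
        d = np.linalg.norm(X[members] - centres[small], axis=1)
        labels[members[np.argmin(d)]] = small   # move one boundary point
```

Each move strictly reduces the size imbalance, so the loop terminates; the real algorithm would also update the centroids as points move.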