Python：在三维空间中使用DBSCAN

Question

Python：在三维空间中使用DBSCAN

12

我一直在寻找一个可以处理三维点的DBSCAN实现，但没有太多进展。有人知道处理这个问题的库或者有相关经验吗？我假定DBSCAN算法可以处理三维数据，通过将e值设为半径度量，并用欧几里得距离来测量点之间的距离。如果有人尝试过实现这个问题并想分享经验，那也会非常感激，谢谢。

- user2909415

1

你尝试过sklearn吗？它有DBSCAN算法，而且还可以处理三维数据。 - Has QUIT--Anony-Mousse

@Anony-Mousse 我试过了，但是不起作用。除非我做错了什么。我通过dbscan.fit(X)给它提供了一个三维坐标列表，然后它给我返回了一个错误：期望的维度大小为2而不是3。否则，我知道你可以提供一个距离矩阵，但这对我没有太大价值，我可以自己编写一个DBSCAN算法。 - user2909415

两个选项：

自己实现DBSCAN，使用R Tree库，只需一到两天时间即可完成
使用奇异值分解

- Victor Juliet

4个回答

5

这就是我想到的实现方式，虽然不是最高效的，但是它能够工作。例如，区域查询是这个算法中的主要时间消耗器，计算两点之间的距离会重复多次，而不是将其存储以备后用。

class DBSCAN(object):

def __init__(self, eps=0, min_points=2):
    self.eps = eps
    self.min_points = min_points
    self.visited = []
    self.noise = []
    self.clusters = []
    self.dp = []

def cluster(self, data_points):
    self.visited = []
    self.dp = data_points
    c = 0
    for point in data_points:
        if point not in self.visited:
            self.visited.append(point)
            neighbours = self.region_query(point)
            if len(neighbours) < self.min_points:
                self.noise.append(point)
            else:
                c += 1
                self.expand_cluster(c, neighbours)

def expand_cluster(self, cluster_number, p_neighbours):
    cluster = ("Cluster: %d" % cluster_number, [])
    self.clusters.append(cluster)
    new_points = p_neighbours
    while new_points:
        new_points = self.pool(cluster, new_points)

def region_query(self, p):
    result = []
    for d in self.dp:
        distance = (((d[0] - p[0])**2 + (d[1] - p[1])**2 + (d[2] - p[2])**2)**0.5)
        if distance <= self.eps:
            result.append(d)
    return result

def pool(self, cluster, p_neighbours):
    new_neighbours = []
    for n in p_neighbours:
        if n not in self.visited:
            self.visited.append(n)
            n_neighbours = self.region_query(n)
            if len(n_neighbours) >= self.min_points:
                new_neighbours = self.unexplored(p_neighbours, n_neighbours)
        for c in self.clusters:
            if n not in c[1] and n not in cluster[1]:
                cluster[1].append(n)
    return new_neighbours

@staticmethod
def unexplored(x, y):
    z = []
    for p in y:
        if p not in x:
            z.append(p)
    return z

- user2909415

1

为什么要费心编写自定义代码，当你可以简单地计算2D距离矩阵并在其上执行“DBSCAN”呢？ - Sergey Bushmanov

1

为什么不使用PCA将数据压缩到2维，然后只使用2维运行DBSCAN呢？这似乎比尝试自定义构建其他东西要容易得多。

- osinrider04

1

在使用第一个答案中提供的代码一段时间后，我得出了一些结论： 1）噪点可能会出现在后面的聚类中。 2）由于无法正确处理已访问和未探索点的问题，它会抛出其他子集聚类，导致聚类少于min_points，并且 3）某些点可能会出现在两个聚类中-它们可以从两个聚类到达，在这个代码中甚至可能是其中一个聚类的核心点。官方DBSCAN算法将任何核心点放置在其所属的核心聚类中，但将仅从两个聚类可达的点放置在找到它们可达的第一个聚类中。这使得这些点的聚类取决于数据中的点的顺序，但所有点只出现一次在输出数据中-要么作为单个聚类，要么作为噪声。某些应用程序希望这些可从两个聚类到达的共享点放置在两个聚类中，但核心点应仅出现在一个聚类中。

这是我想到的方法。它两次计算两点之间的分离距离，不使用任何树，但会立即排除没有近邻的点，并创建核心点列表，因此只需要在构建核心时考虑这些点。它利用集合进行包含测试。请注意，此实现会将共享点放置在它们可达到的所有聚类中。

 class DBSCAN(object):
    def __init__(self, eps=0, min_points=2):
        self.eps = eps
        self.min_points = min_points
        self.noise = []
        self.clusters = []
        self.dp = []
        self.near_neighbours = []
        self.wp = set()
        self.proto_cores = set()

    def cluster(self, points):
        c = 0
        self.dp = points
        self.near_neighbours = self.nn(points)
        while self.proto_cores:
            near_points = set(self.proto_cores)
            for p in near_points:
                c += 1
                core = self.add_core(self.near_neighbours[p])
                complete_cluster = self.expand_cluster(core)
                self.clusters.append(["Cluster: %d" % c, complete_cluster])
                self.proto_cores -= core
                break
        for pt in self.dp:
            flag = True
            for c in self.clusters:
                if pt in c[1]:
                    flag = False
            if flag:
                self.noise.append(pt)

    # set up dictionary of near neighbours,create working_point and proto_core sets
    def nn(self, points):
        self.wp = set()
        self.proto_cores = set()
        i = -1
        near_neighbours = {}
        for p in points:
            i += 1
            j = -1
            neighbours = []
            for q in points:
                j += 1
                distance = (((q[0] - p[0]) ** 2 + (q[1] - p[1]) ** 2
                             + (q[2] - p[2]) ** 2) ** 0.5)
                if distance <= self.eps:
                    neighbours.append(j)
            near_neighbours[i] = neighbours
            if len(near_neighbours[i]) > 1:
                self.wp |= {i}
            if len(near_neighbours[i]) >= self.min_points:
                self.proto_cores |= {i}
        return near_neighbours

    # add cluster core points
    def add_core(self, neighbours):
        core_points = set(neighbours)
        visited = set()
        while neighbours:
            points = set(neighbours)
            neighbours = set()
            for p in points:
                visited |= {p}
                if len(self.near_neighbours[p]) >= self.min_points:
                    core_points |= set(self.near_neighbours[p])
                    neighbours |= set(self.near_neighbours[p])
            neighbours -= visited
        return core_points

    # expand cluster to reachable points and rebuild actual point values
    def expand_cluster(self, core):
        core_points = set(core)
        full_cluster = []
        for p in core_points:
            core |= set(self.near_neighbours[p])
        for point_number in core:
            full_cluster.append(self.dp[point_number])
        return full_cluster

- Pmason

你如何运行这个脚本？我应该在笔记本中运行哪个示例代码来运行这个类？ - Artemis F

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- mayTheCodeBeWithYou · Accepted Answer

你可以使用sklearn进行DBSCAN。这是一些对我有效的代码-

from sklearn.cluster import DBSCAN
import numpy as np
data = np.random.rand(500,3)

db = DBSCAN(eps=0.12, min_samples=1).fit(data)
labels = db.labels_
from collections import Counter
Counter(labels)

我得到的输出是-

Counter({1: 342, 10: 30, 31: 13, 13: 11, 30: 10, 24: 5, 29: 5, 2: 4, 18: 4,
19: 4, 28: 4, 49: 4, 3: 3, 17: 3, 23: 3, 32: 3, 7: 2, 9: 2, 12: 2, 14: 2, 15: 2,
16: 2, 20: 2, 21: 2, 26: 2, 39: 2, 41: 2, 46: 2, 0: 1, 4: 1, 5: 1, 6: 1, 8: 1, 11:
1, 22: 1, 25: 1, 27: 1, 33: 1, 34: 1, 35: 1, 36: 1, 37: 1, 38: 1, 40: 1, 42: 1,
43: 1, 44: 1, 45: 1, 47: 1, 48: 1, 50: 1, 51: 1, 52: 1, 53: 1, 54: 1, 55: 1})

因此，聚类识别出55个簇，每个簇中的数据点数量如上所示。