Extract the N closest pairs from a NumPy distance array

Question

3

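I have a large, symmetric, 2D distance array. I want to get the closest N pairs of observations.

The array is stored as a NumPy condensed array and contains on the order of 100 million observations.

Here is an example that gets the 100 closest distances on a smaller array, but it is a lot slower than I would like.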

import numpy as np
import random
import sklearn.metrics.pairwise
import scipy.spatial.distance

N = 100
r = np.array([random.randrange(1, 1000) for _ in range(0, 1000)])
c = r[:, None]

dists = scipy.spatial.distance.pdist(c, 'cityblock')

# these are the indices of the closest N observations
closest = dists.argsort()[:N]

# but it's really slow to get out the pairs of observations
def condensed_to_square_index(n, c):
    # converts an index in a condensed array to the 
    # pair of observations it represents
    # modified from here: https://dev59.com/_W435IYBdhLWcg3wlRMf
    ti = np.triu_indices(n, 1)
    return ti[0][c]+ 1, ti[1][c]+ 1

r = []
n = int(np.ceil(np.sqrt(2 * len(dists))))  # number of observations; np.triu_indices needs an int
for i in closest:
    pair = condensed_to_square_index(n, i)
    r.append(pair)

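It seems to me there must be quicker ways to do this with standard numpy or scipy functions, but I'm stumped.

Note: if many pairs are equidistant, that's fine; I don't care about their ordering.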

- roblanf

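You can speed up the sort by using a partial sort. It will be about six times faster at most. - leewz

@leewangzhong, thanks. Unfortunately the bottleneck is not the sort; it is converting the list of indices back into pairs of observations. - roblanf

Related: Millions of 3D points: How to find the 10 of them closest to a given point? - jfs

Does this answer your question? Efficient way to take the minimum/maximum n values and indices from a matrix using Python and NumPy. - EliadL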

4 Answers

2

If you are using NumPy 1.8 or later, you can use np.partition (and np.argpartition) to locate the smallest values much faster:

def smallest_n(a, n):
    return np.sort(np.partition(a, n)[:n])

def argsmallest_n(a, n):
    ret = np.argpartition(a, n)[:n]
    b = np.take(a, ret)
    return np.take(ret, np.argsort(b))

dists = np.random.rand(1000*999//2) # a pdist array

In [3]: np.all(argsmallest_n(dists, 100) == np.argsort(dists)[:100])
Out[3]: True

In [4]: %timeit np.argsort(dists)[:100]
10 loops, best of 3: 73.5 ms per loop

In [5]: %timeit argsmallest_n(dists, 100)
100 loops, best of 3: 5.44 ms per loop

Once you have the indices of the smallest values, you don't need a loop to extract the pairs; you can do it in a single shot:

closest = argsmallest_n(dists, 100)
tu = np.triu_indices(1000, 1)
pairs = np.column_stack((np.take(tu[0], closest),
                         np.take(tu[1], closest))) + 1
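
For a condensed array with on the order of 100 million entries, np.triu_indices allocates two index arrays of that same length just to look up a handful of pairs. A possible alternative, sketched below, is to compute each pair arithmetically from its condensed index. The helper name condensed_to_pairs is hypothetical (it is not a NumPy or SciPy function), and it returns 0-based indices, so add 1 to match the numbering used in this thread.

```python
import numpy as np

def condensed_to_pairs(k, n):
    """Map condensed pdist indices k to 0-based (i, j) pairs with i < j.

    Hypothetical helper: n is the number of observations, so the
    condensed array has n * (n - 1) // 2 entries.
    """
    k = np.asarray(k, dtype=np.int64)
    # Row index: invert the cumulative row lengths of the upper triangle.
    i = n - 2 - np.floor(
        np.sqrt(-8 * k + 4 * n * (n - 1) - 7) / 2.0 - 0.5
    ).astype(np.int64)
    # Column index: offset of k within row i, shifted past the diagonal.
    j = k + i + 1 - n * (n - 1) // 2 + (n - i) * (n - i - 1) // 2
    return i, j

# Example: the 100 smallest entries of a random condensed array.
n_obs = 1000
rng = np.random.default_rng(0)
dists = rng.random(n_obs * (n_obs - 1) // 2)
closest = np.argpartition(dists, 100)[:100]
i, j = condensed_to_pairs(closest, n_obs)
pairs = np.column_stack((i, j)) + 1  # 1-based, as in the code above
```

The square root is evaluated in float64, which should be exact enough for array sizes like the one in the question, but that is an assumption worth checking against np.triu_indices on a small n before relying on it.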

- Jaime

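Fantastic. It looks like a single closing parenthesis is missing before the final +1. Also (minor compared with the other speedups), my timings suggest that for a list of 100 items, @mr-e's zip approach is slightly faster than np.column_stack for building the list of pairs, although I haven't checked how the two scale. - roblanf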

0

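The best solution probably won't generate all of the distances.

Proposal:

Make a heap of max size 100 (if it grows bigger, reduce it).
Use the closest-pair algorithm to find the closest pair.
Add the pair to the heap (priority queue).
Choose one of the points of that pair. Add its 99 closest neighbors to the heap.
Remove the chosen point from the list.
Find the next closest pair and repeat. The number of neighbors added is 100 minus the number of times you have run the closest-pair algorithm.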
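
A rough sketch in the same spirit follows. It is not the closest-pair procedure listed above: it still computes every distance, but only one block of rows at a time, and it keeps a running set of at most N candidates (a sorted array standing in for the bounded heap), so the full condensed array is never stored. The function name closest_pairs_chunked, the chunk parameter, and the hard-coded cityblock metric are assumptions made for illustration.

```python
import numpy as np

def closest_pairs_chunked(points, n_pairs, chunk=256):
    """Return (dist, i, j) for the n_pairs smallest cityblock distances, i < j.

    Hypothetical helper: points has shape (m, dim); indices are 0-based.
    """
    points = np.asarray(points, dtype=float)
    m = len(points)
    best_d = np.empty(0)
    best_i = np.empty(0, dtype=np.intp)
    best_j = np.empty(0, dtype=np.intp)
    cols = np.arange(m)[None, :]
    for start in range(0, m, chunk):
        stop = min(start + chunk, m)
        # Cityblock distances from this block of rows to every point.
        block = np.abs(points[start:stop, None, :] - points[None, :, :]).sum(axis=-1)
        rows = np.arange(start, stop)[:, None]
        block[cols <= rows] = np.inf  # keep each pair once, as i < j
        flat = block.ravel()
        if flat.size > n_pairs:
            idx = np.argpartition(flat, n_pairs)[:n_pairs]
        else:
            idx = np.arange(flat.size)
        bi, bj = np.unravel_index(idx, block.shape)
        # Merge this block's candidates with the running best, then trim.
        cand_d = np.concatenate([best_d, flat[idx]])
        cand_i = np.concatenate([best_i, bi + start])
        cand_j = np.concatenate([best_j, bj])
        keep = np.argsort(cand_d, kind="stable")[:n_pairs]
        best_d, best_i, best_j = cand_d[keep], cand_i[keep], cand_j[keep]
    valid = np.isfinite(best_d)  # drop padding if fewer pairs exist
    return best_d[valid], best_i[valid], best_j[valid]

# Example on data shaped like the question's: 1-D integer observations.
rng = np.random.default_rng(0)
pts = rng.integers(1, 1000, size=(300, 1))
d, i, j = closest_pairs_chunked(pts, 100, chunk=64)
```

This trades extra computation for memory: peak usage is roughly chunk times the number of points, rather than the number of points squared.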

- leewz

0

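You can use a pandas DataFrame. First, declare the similarity matrix as a DataFrame (you can compute it with the pairwise_distances() function from sklearn) and add the column names and index names from the source data. Then select any column by name (the one you are interested in), sort it with pandas.DataFrame.sort_values(), and take the top 5 or top 10. That's it.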
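
A minimal sketch of that idea, under some assumptions: the labels obs0, obs1, ... are made up, and the distance matrix is built with plain NumPy on 1-D data instead of sklearn's pairwise_distances() to keep the example small. Note that this returns the nearest neighbours of one chosen observation, not the globally closest pairs the question asks for.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
values = rng.integers(1, 1000, size=8)
labels = [f"obs{k}" for k in range(len(values))]  # hypothetical names

# Square distance matrix with labelled rows and columns.
dist = np.abs(values[:, None] - values[None, :])
df = pd.DataFrame(dist, index=labels, columns=labels)

# Nearest neighbours of one observation: pick its column, drop the
# zero self-distance, sort ascending, keep the first five.
nearest = df["obs0"].drop("obs0").sort_values().head(5)
print(nearest)
```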

- Leo


- YXD · Accepted Answer

You don't need to compute ti in every call to condensed_to_square_index. Here is a basic modification that computes it only once:

import numpy as np
import random
import sklearn.metrics.pairwise
import scipy.spatial.distance

N = 100
r = np.array([random.randrange(1, 1000) for _ in range(0, 1000)])
c = r[:, None]

dists = scipy.spatial.distance.pdist(c, 'cityblock')

# these are the indices of the closest N observations
closest = dists.argsort()[:N]

# but it's really slow to get out the pairs of observations
def condensed_to_square_index(ti, c):
    return ti[0][c]+ 1, ti[1][c]+ 1

r = []
n = int(np.ceil(np.sqrt(2 * len(dists))))  # number of observations; np.triu_indices needs an int
ti = np.triu_indices(n, 1)

for i in closest:
    pair = condensed_to_square_index(ti, i)
    r.append(pair)

You can also vectorize the creation of r:

r = list(zip(ti[0][closest] + 1, ti[1][closest] + 1))  # list() so it also works on Python 3

or

r = np.vstack(ti)[:, closest] + 1