我有一个大型对称的二维距离数组。我想要获取最接近的N对观测值。
这个数组以numpy压缩数组的形式存储,包含大约1亿个观测值。
以下是一个示例,在较小的数组上获取100个最接近的距离,但速度比我希望的要慢得多。
import numpy as np
import random
import sklearn.metrics.pairwise
import scipy.spatial.distance
N = 100
r = np.array([random.randrange(1, 1000) for _ in range(0, 1000)])
c = r[:, None]
dists = scipy.spatial.distance.pdist(c, 'cityblock')
# these are the indices of the closest N observations
closest = dists.argsort()[:N]
# but it's really slow to get out the pairs of observations
def condensed_to_square_index(n, c):
# converts an index in a condensed array to the
# pair of observations it represents
# modified from here: https://dev59.com/_W435IYBdhLWcg3wlRMf
ti = np.triu_indices(n, 1)
return ti[0][c]+ 1, ti[1][c]+ 1
r = []
n = np.ceil(np.sqrt(2* len(dists)))
for i in closest:
pair = condensed_to_square_index(n, i)
r.append(pair)
在我看来,使用标准的numpy或scipy函数可能有更快的方法来完成这个任务,但我陷入了困境。
注意:如果有许多配对是等距的,那没关系,我不在乎它们的排序。