Computing the average distance to the nearest neighbour in a Pandas DataFrame

Question

4

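I have a set of objects (cars) and their positions at different times. I want to get the distance from each car to its nearest neighbour, and then compute the average of those distances at each time point. Here is an example DataFrame: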

 import pandas as pd

 time = [0, 0, 0, 1, 1, 2, 2]
 x = [216, 218, 217, 280, 290, 130, 132]
 y = [13, 12, 12, 110, 109, 3, 56]
 car = [1, 2, 3, 1, 3, 4, 5]
 df = pd.DataFrame({'time': time, 'x': x, 'y': y, 'car': car})
 df

         x       y      car
 time
  0     216     13       1
  0     218     12       2
  0     217     12       3
  1     280     110      1
  1     290     109      3
  2     130     3        4
  2     132     56       5

For each time point, I want to know which car is each car's nearest neighbour. For example:

df2

          car    nearest_neighbour    euclidean_distance  
 time
  0       1            3                    1.41
  0       2            3                    1.00
  0       3            1                    1.41
  1       1            3                    10.05
  1       3            1                    10.05
  2       4            5                    53.04
  2       5            4                    53.04

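I know I can compute the pairwise distances between cars by following "How to apply euclidean distance function to a groupby object in pandas dataframe?", but how do I then get the nearest neighbour for each car?

After that, getting the average distance for each frame with groupby seems simple enough, but it is the second step that really stumps me. Thanks for your help!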

- UserR6

2

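Possible duplicate of: How to apply euclidean distance function to a groupby object in pandas dataframe? - Haleemur Ali

Hi, I used the same example, but I am trying to ask a different question here. - UserR6

Ah, it was not clear to me what the difference between this question and the other one was. The final expected output looks exactly the same. Please edit your question. Close vote retracted. - Haleemur Ali

Edited, hopefully it is clearer now! - UserR6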

2 Answers

5

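Use the cdist function from scipy.spatial.distance to get a matrix holding the distance from every car to every other car. Since each car's distance to itself is 0, the diagonal elements are all 0.

Example (time == 0):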

import numpy as np
from scipy.spatial.distance import cdist

X = df[df.time==0][['x','y']]
dist = cdist(X, X)
dist
array([[0.        , 2.23606798, 1.41421356],
       [2.23606798, 0.        , 1.        ],
       [1.41421356, 1.        , 0.        ]])

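Use np.argsort to get the indices that sort each row of the distance matrix. The first column is just the row number, because the diagonal elements are 0 (assuming no two cars share the same position).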

idx = np.argsort(dist)
idx
array([[0, 2, 1],
       [1, 2, 0],
       [2, 1, 0]], dtype=int64)

Then simply use idx to select the nearest distances and the corresponding cars.

dist[idx[:,0], idx[:,1]]
array([1.41421356, 1.        , 1.        ])

df[df.time==0].car.values[idx[:,1]]
array([3, 3, 2], dtype=int64)
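
The argsort step relies on each car's own zero distance sorting first. As a minimal sketch of an alternative (not part of the original answer), the diagonal can be masked with infinity and the nearest neighbour read off with argmin. The names `frame`, `pts`, `nearest_pos`, `nearest_dist` and `nearest_car` are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist

# Example data from the question
df = pd.DataFrame({'time': [0, 0, 0, 1, 1, 2, 2],
                   'x': [216, 218, 217, 280, 290, 130, 132],
                   'y': [13, 12, 12, 110, 109, 3, 56],
                   'car': [1, 2, 3, 1, 3, 4, 5]})

frame = df[df.time == 0]
pts = frame[['x', 'y']].to_numpy()
dist = cdist(pts, pts)

# Assumption: a car must never be reported as its own neighbour,
# so the self-distances on the diagonal are replaced by infinity.
np.fill_diagonal(dist, np.inf)

nearest_pos = dist.argmin(axis=1)             # position of each car's nearest neighbour
nearest_dist = dist.min(axis=1)               # distance to that neighbour
nearest_car = frame.car.values[nearest_pos]   # map positions back to car labels

print(nearest_car)
print(nearest_dist)
```

With this variant a frame containing a single car would get an infinite distance, so such frames may need to be filtered out first.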

Combining the logic above into a function that returns the required DataFrame:

 def closest(df):
     X = df[['x', 'y']]
     dist = cdist(X, X)
     v = np.argsort(dist)
     return df.assign(euclidean_distance=dist[v[:, 0], v[:, 1]],
                      nearest_neighbour=df.car.values[v[:, 1]])

Then use it with groupby, and finally drop the index, because groupby-apply adds an extra index level:

df.groupby('time').apply(closest).reset_index(drop=True)

   time    x    y  car  euclidean_distance  nearest_neighbour
0     0  216   13    1            1.414214                  3
1     0  218   12    2            1.000000                  3
2     0  217   12    3            1.000000                  2
3     1  280  110    1           10.049876                  3
4     1  290  109    3           10.049876                  1
5     2  130    3    4           53.037722                  5
6     2  132   56    5           53.037722                  4
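
The question also asks for the average nearest-neighbour distance at each time point. A minimal sketch of that last step follows, reusing the `closest` logic from this answer. Iterating over the groups with `pd.concat` instead of groupby-apply is an assumption made here so that the `time` column is kept regardless of pandas version; `result` and `mean_distance` are illustrative names.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist

# Example data from the question
df = pd.DataFrame({'time': [0, 0, 0, 1, 1, 2, 2],
                   'x': [216, 218, 217, 280, 290, 130, 132],
                   'y': [13, 12, 12, 110, 109, 3, 56],
                   'car': [1, 2, 3, 1, 3, 4, 5]})

def closest(group):
    # Pairwise distances within one time frame
    pts = group[['x', 'y']].to_numpy()
    dist = cdist(pts, pts)
    # Column 0 of the argsort is the car itself, column 1 its nearest neighbour
    v = np.argsort(dist)
    return group.assign(euclidean_distance=dist[v[:, 0], v[:, 1]],
                        nearest_neighbour=group.car.values[v[:, 1]])

# Apply the function to every time frame and stack the results
result = pd.concat([closest(g) for _, g in df.groupby('time')], ignore_index=True)

# Average nearest-neighbour distance per time point
mean_distance = result.groupby('time')['euclidean_distance'].mean()
print(mean_distance)
```

For time 0 this averages the three distances in the table above, (1.414214 + 1 + 1) / 3, which is roughly 1.138.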

By the way, your example output for time 0 is wrong. Both my answer and Bacon's answer show the correct result.

- Haleemur Ali

- Bacon · Accepted Answer

This may be a bit heavy-handed, but you can use the nearest neighbors algorithm from scikit-learn.

Example:

import numpy as np 
from sklearn.neighbors import NearestNeighbors
import pandas as pd

def nn(x):
    nbrs = NearestNeighbors(n_neighbors=2, algorithm='auto', metric='euclidean').fit(x)
    distances, indices = nbrs.kneighbors(x)
    return distances, indices

time = [0, 0, 0, 1, 1, 2, 2]
x = [216, 218, 217, 280, 290, 130, 132]
y = [13, 12, 12, 110, 109, 3, 56] 
car = [1, 2, 3, 1, 3, 4, 5]
df = pd.DataFrame({'time': time, 'x': x, 'y': y, 'car': car})

# For each time, store the within-group index of every car's nearest neighbour and the distance to it.
# (DataFrame.as_matrix() has been removed from pandas; to_numpy() replaces it.)
groups = df.groupby('time')
nns = {t: nn(g[['x', 'y']].to_numpy()) for t, g in groups}

nn_rows = []
for i, nn_set in nns.items():  # i is the time value of the group
    group = groups.get_group(i)
    for j, tup in enumerate(zip(nn_set[0], nn_set[1])):
        nn_rows.append({'time': i,
                        'car': group.iloc[j]['car'],
                        'nearest_neighbour': group.iloc[tup[1][1]]['car'],
                        'euclidean_distance': tup[0][1]})

nn_df = pd.DataFrame(nn_rows).set_index('time')

Result:

      car  euclidean_distance  nearest_neighbour
time                                            
0       1            1.414214                  3
0       2            1.000000                  3
0       3            1.000000                  2
1       1           10.049876                  3
1       3           10.049876                  1
2       4           53.037722                  5
2       5           53.037722                  4

(Note that at time 0, car 3's nearest neighbour is car 2: sqrt((217-216)**2 + 1) is approximately 1.4142135623730951, whereas sqrt((218-217)**2 + 0) = 1.)
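
The arithmetic in the note above can be checked with a few lines of standard-library Python. This is only a sanity check on the example coordinates; the variable names are illustrative.

```python
import math

# Positions at time 0, taken from the question's example data
car1, car2, car3 = (216, 13), (218, 12), (217, 12)

d31 = math.dist(car3, car1)  # sqrt(1**2 + 1**2)
d32 = math.dist(car3, car2)  # sqrt(1**2 + 0**2)

print(d31, d32)
print("car 3 is closer to car 2:", d32 < d31)
```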