Comparing columns in two separate pandas DataFrames

Question

python pandas

3

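I have two DataFrames, each with latitude and longitude columns. For every lat/lon entry in the first DataFrame, I want to evaluate it against every lat/lon pair in the second DataFrame to determine the distance between them.

For example:

df1:
     lat     lon
0  38.32 -100.50
1  42.51  -97.39
2  33.45 -103.21

df2:
     lat     lon
0  37.65  -97.87
1  33.31  -96.40
2  36.22 -100.01

distance between 38.32,-100.50 and 37.65,-97.87
distance between 38.32,-100.50 and 33.31,-96.40
distance between 38.32,-100.50 and 36.22,-100.01
distance between 42.51,-97.39 and 37.65,-97.87
distance between 42.51,-97.39 and 33.31,-96.40
...and so on...

I'm not sure how to go about this. Thanks for your help!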

- user1985891

3 Answers

3

Update: as @root pointed out, the Euclidean metric doesn't make much sense in this case, so let's use sklearn.neighbors.DistanceMetric instead:

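import numpy as np
import pandas as pd
from sklearn.neighbors import DistanceMetric  # newer scikit-learn releases expose this class as sklearn.metrics.DistanceMetric
dist = DistanceMetric.get_metric('haversine')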

First, we can build a DataFrame with all combinations (credit: @root):

x = pd.merge(df1.assign(k=1), df2.assign(k=1), on='k', suffixes=('1', '2')) \
      .drop('k', axis=1)
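On pandas 1.2 or later, the same cross join can probably be written without the helper key `k`. The following is a minimal sketch under that version assumption; the variable name `pairs` is illustrative and not part of the original answer.

```python
import pandas as pd

# Example frames from the question.
df1 = pd.DataFrame({'lat': [38.32, 42.51, 33.45],
                    'lon': [-100.50, -97.39, -103.21]})
df2 = pd.DataFrame({'lat': [37.65, 33.31, 36.22],
                    'lon': [-97.87, -96.40, -100.01]})

# how='cross' pairs every row of df1 with every row of df2 (pandas >= 1.2).
# The suffixes rename the overlapping columns to lat1/lon1 and lat2/lon2.
pairs = df1.merge(df2, how='cross', suffixes=('1', '2'))
print(pairs)
```

The row order is the same as with the `k=1` trick: all df2 rows for the first df1 row, then all df2 rows for the second, and so on.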

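Vectorized "haversine" distance calculation (6367 is an approximate Earth radius in kilometers, so the result is in km):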

x['dist'] = np.ravel(dist.pairwise(np.radians(df1),np.radians(df2)) * 6367)

Result:

In [86]: x
Out[86]:
    lat1    lon1   lat2    lon2         dist
0  38.32 -100.50  37.65  -97.87   242.073182
1  38.32 -100.50  33.31  -96.40   667.993048
2  38.32 -100.50  36.22 -100.01   237.350451
3  42.51  -97.39  37.65  -97.87   541.605087
4  42.51  -97.39  33.31  -96.40  1026.006744
5  42.51  -97.39  36.22 -100.01   734.219411
6  33.45 -103.21  37.65  -97.87   671.274044
7  33.45 -103.21  33.31  -96.40   632.004981
8  33.45 -103.21  36.22 -100.01   424.140594

Original answer (before the update):

If I understood correctly, you can use the pairwise distance function scipy.spatial.distance.pdist:

In [32]: from scipy.spatial.distance import pdist

In [43]: from itertools import combinations

In [34]: X = pd.concat([df1, df2])

In [35]: X
Out[35]:
     lat     lon
0  38.32 -100.50
1  42.51  -97.39
2  33.45 -103.21
0  37.65  -97.87
1  33.31  -96.40
2  36.22 -100.01

As a pandas Series:

In [36]: s = pd.Series(pdist(X),
                       index=pd.MultiIndex.from_tuples(tuple(combinations(X.index, 2))))

In [37]: s
Out[37]:
0  1     5.218065
   2     5.573240
   0     2.714001
   1     6.473801
   2     2.156409
1  2    10.768287
   0     4.883646
   1     9.253113
   2     6.813846
2  0     6.793791
   1     6.811439
   2     4.232363
0  1     4.582194
   2     2.573810
1  2     4.636831
dtype: float64

As a pandas DataFrame:

In [46]: s.rename_axis(['df1','df2']).reset_index(name='dist')
Out[46]:
    df1  df2       dist
0     0    1   5.218065
1     0    2   5.573240
2     0    0   2.714001
3     0    1   6.473801
4     0    2   2.156409
5     1    2  10.768287
6     1    0   4.883646
7     1    1   9.253113
8     1    2   6.813846
9     2    0   6.793791
10    2    1   6.811439
11    2    2   4.232363
12    0    1   4.582194
13    0    2   2.573810
14    1    2   4.636831
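Note that pdist on the concatenated frame returns all 15 pairs among the six points, including pairs within the same frame. If only the nine df1-to-df2 pairs are wanted, a cross-distance matrix is one option. The sketch below uses NumPy broadcasting and still computes plain Euclidean distance in degrees; scipy.spatial.distance.cdist(df1, df2) should produce the same matrix. The names `dists` and `long_form` are illustrative.

```python
import numpy as np
import pandas as pd

# Example frames from the question.
df1 = pd.DataFrame({'lat': [38.32, 42.51, 33.45],
                    'lon': [-100.50, -97.39, -103.21]})
df2 = pd.DataFrame({'lat': [37.65, 33.31, 36.22],
                    'lon': [-97.87, -96.40, -100.01]})

a = df1[['lat', 'lon']].to_numpy()
b = df2[['lat', 'lon']].to_numpy()

# Broadcasting: (3, 1, 2) - (1, 3, 2) -> (3, 3, 2) coordinate differences.
diff = a[:, None, :] - b[None, :, :]
dists = np.sqrt((diff ** 2).sum(axis=2))  # dists[i, j]: df1 row i vs df2 row j

# Long format with one row per (df1 row, df2 row) pair.
long_form = pd.DataFrame({
    'df1': np.repeat(df1.index.to_numpy(), len(df2)),
    'df2': np.tile(df2.index.to_numpy(), len(df1)),
    'dist': dists.ravel(),
})
print(long_form)
```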

- MaxU - stand with Ukraine

1

Euclidean distance isn't very meaningful as a direct measure between latitude/longitude coordinates. - root

@root, this is interesting: the "haversine" metric gives distances that are very close to those from the "vincenty" method, but not exactly the same... - MaxU - stand with Ukraine

1

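Yes, the Haversine formula assumes a spherical Earth, but the Earth isn't a perfect sphere; it bulges slightly at the equator (an oblate spheroid). The Vincenty formula accounts for this. In most cases Haversine should be very close to Vincenty, especially over relatively short distances. The largest differences show up for points on opposite sides of the Earth (antipodal points). - root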

@root, thank you very much for such a detailed and clear explanation! - MaxU - stand with Ukraine
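For reference, the haversine formula discussed in these comments can be written as below. This assumes a spherical Earth of radius $r$, with latitudes $\varphi_1, \varphi_2$ and longitudes $\lambda_1, \lambda_2$ in radians; the symbols are chosen here for illustration and do not come from the thread.

```latex
d = 2r \arcsin\left( \sqrt{ \sin^2\left(\frac{\varphi_2 - \varphi_1}{2}\right)
    + \cos\varphi_1 \, \cos\varphi_2 \, \sin^2\left(\frac{\lambda_2 - \lambda_1}{2}\right) } \right)
```

With $r \approx 6367$ km, as in the answer above, $d$ comes out in kilometers.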

3

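You can perform a cross join to get all lat/lon combinations, then compute the distances with an appropriate measure. The geopy package provides two functions for this: geopy.distance.vincenty and geopy.distance.great_circle. Both should give valid distances; vincenty is more accurate but slower to compute.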

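import pandas as pd
# Note: vincenty was removed in geopy 2.0; geopy.distance.geodesic is its replacement there.
from geopy.distance import vincenty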

# Function to compute distances.
def get_lat_lon_dist(row):
    # Store lat/long as tuples for input into distance functions.
    latlon1 = tuple(row[['lat1', 'lon1']])
    latlon2 = tuple(row[['lat2', 'lon2']])

    # Compute the distance.
    return vincenty(latlon1, latlon2).km

# Perform a cross-join to get all combinations of lat/lon.
dist = pd.merge(df1.assign(k=1), df2.assign(k=1), on='k', suffixes=('1', '2')) \
         .drop('k', axis=1)

# Compute the distances between lat/longs
dist['distance'] = dist.apply(get_lat_lon_dist, axis=1)

I used kilometers in the example, but other units can be specified too, for example:

vincenty(latlon1, latlon2).miles

Resulting output:

    lat1    lon1   lat2    lon2     distance
0  38.32 -100.50  37.65  -97.87   242.709065
1  38.32 -100.50  33.31  -96.40   667.878723
2  38.32 -100.50  36.22 -100.01   237.080141
3  42.51  -97.39  37.65  -97.87   541.184297
4  42.51  -97.39  33.31  -96.40  1024.839512
5  42.51  -97.39  36.22 -100.01   733.819732
6  33.45 -103.21  37.65  -97.87   671.766908
7  33.45 -103.21  33.31  -96.40   633.751134
8  33.45 -103.21  36.22 -100.01   424.335874

Edit

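As @MaxU pointed out in the comments, you could similarly use a numpy implementation of the Haversine formula for extra performance. This should be equivalent to the great_circle function in geopy.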
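A numpy implementation along those lines might look like the sketch below. It is not code from the original answer: the function name `haversine_matrix` is hypothetical, and the 6367 km Earth radius is an assumption borrowed from the sklearn-based answer in this thread.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6367.0  # assumed mean Earth radius, same value as the sklearn example


def haversine_matrix(a, b, radius=EARTH_RADIUS_KM):
    """Great-circle distance from every row of `a` to every row of `b`.

    `a` and `b` are DataFrames with 'lat' and 'lon' columns in degrees.
    Returns an array of shape (len(a), len(b)) in the units of `radius`.
    """
    # Column vectors for `a`, row vectors for `b`, so broadcasting yields a matrix.
    lat1 = np.radians(a['lat'].to_numpy())[:, None]
    lon1 = np.radians(a['lon'].to_numpy())[:, None]
    lat2 = np.radians(b['lat'].to_numpy())[None, :]
    lon2 = np.radians(b['lon'].to_numpy())[None, :]

    h = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * radius * np.arcsin(np.sqrt(h))


# Example frames from the question.
df1 = pd.DataFrame({'lat': [38.32, 42.51, 33.45],
                    'lon': [-100.50, -97.39, -103.21]})
df2 = pd.DataFrame({'lat': [37.65, 33.31, 36.22],
                    'lon': [-97.87, -96.40, -100.01]})

dist_km = haversine_matrix(df1, df2)  # dist_km[i, j]: df1 row i to df2 row j
print(dist_km)
```

Calling `dist_km.ravel()` gives the distances in the same row order as the cross-joined frame, so it can be assigned directly as a new column.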

- root

1

I think you could use the vectorized haversine formula. - MaxU - stand with Ukraine

@MaxU: Thanks, I did a quick search for a haversine implementation but only found a pure-Python version. - root

- piRSquared · Accepted Answer

Euclidean distance is computed as follows:

$$d = \sqrt{(\mathrm{lat}_1 - \mathrm{lat}_2)^2 + (\mathrm{lon}_1 - \mathrm{lon}_2)^2}$$

You can compute it row by row for the two DataFrames (rows are matched on their index) with the following code:

((df1 - df2) ** 2).sum(axis=1) ** .5

0    2.714001
1    9.253113
2    4.232363
dtype: float64