Find the nearest point in another dataframe (with a lot of data)

Question

python pandas optimization nearest-neighbor geopandas

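The problem is simple. I have two dataframes:

A dataframe of 90,000 apartments with their latitude/longitude
A dataframe of 3,000 pharmacies with their latitude/longitude

I want to create a new variable for every apartment: "distance to the nearest pharmacy".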

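To do this, I tried two methods, both of which take too much time:

First method: I build a matrix with my apartments as rows and my pharmacies as columns, with the distance between them at each intersection; then I just take the minimum of each row to get a column vector of 90,000 values.

I simply used a double for loop with numpy: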

import numpy as np

m, n = len(result['latitude']), len(pharma['lat'])
M = np.ones((m, n))
for i in range(m):
    for j in range(n):
        # only compare apartments and pharmacies in the same département
        if result['Code departement'][i] == pharma['departement'][j]:
            M[i, j] = ((pharma['lat'][j] - result['latitude'][i])**2
                       + (pharma['lng'][j] - result['longitude'][i])**2)

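Note: I know this formula is not correct for latitude/longitude, but the apartments are all in the same region, so it is a good approximation.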
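For reference, the distance-matrix idea described above can also be written with NumPy broadcasting instead of a Python double loop. The following is only a sketch under assumptions: the function name `nearest_sq_dist`, the `chunk` parameter and the random sample coordinates are made up for illustration, and the département filter from the loop is left out. It processes the apartments in chunks so that the full 90,000 x 3,000 matrix never has to be held in memory at once.

```python
import numpy as np

def nearest_sq_dist(apts, pharmas, chunk=2_000):
    """For each row of `apts`, return the smallest squared coordinate
    difference to any row of `pharmas` (same approximation as the loop)."""
    out = np.empty(len(apts))
    for start in range(0, len(apts), chunk):
        block = apts[start:start + chunk]                # shape (c, 2)
        diff = block[:, None, :] - pharmas[None, :, :]   # shape (c, p, 2)
        # sum over the coordinate axis, then take the minimum over pharmacies
        out[start:start + chunk] = (diff ** 2).sum(axis=2).min(axis=1)
    return out

# Hypothetical sample data standing in for the real lat/lon columns
rng = np.random.default_rng(0)
apts = rng.random((1_000, 2))
pharmas = rng.random((50, 2))
nearest = nearest_sq_dist(apts, pharmas, chunk=256)
```

This still computes every apartment/pharmacy pair, so it only removes the interpreter overhead of the loop; it does not reduce the number of distance evaluations.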

Second method: I used the solution from this thread (a similar problem, though with less data): https://gis.stackexchange.com/questions/222315/geopandas-find-nearest-point-in-other-dataframe

I used GeoPandas and the nearest_points method:

from shapely.ops import nearest_points

# merge all pharmacy points into a single geometry
pts3 = pharma.geometry.unary_union


def near(point, pts=pts3):
    # boolean mask selecting the pharmacy whose geometry is the nearest point
    nearest = pharma.geometry == nearest_points(point, pts)[1]
    return pharma[nearest].geometry.values[0]

appart['Nearest'] = appart.apply(lambda row: near(row.geometry), axis=1)

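As I said, both methods take far too long: after running for an hour, my computer/notebook crashed and the computation failed.

My final question: do you have an optimized approach that would run faster? Is that even possible? If this is already as optimized as it gets, I will buy another computer, but what criteria should I look for in a PC capable of doing this kind of computation quickly?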

- Arnaud Hureaux

I think you should follow the second answer to the question you linked, i.e. use a spatial index to avoid computing all pairwise distances. - High Performance Mark

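Do you have an example? I was under the impression that the second solution already uses the spatial index from geopandas, yet it made no difference to the time taken. - Arnaud Hureaux

Then I misread your code, and my previous comment was wrong. - High Performance Mark

Just to clarify, the second option, based on shapely, does not use a spatial index. - martinfleis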

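No, it must be me not understanding what a spatial index is. Do you have an example? Or a link? - Arnaud Hureaux

You can start with https://geoffboeing.com/2016/10/r-tree-spatial-index-python/ but keep in mind that it is written for intersections. I implemented something similar here: https://docs.momepy.org/en/stable/_modules/momepy/elements.html#get_network_id . Hope it helps. - martinfleis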

1 Answer

- mgc · Accepted Answer

I think a Ball Tree is an appropriate structure for this task.

You can use the scikit-learn implementation; the code below is an example adapted to your case:

import numpy as np
import geopandas as gpd
from shapely.geometry import Point
from sklearn.neighbors import BallTree

# Create the two GeoDataFrames to replicate your dataset
appart = gpd.GeoDataFrame([{
        'geometry': Point(a, b),
        'x': a,
        'y': b,
    } for a, b in zip(np.random.rand(100000), np.random.rand(100000))
])

pharma = gpd.GeoDataFrame([{
        'geometry': Point(a, b),
        'x': a,
        'y': b,
    } for a, b in zip(np.random.rand(3000), np.random.rand(3000))
])

# Create a BallTree 
tree = BallTree(pharma[['x', 'y']].values, leaf_size=2)

# Query the BallTree on each feature from 'appart' to find the distance
# to the nearest 'pharma' and its id
appart['distance_nearest'], appart['id_nearest'] = tree.query(
    appart[['x', 'y']].values, # The input array for the query
    k=1, # The number of nearest neighbors
)

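With this approach you can solve the problem fairly quickly (in the example above, on my computer, finding the index of the nearest of 3,000 points for each point of a 100,000-point input dataset took less than a second).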
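Since the real data is latitude/longitude, one possible variation is to build the tree with scikit-learn's `haversine` metric, which expects `[latitude, longitude]` in radians and returns great-circle distances in radians. The sketch below is an illustration under assumptions, not part of the example above: the coordinates are random points in a made-up bounding box, and `EARTH_RADIUS_KM` is an approximate mean radius used to convert the result to kilometres.

```python
import numpy as np
from sklearn.neighbors import BallTree

EARTH_RADIUS_KM = 6371.0  # approximate mean Earth radius (assumption)

rng = np.random.default_rng(42)
# Hypothetical coordinates in degrees: columns are [latitude, longitude]
pharma_deg = np.column_stack([rng.uniform(48.0, 49.0, 300),
                              rng.uniform(2.0, 3.0, 300)])
appart_deg = np.column_stack([rng.uniform(48.0, 49.0, 5000),
                              rng.uniform(2.0, 3.0, 5000)])

# The haversine metric works on radians, so convert before building/querying
tree = BallTree(np.radians(pharma_deg), metric="haversine")
dist_rad, idx = tree.query(np.radians(appart_deg), k=1)

# query returns arrays of shape (n_queries, k); take the single column
distance_km = dist_rad[:, 0] * EARTH_RADIUS_KM
nearest_id = idx[:, 0]
```

With a real dataframe, the two arrays would come from the latitude and longitude columns (in that order) rather than from random numbers.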

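By default, the query method of BallTree returns both the distance to the nearest neighbor and its id. If you do not want the distance returned, set the return_distance parameter to False. If you really only care about the distance, you can keep just that value: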

appart['distance_nearest'], _ = tree.query(appart[['x', 'y']].values, k=1)
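The opposite case, keeping only the ids, could look roughly like the sketch below. The array names and random data are assumptions for illustration; the only library feature relied on is the `return_distance` parameter of `BallTree.query`.

```python
import numpy as np
from sklearn.neighbors import BallTree

rng = np.random.default_rng(1)
pharma_xy = rng.random((3000, 2))   # hypothetical pharmacy coordinates
appart_xy = rng.random((1000, 2))   # hypothetical apartment coordinates

tree = BallTree(pharma_xy, leaf_size=2)

# With return_distance=False only the index array is returned,
# with shape (n_queries, k)
idx_only = tree.query(appart_xy, k=1, return_distance=False)
nearest_id = idx_only[:, 0]

# Default call for comparison: returns (distances, indices)
dist, idx = tree.query(appart_xy, k=1)
```

Both calls return 2-D arrays, so taking column 0 gives a flat vector with one entry per apartment.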