Finding the closest point in a Pandas DataFrame

6

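I'm fairly new to Python. I have the following table in Postgres. Each polygon is described by four coordinates that share the same Id and Zone name. I have stored this data in a Python dataframe called df1.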

Id  Order   Lat              Lon            Zone
00001   1   50.6373473  3.075029928          A
00001   2   50.63740441 3.075068636          A
00001   3   50.63744285 3.074951754          A 
00001   4   50.63737839 3.074913884          A 
00002   1   50.6376054  3.0750528            B
00002   2   50.6375896  3.0751209            B
00002   3   50.6374239  3.0750246            B
00002   4   50.6374404  3.0749554            B

I also have some JSON data containing Lon and Lat values, which I have stored in a Python dataframe called df2.

Lat                  Lon
50.6375524099   3.07507914474
50.6375714407   3.07508201591

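My task is to compare the Lat and Lon values in df2 against the four coordinates of each zone in df1, extract the zone name, and add it to df2.
For example, (50.6375524099, 3.07507914474) belongs to Zone B.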
import numpy as np
import pandas as pd
import matplotlib.path as mplPath

# Zone polygons (Id, Order, Lat, Lon, Zone); `engine` is an existing database connection
df1 = pd.read_sql_query("""SELECT * from "zmap" """, con=engine)
# Points with lat, lon values
df2 = pd.read_sql_query("""SELECT * from "E1" """, con=engine)
df2['latlon'] = list(zip(df2.lat, df2.lon))
zones = [
    ["A", [[50.637347297, 3.075029928], [50.637404408, 3.075068636], [50.637442847, 3.074951754], [50.637378390, 3.074913884]]]]
for i in range(len(zones)):  # for each zone's points
    X = mplPath.Path(np.array(zones[i][1]))
    # check which points fall inside the current zone
    Y = X.contains_points(df2.latlon.values.tolist())
    # label the points that are in the current zone
    df2.loc[Y, 'zone'] = zones[i][0]

So far I have done this manually for zone 'A'. I need to generate the "zone" for every coordinate in df2.


1
Sorry, SO is not a tutorial or free code-writing site. Please show your attempt first. - Julien
Just use df2.merge(df1). - Cainã Max Couto-Silva
2 Answers

14

This looks like a good use case for scipy cdist, which is also discussed here.


import pandas as pd
from scipy.spatial.distance import cdist


data1 = {'Lat': pd.Series([50.6373473,50.63740441,50.63744285,50.63737839,50.6376054,50.6375896,50.6374239,50.6374404]),
         'Lon': pd.Series([3.075029928,3.075068636,3.074951754,3.074913884,3.0750528,3.0751209,3.0750246,3.0749554]),
         'Zone': pd.Series(['A','A','A','A','B','B','B','B'])}

data2 = {'Lat': pd.Series([50.6375524099,50.6375714407]),
         'Lon': pd.Series([3.07507914474,3.07508201591])}


def closest_point(point, points):
    """ Find closest point from a list of points. """
    return points[cdist([point], points).argmin()]

def match_value(df, col1, x, col2):
    """ Match value x from col1 row to value in col2. """
    return df[df[col1] == x][col2].values[0]


df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)

df1['point'] = [(x, y) for x,y in zip(df1['Lat'], df1['Lon'])]
df2['point'] = [(x, y) for x,y in zip(df2['Lat'], df2['Lon'])]

df2['closest'] = [closest_point(x, list(df1['point'])) for x in df2['point']]
df2['zone'] = [match_value(df1, 'point', x, 'Zone') for x in df2['closest']]

print(df2)
#    Lat        Lon       point                           closest                  zone
# 0  50.637552  3.075079  (50.6375524099, 3.07507914474)  (50.6375896, 3.0751209)  B
# 1  50.637571  3.075082  (50.6375714407, 3.07508201591)  (50.6375896, 3.0751209)  B

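Thanks for this excellent answer, it really helped me :) - Pad
The match_value logic is very expensive. A more efficient approach is to use merge, i.e. df2.merge(df1[['point','zone']].rename(columns={'point':'closest'})). - Chris Decker
That's great. - Seyed Omid Nabavi
Why is nobody bothered that cdist defaults to the Euclidean metric, which is not suitable for longitude and latitude coordinates? This implementation is also very inefficient. - padu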
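
The comments above raise two concerns: the per-row lookup is slow, and the Euclidean metric is a poor fit for latitude and longitude. As a rough sketch of how both could be addressed, the block below computes all pairwise great-circle (haversine) distances with NumPy broadcasting and picks the nearest df1 row for each df2 point. The helper name `haversine_matrix` and the mean Earth radius of 6,371,000 m are assumptions made for illustration; they are not part of the original answer.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    'Lat': [50.6373473, 50.63740441, 50.63744285, 50.63737839,
            50.6376054, 50.6375896, 50.6374239, 50.6374404],
    'Lon': [3.075029928, 3.075068636, 3.074951754, 3.074913884,
            3.0750528, 3.0751209, 3.0750246, 3.0749554],
    'Zone': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
})
df2 = pd.DataFrame({
    'Lat': [50.6375524099, 50.6375714407],
    'Lon': [3.07507914474, 3.07508201591],
})


def haversine_matrix(lat1, lon1, lat2, lon2, radius_m=6371000.0):
    """Hypothetical helper: great-circle distance in metres between every
    point of set 1 (rows) and every point of set 2 (columns)."""
    lat1, lon1, lat2, lon2 = (
        np.radians(np.asarray(a, dtype=float)) for a in (lat1, lon1, lat2, lon2)
    )
    dlat = lat2[None, :] - lat1[:, None]
    dlon = lon2[None, :] - lon1[:, None]
    a = (np.sin(dlat / 2) ** 2
         + np.cos(lat1)[:, None] * np.cos(lat2)[None, :] * np.sin(dlon / 2) ** 2)
    # Clip guards against tiny floating-point overshoot outside [0, 1]
    return 2 * radius_m * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))


# One distance matrix for all pairs: shape (len(df2), len(df1))
dist = haversine_matrix(df2['Lat'], df2['Lon'], df1['Lat'], df1['Lon'])

# Position of the nearest df1 row for each df2 point
nearest = dist.argmin(axis=1)

# Look up the zone by position instead of matching on tuples
df2['zone'] = df1['Zone'].to_numpy()[nearest]
df2['distance_m'] = dist[np.arange(len(df2)), nearest]

print(df2)
```

The distance matrix has len(df2) × len(df1) entries, so this is only a sketch for moderately sized tables. Note that it still answers "which vertex is nearest", not "which polygon contains the point".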

0

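Note that the current title of this post is "Finding the closest point in a Pandas DataFrame", but the OP's attempt shows that they are actually looking for the zone that contains a point.

The geopandas library can do this elegantly and efficiently.

First, convert the DataFrames to GeoDataFrames.

Then aggregate the points in df1 to build polygons. The aggregation operation is called dissolve; taking the convex hull of each dissolved group of points gives the zone polygon.

Finally, use a spatial join with a predicate specifying that the points in df2 are covered by the polygons representing the zones, and output the Lat, Lon and Zone columns.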

# set up
import pandas as pd
import geopandas as gpd

df1 = pd.DataFrame({
  'Id': [1, 1, 1, 1, 2, 2, 2, 2],
  'Order': [1, 2, 3, 4, 1, 2, 3, 4],
  'Lat': [50.6373473, 50.63740441, 50.63744285, 50.63737839, 50.6376054, 50.6375896, 50.6374239, 50.6374404], 
  'Lon': [3.075029928, 3.075068636, 3.074951754, 3.074913884, 3.0750528, 3.0751209, 3.0750246, 3.0749554],
 'Zone': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B']
})

df2 = pd.DataFrame({
  'Lat': [50.6375524099, 50.6375714407],
  'Lon': [3.07507914474, 3.07508201591] 
})

# convert to GeoDataFrame
df1 = gpd.GeoDataFrame(df1, geometry=gpd.points_from_xy(df1.Lon, df1.Lat))
df2 = gpd.GeoDataFrame(df2, geometry=gpd.points_from_xy(df2.Lon, df2.Lat))

# aggregate & merge
zones = df1.dissolve(by='Zone').convex_hull.rename('geometry').reset_index()
merged = df2.sjoin(zones, how='left', predicate='covered_by')

# output
output_columns = ['Lat', 'Lon', 'Zone']
merged[output_columns]

This outputs:

         Lat       Lon Zone
0  50.637552  3.075079    B
1  50.637571  3.075082    B
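
If geopandas is not available, the same containment test can be sketched with matplotlib's Path class, which the question already uses for zone A. The block below is a minimal illustration that assumes the Order column gives the vertex order of each polygon; the variable names `points` and `zone_of` are made up for this example. How points lying exactly on a polygon edge are classified may differ from the `covered_by` predicate used above.

```python
import numpy as np
import pandas as pd
from matplotlib.path import Path

df1 = pd.DataFrame({
    'Id': [1, 1, 1, 1, 2, 2, 2, 2],
    'Order': [1, 2, 3, 4, 1, 2, 3, 4],
    'Lat': [50.6373473, 50.63740441, 50.63744285, 50.63737839,
            50.6376054, 50.6375896, 50.6374239, 50.6374404],
    'Lon': [3.075029928, 3.075068636, 3.074951754, 3.074913884,
            3.0750528, 3.0751209, 3.0750246, 3.0749554],
    'Zone': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
})
df2 = pd.DataFrame({
    'Lat': [50.6375524099, 50.6375714407],
    'Lon': [3.07507914474, 3.07508201591],
})

# Points to classify, as (x, y) = (Lon, Lat) pairs
points = df2[['Lon', 'Lat']].to_numpy()

# Start with "no zone" for every point
zone_of = np.full(len(df2), None, dtype=object)

# Build one polygon per zone, keeping the vertices in their stated order
for zone, group in df1.sort_values('Order').groupby('Zone'):
    polygon = Path(group[['Lon', 'Lat']].to_numpy())
    inside = polygon.contains_points(points)
    zone_of[inside] = zone

df2['Zone'] = zone_of
print(df2)
```

Points that fall in no polygon keep the value None, so unmatched rows are easy to spot afterwards.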
