Python: show the distance from a location to the nearest other location

10

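I have locations in a dataframe, stored under latitude and longitude columns. I want to show the distance from each one to the nearest train station, which is held in a separate dataframe.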

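For example, suppose I have a point with latitude and longitude (-37.814563, 144.970267) and the list of other geospatial points below. I want to find the closest point, then compute the distance between the two and store it as an extra column in the suburbs dataframe.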

Here is a sample of the train dataset.

<bound method NDFrame.to_clipboard of   STOP_ID                                          STOP_NAME   LATITUDE  \
0   19970             Royal Park Railway Station (Parkville) -37.781193   
1   19971  Flemington Bridge Railway Station (North Melbo... -37.788140   
2   19972         Macaulay Railway Station (North Melbourne) -37.794267   
3   19973   North Melbourne Railway Station (West Melbourne) -37.807419   
4   19974        Clifton Hill Railway Station (Clifton Hill) -37.788657   

    LONGITUDE TICKETZONE                                          ROUTEUSSP  \
0  144.952301          1                                            Upfield   
1  144.939323          1                                            Upfield   
2  144.936166          1                                            Upfield   
3  144.942570          1  Flemington,Sunbury,Upfield,Werribee,Williamsto...   
4  144.995417          1                                 Mernda,Hurstbridge   

                      geometry  
0  POINT (144.95230 -37.78119)  
1  POINT (144.93932 -37.78814)  
2  POINT (144.93617 -37.79427)  
3  POINT (144.94257 -37.80742)  
4  POINT (144.99542 -37.78866)  >

Here is a sample of the suburbs.

<bound method NDFrame.to_clipboard of       postcode              suburb state        lat         lon
4901      3000           MELBOURNE   VIC -37.814563  144.970267
4902      3002      EAST MELBOURNE   VIC -37.816640  144.987811
4903      3003      WEST MELBOURNE   VIC -37.806255  144.941123
4904      3005  WORLD TRADE CENTRE   VIC -37.822262  144.954856
4905      3006           SOUTHBANK   VIC -37.823258  144.965926>

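What I want is a new column in the suburbs list that holds the distance from each suburb's latitude and longitude to the nearest train station.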

One of the solutions I tried gives strange output, and I am not sure whether it is correct.

Both solutions are shown below.

from sklearn.neighbors import NearestNeighbors
from haversine import haversine

NN = NearestNeighbors(n_neighbors=1, metric='haversine')
NN.fit(trains_shape[['LATITUDE', 'LONGITUDE']])

indices = NN.kneighbors(df_complete[['lat', 'lon']])[1]
indices = [index[0] for index in indices]
distances = NN.kneighbors(df_complete[['lat', 'lon']])[0]
df_complete['closest_station'] = trains_shape.iloc[indices]['STOP_NAME'].reset_index(drop=True)
df_complete['closest_station_distances'] = distances
print(df_complete)

Here is the output:

<bound method NDFrame.to_clipboard of    postcode        suburb state        lat         lon  Venues Cluster  \
1      3040    aberfeldie   VIC -37.756690  144.896259             4.0   
2      3042  airport west   VIC -37.711698  144.887037             1.0   
4      3206   albert park   VIC -37.840705  144.955710             0.0   
5      3020        albion   VIC -37.775954  144.819395             2.0   
6      3078    alphington   VIC -37.780767  145.031160             4.0   

                     #1                    #2             #3  \
1                  Café     Electronics Store  Grocery Store   
2  Fast Food Restaurant                  Café    Supermarket   
4                  Café                   Pub    Coffee Shop   
5                  Café  Fast Food Restaurant  Grocery Store   
6                  Café                  Park            Bar   

                      #4  ...                             #6  \
1            Coffee Shop  ...                         Bakery   
2          Grocery Store  ...             Italian Restaurant   
4         Breakfast Spot  ...                   Burger Joint   
5  Vietnamese Restaurant  ...                            Pub   
6            Pizza Place  ...  Vegetarian / Vegan Restaurant   

                      #7                   #8                         #9  \
1          Shopping Mall  Japanese Restaurant          Indian Restaurant   
2  Portuguese Restaurant    Electronics Store  Middle Eastern Restaurant   
4                    Bar               Bakery                  Gastropub   
5     Chinese Restaurant                  Gym                     Bakery   
6     Italian Restaurant            Gastropub                     Bakery   

                 #10 Ancestry Cluster  ClosestStopId  \
1   Greek Restaurant              8.0          20037   
2  Convenience Store              5.0          20032   
4              Beach              6.0          22180   
5  Convenience Store              5.0          20004   
6        Coffee Shop              5.0          19931   

                                   ClosestStopName  \
1              Essendon Railway Station (Essendon)   
2                Glenroy Railway Station (Glenroy)   
4  Southern Cross Railway Station (Melbourne City)   
5          Albion Railway Station (Sunshine North)   
6          Alphington Railway Station (Alphington)   

                                   closest_station closest_station_distances  
1                Glenroy Railway Station (Glenroy)                  0.019918  
2  Southern Cross Railway Station (Melbourne City)                  0.031020  
4          Alphington Railway Station (Alphington)                  0.023165  
5                  Altona Railway Station (Altona)                  0.005559  
6                Newport Railway Station (Newport)                  0.002375  

And here is the second function.

def ClosestStop(r):
    # Cartesian distance: square root of (x2-x1)^2 + (y2-y1)^2
    distances = ((r['lat']-StationDf['LATITUDE'])**2 + (r['lon']-StationDf['LONGITUDE'])**2)**0.5
    
    # Stop with minimum Distance from the Suburb
    closestStationId = distances[distances == distances.min()].index.to_list()[0]
    return StationDf.loc[closestStationId, ['STOP_ID', 'STOP_NAME']]

df_complete[['ClosestStopId', 'ClosestStopName']] = df_complete.apply(ClosestStop, axis=1)

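Strangely, this code gives a different answer, which makes me think something is wrong with the code. The km results do not look right either.

I have no idea how to solve this. I hope to get some guidance here, thanks!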


1
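You need three steps for this programming task: 1. write a function distance(lat1, lon1, lat2, lon2), 2. apply that function to every combination of suburb and station, 3. for each suburb, select the station with the smallest distance and add it to the dataframe. (Or use NearestNeighbors from sklearn.) - Niklas Mertsch
See the answers here: https://dev59.com/enRC5IYBdhLWcg3wP-dh - RichieV
1
In the first solution you use 'haversine' as the distance function in NN. That is sklearn's built-in haversine distance, which is expressed in radians. You can see the documentation link in my answer. To get the haversine distance in kilometres, pass the function from the imported haversine package as the metric in NN. This is also shown in my answer. - SoufianeK
Can you share how many cities and stations you want to compute distances for? I have not yet provided an example of the scalable BallTree algorithm here, which is what you will need as the numbers grow. - Willem Hendriks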
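The three steps from the first comment can be sketched with plain NumPy. This is a minimal illustration, not the commenter's own code: the function name haversine_km, the constant EARTH_RADIUS_KM, the reduced three-station sample, and the output column names are all assumptions made for this example.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0088  # assumed mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Step 1: great-circle distance in km; inputs in degrees, broadcastable."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


stations = pd.DataFrame({
    "STOP_NAME": ["Royal Park Railway Station (Parkville)",
                  "North Melbourne Railway Station (West Melbourne)",
                  "Clifton Hill Railway Station (Clifton Hill)"],
    "LATITUDE": [-37.781193, -37.807419, -37.788657],
    "LONGITUDE": [144.952301, 144.942570, 144.995417],
})
# A non-default index on purpose: results are assigned by position below.
suburbs = pd.DataFrame({
    "suburb": ["MELBOURNE", "EAST MELBOURNE", "WEST MELBOURNE"],
    "lat": [-37.814563, -37.816640, -37.806255],
    "lon": [144.970267, 144.987811, 144.941123],
}, index=[4901, 4902, 4903])

# Step 2: distance matrix, one row per suburb and one column per station.
dist = haversine_km(
    suburbs["lat"].to_numpy()[:, None], suburbs["lon"].to_numpy()[:, None],
    stations["LATITUDE"].to_numpy()[None, :], stations["LONGITUDE"].to_numpy()[None, :],
)

# Step 3: pick the nearest station per suburb and attach it as new columns.
nearest = dist.argmin(axis=1)
suburbs["closest_station"] = stations["STOP_NAME"].to_numpy()[nearest]
suburbs["closest_station_km"] = dist[np.arange(len(suburbs)), nearest]
print(suburbs)
```

The full distance matrix needs memory proportional to suburbs times stations, so this form suits small inputs only.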
4 Answers

7

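Some key concepts

  1. Take the Cartesian product of the two data frames to get every combination (joining on a column that holds the same value in both data frames, foo=1, does this)
  2. Once the two data sets are side by side, calculate the distance from the latitude and longitude, using geopy
  3. Tidy up the columns and use sort_values() so the shortest distance comes first
  4. Finally use groupby() and agg() to take the first row of each group, which is the shortest distance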

Two data frames are produced

  1. dfdist contains all combinations and distances
  2. dfnearest contains the result

import pandas as pd

dfstat = pd.DataFrame({'STOP_ID': ['19970', '19971', '19972', '19973', '19974'],
 'STOP_NAME': ['Royal Park Railway Station (Parkville)',
  'Flemington Bridge Railway Station (North Melbo...',
  'Macaulay Railway Station (North Melbourne)',
  'North Melbourne Railway Station (West Melbourne)',
  'Clifton Hill Railway Station (Clifton Hill)'],
 'LATITUDE': ['-37.781193',
  '-37.788140',
  '-37.794267',
  '-37.807419',
  '-37.788657'],
 'LONGITUDE': ['144.952301',
  '144.939323',
  '144.936166',
  '144.942570',
  '144.995417'],
 'TICKETZONE': ['1', '1', '1', '1', '1'],
 'ROUTEUSSP': ['Upfield',
  'Upfield',
  'Upfield',
  'Flemington,Sunbury,Upfield,Werribee,Williamsto...',
  'Mernda,Hurstbridge'],
 'geometry': ['POINT (144.95230 -37.78119)',
  'POINT (144.93932 -37.78814)',
  'POINT (144.93617 -37.79427)',
  'POINT (144.94257 -37.80742)',
  'POINT (144.99542 -37.78866)']})
dfsub = pd.DataFrame({'id': ['4901', '4902', '4903', '4904', '4905'],
 'postcode': ['3000', '3002', '3003', '3005', '3006'],
 'suburb': ['MELBOURNE',
  'EAST MELBOURNE',
  'WEST MELBOURNE',
  'WORLD TRADE CENTRE',
  'SOUTHBANK'],
 'state': ['VIC', 'VIC', 'VIC', 'VIC', 'VIC'],
 'lat': ['-37.814563', '-37.816640', '-37.806255', '-37.822262', '-37.823258'],
 'lon': ['144.970267', '144.987811', '144.941123', '144.954856', '144.965926']})

import geopy.distance
# cartesian product so we get all combinations
dfdist = (dfsub.assign(foo=1).merge(dfstat.assign(foo=1), on="foo")
    # calc distance in km between each suburb and each train station
     .assign(km=lambda dfa: dfa.apply(lambda r: 
                                      geopy.distance.geodesic(
                                          (r["LATITUDE"],r["LONGITUDE"]), 
                                          (r["lat"],r["lon"])).km, axis=1))
    # reduce number of columns to make it more digestible
     .loc[:,["postcode","suburb","STOP_ID","STOP_NAME","km"]]
    # sort so shortest distance station from a suburb is first
     .sort_values(["postcode","suburb","km"])
    # good practice
     .reset_index(drop=True)
)
# finally pick out stations nearest to suburb
# this can easily be joined back to source data frames as postcode and STOP_ID have been maintained
dfnearest = dfdist.groupby(["postcode","suburb"])\
    .agg({"STOP_ID":"first","STOP_NAME":"first","km":"first"}).reset_index()

print(dfnearest.to_string(index=False))
dfnearest

Output

postcode              suburb STOP_ID                                         STOP_NAME        km
    3000           MELBOURNE   19973  North Melbourne Railway Station (West Melbourne)  2.564586
    3002      EAST MELBOURNE   19974       Clifton Hill Railway Station (Clifton Hill)  3.177320
    3003      WEST MELBOURNE   19973  North Melbourne Railway Station (West Melbourne)  0.181463
    3005  WORLD TRADE CENTRE   19973  North Melbourne Railway Station (West Melbourne)  1.970909
    3006           SOUTHBANK   19973  North Melbourne Railway Station (West Melbourne)  2.705553
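As a side note on the Cartesian-product step, pandas 1.2 and later can build the same product without the helper foo column by using how="cross". The sketch below uses two tiny, made-up frames, and the names suburbs, stations and pairs are illustrative only.

```python
import pandas as pd

suburbs = pd.DataFrame({"suburb": ["MELBOURNE", "SOUTHBANK"]})
stations = pd.DataFrame({"STOP_ID": ["19970", "19973", "19974"]})

# how="cross" pairs every suburb row with every station row,
# so no shared dummy key column is needed.
pairs = suburbs.merge(stations, how="cross")
print(pairs.shape)  # (6, 2)
```

The memory cost is the same as the foo=1 join, since every combination is still materialised.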

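A way to reduce the number of combinations tested
# pick nearer places: bucket by lat/lon rounded to 1 decimal place, then take all combinations within each bucket
dfdist = (dfsub.assign(foo=1, latr=dfsub["lat"].astype(float).round(1), lonr=dfsub["lon"].astype(float).round(1))
          .merge(dfstat.assign(foo=1, latr=dfstat["LATITUDE"].astype(float).round(1), lonr=dfstat["LONGITUDE"].astype(float).round(1)), 
                 on=["foo","latr","lonr"])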
    # calc distance in km between each suburb and each train station
     .assign(km=lambda dfa: dfa.apply(lambda r: 
                                      geopy.distance.geodesic(
                                          (r["LATITUDE"],r["LONGITUDE"]), 
                                          (r["lat"],r["lon"])).km, axis=1))
    # reduce number of columns to make it more digestible
     .loc[:,["postcode","suburb","STOP_ID","STOP_NAME","km"]]
    # sort so shortest distance station from a suburb is first
     .sort_values(["postcode","suburb","km"])
    # good practice
     .reset_index(drop=True)
)

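Hi, this is great, but when I run it on properties it eats all my memory :P Is there a more efficient way, or a way to process it in batches? - LeCoda
With a large dataset, a pure Cartesian product can cause problems... Do you have addresses and stations in multiple cities? If so, I suggest adding the city to the join key when generating dfdist, i.e. do not generate irrelevant combinations... - Rob Raymond
Good point. This is for one city only, specifically the distance from addresses to train stations (and buses, etc.). I was thinking of batching or something similar? - LeCoda
1
I have just added an idea to the answer. I expect locations that are close to each other to have the same rounded latitude and longitude values. - Rob Raymond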
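For the memory problem raised in these comments, one option that avoids building every combination is a spatial index such as sklearn's BallTree, which another commenter also mentions. The following is a rough sketch under stated assumptions, not part of the answer above: the coordinates are converted to floats, EARTH_RADIUS_KM is an assumed constant, and the output column names are made up for this example.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import BallTree

EARTH_RADIUS_KM = 6371.0088  # assumed mean Earth radius

stations = pd.DataFrame({
    "STOP_ID": ["19970", "19971", "19972", "19973", "19974"],
    "LATITUDE": [-37.781193, -37.788140, -37.794267, -37.807419, -37.788657],
    "LONGITUDE": [144.952301, 144.939323, 144.936166, 144.942570, 144.995417],
})
suburbs = pd.DataFrame({
    "suburb": ["MELBOURNE", "EAST MELBOURNE", "WEST MELBOURNE",
               "WORLD TRADE CENTRE", "SOUTHBANK"],
    "lat": [-37.814563, -37.816640, -37.806255, -37.822262, -37.823258],
    "lon": [144.970267, 144.987811, 144.941123, 144.954856, 144.965926],
})

# The haversine metric expects [latitude, longitude] pairs in radians.
tree = BallTree(np.radians(stations[["LATITUDE", "LONGITUDE"]].to_numpy()),
                metric="haversine")

# query returns distances (radians on the unit sphere) and row positions.
dist_rad, idx = tree.query(np.radians(suburbs[["lat", "lon"]].to_numpy()), k=1)

# Assign by position so the suburb index does not matter.
suburbs["STOP_ID"] = stations["STOP_ID"].to_numpy()[idx[:, 0]]
suburbs["km"] = dist_rad[:, 0] * EARTH_RADIUS_KM
print(suburbs)
```

The tree is built once over the stations, and each suburb is then looked up individually, so no suburb-by-station table is ever held in memory.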

6

Try this

import pandas as pd
def ClosestStop(r):
    # Cartesian distance: square root of (x2-x1)^2 + (y2-y1)^2
    distances = ((r['lat']-StationDf['LATITUDE'])**2 + (r['lon']-StationDf['LONGITUDE'])**2)**0.5
    
    # Stop with minimum Distance from the Suburb
    closestStationId = distances[distances == distances.min()].index.to_list()[0]
    return StationDf.loc[closestStationId, ['STOP_ID', 'STOP_NAME']]

StationDf = pd.read_excel("StationData.xlsx")
SuburbDf = pd.read_excel("SuburbData.xlsx")

SuburbDf[['ClosestStopId', 'ClosestStopName']] = SuburbDf.apply(ClosestStop, axis=1)
print(SuburbDf)

1
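Cartesian distance is not suitable for computing distances between GPS coordinates. See the haversine formula: https://en.m.wikipedia.org/wiki/Haversine_formula - SoufianeK
@SoufianeK Yes, Cartesian distance does not work when latitude and longitude vary by several degrees (i.e. global distances). But the goal here is to find the railway station closest to a suburb, over an area that spans only about one degree of latitude and longitude. Also, the magnitude and unit of the distance do not matter here, only how the distances compare. So Cartesian distance is good enough for this purpose. Thanks for sharing the link, I work in GIS mapping and it will be very helpful. - Kuldip Chaudhari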
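One caveat to the exchange above: away from the equator a degree of longitude is shorter than a degree of latitude (by a factor of roughly cos(latitude), about 0.79 near Melbourne), so an unscaled distance in degrees can occasionally rank two candidates differently from the great-circle distance. The sketch below illustrates this with two made-up points. The origin, the candidate coordinates and the helper haversine_km are all hypothetical and are not taken from the thread's data.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0088  # assumed mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km; inputs in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


origin_lat, origin_lon = -37.80, 144.95
# Hypothetical candidates: A is 0.010 degrees east, B is 0.009 degrees north.
cand = np.array([[-37.800, 144.960],
                 [-37.791, 144.950]])
dlat = cand[:, 0] - origin_lat
dlon = cand[:, 1] - origin_lon

plain_deg = np.hypot(dlat, dlon)  # unscaled distance in degrees
scaled_deg = np.hypot(dlat, dlon * np.cos(np.radians(origin_lat)))  # longitude scaled
true_km = haversine_km(origin_lat, origin_lon, cand[:, 0], cand[:, 1])

print("unscaled degrees picks:", "AB"[plain_deg.argmin()])
print("scaled degrees picks:  ", "AB"[scaled_deg.argmin()])
print("haversine picks:       ", "AB"[true_km.argmin()])
```

Scaling the longitude difference by the cosine of the latitude keeps the cheap degree-based comparison while matching the great-circle ranking in this example.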

5

You can do this with sklearn.neighbors.NearestNeighbors and the haversine distance.

import pandas as pd
dfstat = pd.DataFrame({'STOP_ID': ['19970', '19971', '19972', '19973', '19974'],
                       'STOP_NAME': ['Royal Park Railway Station (Parkville)',  'Flemington Bridge Railway Station (North Melbo...',  'Macaulay Railway Station (North Melbourne)',  'North Melbourne Railway Station (West Melbourne)',  'Clifton Hill Railway Station (Clifton Hill)'],
                       'LATITUDE': ['-37.781193', '-37.788140',  '-37.794267',  '-37.807419',  '-37.788657'],
                       'LONGITUDE': ['144.952301', '144.939323', '144.936166',  '144.942570',  '144.995417'],
                       'TICKETZONE': ['1', '1', '1', '1', '1'], 
                       'ROUTEUSSP': ['Upfield',  'Upfield',  'Upfield',  'Flemington,Sunbury,Upfield,Werribee,Williamsto...',  'Mernda,Hurstbridge'],
                       'geometry': ['POINT (144.95230 -37.78119)',  'POINT (144.93932 -37.78814)',  'POINT (144.93617 -37.79427)',  'POINT (144.94257 -37.80742)',  'POINT (144.99542 -37.78866)']})
dfsub = pd.DataFrame({'id': ['4901', '4902', '4903', '4904', '4905'],
                      'postcode': ['3000', '3002', '3003', '3005', '3006'],
                      'suburb': ['MELBOURNE',  'EAST MELBOURNE',  'WEST MELBOURNE',  'WORLD TRADE CENTRE',  'SOUTHBANK'],
                      'state': ['VIC', 'VIC', 'VIC', 'VIC', 'VIC'],
                      'lat': ['-37.814563', '-37.816640', '-37.806255', '-37.822262', '-37.823258'],
                      'lon': ['144.970267', '144.987811', '144.941123', '144.954856', '144.965926']})

Let's start by finding the point in the dataframe that is closest to some arbitrary point, say -37.814563, 144.970267

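from sklearn.neighbors import NearestNeighbors
# fit on the station coordinates, then query a single (lat, lon) point
NN = NearestNeighbors(n_neighbors=1, metric='haversine')
NN.fit(dfstat[['LATITUDE', 'LONGITUDE']])
NN.kneighbors([[-37.814563, 144.970267]])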

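The output, (array([[2.55952637]]), array([[3]])), is the distance to the closest point in the dataframe and its index. In sklearn, the haversine distance is expressed in radians, and the built-in metric expects its inputs as [latitude, longitude] in radians as well. To get kilometres, you can instead pass the haversine function from the haversine package as the metric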
from haversine import haversine
NN = NearestNeighbors(n_neighbors=1, metric=haversine)
NN.fit(dfstat[['LATITUDE', 'LONGITUDE']])
NN.kneighbors([[-37.814563, 144.970267]])

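The output (array([[2.55952637]]), array([[3]])) now gives the distance in kilometres.
You can now apply this to every point in the dataframe and get the index of the closest station.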
# query once; kneighbors returns (distances, indices), each of shape (n_rows, 1)
distances, indices = NN.kneighbors(dfsub[['lat', 'lon']])
# assign by position (NumPy arrays) so the result does not depend on dfsub's index
dfsub['closest_station'] = dfstat['STOP_NAME'].to_numpy()[indices[:, 0]]
dfsub['closest_station_distances'] = distances[:, 0]
print(dfsub)
     id postcode              suburb state         lat         lon                                   closest_station  closest_station_distances
0  4901     3000           MELBOURNE   VIC  -37.814563  144.970267  North Melbourne Railway Station (West Melbourne)                   2.559526
1  4902     3002      EAST MELBOURNE   VIC  -37.816640  144.987811       Clifton Hill Railway Station (Clifton Hill)                   3.182521
2  4903     3003      WEST MELBOURNE   VIC  -37.806255  144.941123  North Melbourne Railway Station (West Melbourne)                   0.181419
3  4904     3005  WORLD TRADE CENTRE   VIC  -37.822262  144.954856  North Melbourne Railway Station (West Melbourne)                   1.972010
4  4905     3006           SOUTHBANK   VIC  -37.823258  144.965926  North Melbourne Railway Station (West Melbourne)                   2.703926
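If you would rather keep sklearn's built-in 'haversine' metric, the radians point from this answer can be handled by converting the inputs with np.radians and multiplying the returned distance by an Earth radius. This is a small sketch under those assumptions. The constant EARTH_RADIUS_KM and the array name station_deg are choices made for this example.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

EARTH_RADIUS_KM = 6371.0088  # assumed mean Earth radius

# Station coordinates as floats, in degrees: [latitude, longitude]
station_deg = np.array([
    [-37.781193, 144.952301],
    [-37.788140, 144.939323],
    [-37.794267, 144.936166],
    [-37.807419, 144.942570],
    [-37.788657, 144.995417],
])

nn = NearestNeighbors(n_neighbors=1, metric="haversine")
nn.fit(np.radians(station_deg))  # the built-in metric works in radians

# The query point must be converted to radians too.
dist_rad, idx = nn.kneighbors(np.radians([[-37.814563, 144.970267]]))

# The result is an angle in radians; scale it by the radius to get km.
km = dist_rad[0, 0] * EARTH_RADIUS_KM
print(idx[0, 0], round(km, 2))
```

Feeding degrees straight into the built-in metric still returns numbers, but they are not meaningful distances, which is one way to end up with values that look off by orders of magnitude.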

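I have just added a distance column and some imports, and fixed a missing bracket. - SoufianeK
I copied the code. Is it working correctly on my example? The distances look like the wrong order of magnitude? - LeCoda
@MichaelHolborn That is surprising, because the haversine library is the reference for computing distances and sklearn is the reference for finding nearest neighbours in machine learning tasks. I use both in production for similar tasks. Can you give me a code sample and its results so I can find out what the problem is? - SoufianeK
Sure. Maybe it is a data type problem? I will provide a link above. - LeCoda
Your code works perfectly, but it cannot handle the dataset I have. The latitude and longitude are of type obj(64). - LeCoda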
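If the coordinate columns arrive as object (string) dtype, as in the last comment, converting them to floats before fitting is one likely fix. The sketch below uses pd.to_numeric on a made-up frame. The "n/a" entry and the names df, cols and clean are invented for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "lat": ["-37.814563", "-37.816640", "n/a"],  # "n/a" is a made-up bad value
    "lon": ["144.970267", "144.987811", "144.941123"],
})
cols = ["lat", "lon"]

# errors="coerce" turns unparseable entries into NaN instead of raising.
df[cols] = df[cols].apply(pd.to_numeric, errors="coerce")

# Drop rows whose coordinates could not be parsed before any distance work.
clean = df.dropna(subset=cols)
print(df.dtypes)
print(len(clean))
```

Checking df.dtypes after the conversion is a quick way to confirm that both columns are float64 before passing them to a nearest-neighbour search.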
