Find the best row in a data frame

Question

5

I have a data set with some locations:

ex <- data.frame(lat = c(55, 60, 40), long = c(6, 6, 10))

Then I have climate data.

clim <- structure(list(lat = c(55.047, 55.097, 55.146, 55.004, 55.054, 
55.103, 55.153, 55.202, 55.252, 55.301), long = c(6.029, 6.0171, 
6.0051, 6.1269, 6.1151, 6.1032, 6.0913, 6.0794, 6.0675, 6.0555
), alt = c(0.033335, 0.033335, 0.033335, 0.033335, 0.033335, 
0.033335, 0.033335, 0.033335, 0.033335, 0.033335), x = c(0, 0, 
0, 0, 0, 0, 0, 0, 0, 0), y = c(1914, 1907.3, 1901.8, 1921.1, 
1914.1, 1908.3, 1902.4, 1896, 1889.8, 1884)), row.names = c(NA, 
10L), class = "data.frame", .Names = c("lat", "long", "alt", 
"x", "y"))

      lat   long      alt x      y
1  55.047 6.0290 0.033335 0 1914.0
2  55.097 6.0171 0.033335 0 1907.3
3  55.146 6.0051 0.033335 0 1901.8
4  55.004 6.1269 0.033335 0 1921.1
5  55.054 6.1151 0.033335 0 1914.1
6  55.103 6.1032 0.033335 0 1908.3
7  55.153 6.0913 0.033335 0 1902.4
8  55.202 6.0794 0.033335 0 1896.0
9  55.252 6.0675 0.033335 0 1889.8
10 55.301 6.0555 0.033335 0 1884.0

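What I want to do is "merge" the two data sets so that ex also contains the climate data. The lat values in ex differ from those in clim (and the same is true for long), so the two cannot be merged directly. For each row of ex, I need to find the best match: the closest point in clim, taking both lat and long into account.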

The expected output for the example is:

  lat long      alt x      y
1  55    6 0.033335 0 1914.0
2  60    6 0.033335 0 1884.0
3  40   10 0.033335 0 1921.1

- Mateusz1981

There was a calculation error; updated. - Mateusz1981

Possible duplicate of Geographic / geospatial distance between 2 lists of lat/lon points (coordinates). - Aramis7d

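I managed to get @andrew_reece's answer working and have marked it as the answer. Once I get the other solutions working, I will reconsider my choice. Many thanks for all the comments. - Mateusz1981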

2 Answers

1

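For each point in ex, you can find the index of the row in clim whose summed absolute difference in lat and long is smallest, and then use that index to add the clim columns to ex.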

library(tidyverse)

ex %>%
  group_by(lat, long) %>%
  summarise(closest_clim = which.min(abs(lat - clim$lat) + 
                                       abs(long - clim$long))) %>%
  mutate(alt = clim$alt[closest_clim],
         x = clim$x[closest_clim],
         y = clim$y[closest_clim])

# A tibble: 3 x 6
# Groups:   lat [3]
    lat  long closest_clim    alt     x     y
  <dbl> <dbl>        <int>  <dbl> <dbl> <dbl>
1   40.   10.            4 0.0333    0. 1921.
2   55.    6.            1 0.0333    0. 1914.
3   60.    6.           10 0.0333    0. 1884.

- andrew_reece

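When I extend the example to the whole data set, should I be concerned about "Warning: In lat - clim$lat : longer object length is not a multiple of shorter object length"? - Mateusz1981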

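@Mateusz1981, those warnings are most likely caused by duplicate points in ex, which make the lat and long groups longer than 1. Since the rows in a group are all the same point, you can get rid of the warning by using first(lat) and first(long) in the summarise expression. - janusvm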
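
The suggestion in the comment above could look roughly like the following. This is only a sketch: ex_dup and clim_small are made-up toy data (not the asker's full data set), and it assumes dplyr 1.0 or later for the .groups argument.

```r
library(dplyr)

# Hypothetical toy data: ex_dup contains the point (55, 6) twice
ex_dup <- data.frame(lat = c(55, 55, 60), long = c(6, 6, 6))
clim_small <- data.frame(lat  = c(55.047, 55.301),
                         long = c(6.029, 6.0555),
                         y    = c(1914, 1884))

res <- ex_dup %>%
  group_by(lat, long) %>%
  # first() reduces each group to a single value, so the subtraction
  # against clim_small$lat / clim_small$long no longer recycles a
  # length > 1 vector
  summarise(closest_clim = which.min(abs(first(lat) - clim_small$lat) +
                                       abs(first(long) - clim_small$long)),
            .groups = "drop") %>%
  mutate(y = clim_small$y[closest_clim])

res
```

Note that this returns one row per distinct (lat, long) pair, so duplicated points in ex are collapsed into a single row.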

- janusvm · Accepted Answer

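The function dist can be used to compute the Euclidean (or other) distance between all points in a matrix or data frame, so the points in clim closest to those in ex can be found as follows: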

# Distance between all points in ex and clim combined,
# with distances between points in same matrix filtered out
n <- nrow(ex)
tmp <- as.matrix(dist(rbind(ex, clim[, 1:2])))[-(1:n), 1:n]

# Indices in clim corresponding to the closest points to those in ex
idx <- apply(tmp, 2, which.min)

# Points from ex with additional info from closest points in clim
cbind(ex, clim[idx, -(1:2)])
#>    lat long      alt x      y
#> 1   55    6 0.033335 0 1914.0
#> 10  60    6 0.033335 0 1884.0
#> 4   40   10 0.033335 0 1921.1
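
Because dist works on the combined rows, it also computes the ex-to-ex and clim-to-clim distances that are then filtered out. As a possible variation (not part of the original answer), the sketch below computes only the cross distances with outer. The names d2 and idx2 are illustrative, and clim is rebuilt here with just the columns needed so the snippet runs on its own.

```r
ex <- data.frame(lat = c(55, 60, 40), long = c(6, 6, 10))
clim <- data.frame(
  lat  = c(55.047, 55.097, 55.146, 55.004, 55.054,
           55.103, 55.153, 55.202, 55.252, 55.301),
  long = c(6.029, 6.0171, 6.0051, 6.1269, 6.1151,
           6.1032, 6.0913, 6.0794, 6.0675, 6.0555),
  y    = c(1914, 1907.3, 1901.8, 1921.1, 1914.1,
           1908.3, 1902.4, 1896, 1889.8, 1884)
)

# Squared Euclidean distances in degrees:
# one row per point in ex, one column per point in clim
d2 <- outer(ex$lat, clim$lat, "-")^2 + outer(ex$long, clim$long, "-")^2

# For each row (point in ex), the column index of the nearest clim point
idx2 <- apply(d2, 1, which.min)

# Attach the climate value of the nearest point
out <- cbind(ex, y = clim$y[idx2])
out
```

Squaring does not change which point is nearest, so the square root is skipped. Like the dist version, this treats degrees of latitude and longitude as plain Euclidean coordinates, which is an approximation rather than a true geographic distance.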