我有一个庞大的空间数据集(1200万行),其几何结构是地图上的点。对于数据集中的每一行,我想找到距离该点500米内的所有点。
在r中,使用sf,我一直在尝试通过并行循环遍历每一行并运行st_buffer和st_intersects,然后将结果保存为键值格式的列表(键为原始点,值为相邻点)来实现此目标。
问题在于数据集太大了。即使并行化到60个核心以上,操作也太长时间(>1周且通常崩溃)。
有什么替代这种蛮力方法的方法吗?是否可以使用sf建立索引?也许将操作推到外部数据库?
Reprex:
library(sf)
library(tidyverse)
library(parallel)
library(foreach)
# example data, convert to decimal:
nc <- st_read(system.file("shape/nc.shp", package="sf")) %>% st_transform(32618)
# expand the data a a bit to make the example more interesting:
nc <- rbind(nc,nc,nc)
nc <- nc %>% mutate(Id = row_number())
## can run in parallel if desired:
# num_cores <- parallel::detectCores()-2
# cl <- makeSOCKcluster(num_cores)
# registerDoSNOW(cl)
# or just run in sequence:
registerDoSEQ()
neighbors <- foreach(ii = 1:nrow(nc)
, .verbose = FALSE
, .errorhandling = "pass") %dopar% {
l = 500 # 500 meters
# isolate the row as the origin point:
row_interest <- filter(nc, row_number()==ii)
# create the buffer:
buffer <- row_interest %>% st_buffer(dist = l)
# extract the row numbers of the neighbors
comps_idx <- suppressMessages(st_intersects(buffer, nc))[[1]]
# get all the neighbors:
comps <- nc %>% filter(row_number() %in% comps_idx)
# remove the geometry:
comps <- comps %>% st_set_geometry(NULL)
# flow control in case there are no neibors:
if(nrow(comps)>0) {
comps$Origin_Key <- row_interest$Id
} else {
comps <- data_frame("lat" = NA_integer_,"lon" = NA_integer_, "bbl" = row_interest$bbl)
comps$Origin_Key <- row_interest$Id
}
return(comps)
}
closeAllConnections()
length(neighbors)==nrow(nc)
[1] TRUE