我会尝试内部连接两个数据集:其中一个拥有50,000个观测值的df1
看起来像这样:
Name | Line.1 | Line.2 | Town | County | Postcode
-------------------|------------------|------------|------------|--------------|----------
ACME Inc | 63 Long Street | | Fakeington | Lincolnshire | PA4 8QU
BETA LTD | 91a | Main Drove | Cloud City | Something | BN1 6LD
The Giga | 344 Lorem Street | | Ipsom | Dolor | G2 8LY
df2
包含 500000 个观测,看起来像这样:
Name | AddressLine1 | AddressLine2 | AddressLine3 | AddressLine4 | Postcode | RatingValue
-------------------|----------------|------------------|--------------|--------------|----------|-------------
ACME | | 63 Long Street | Fakeington | Lincolnshire | PA4 8QU | 1
Random Company | | Rose Ave | Fakeington | | AB2 51GL | 5
BETA Limited | Business House | 91a Main Drove | Something | | BN1 6LD | 3
Giga Incorporated | | 344 Lorem Street | Ipsum | Dolor | G2 8LY | 5
我想得到类似于df_final
这样的东西。
Name | Postcode | RatingValue
-------------------|----------|-------------
ACME Inc | PA4 8QU | 1
BETA LTD | BN1 6LD | 3
Giga Incorporated | G2 8LY | 5
这些是一对一的匹配,df1
中的所有值都应该存在于df2
中。 Postcode
完全匹配,而地址则拆分成多行,并没有规律的模式,因此我认为最好的方法是通过Name
进行匹配。
我尝试了fuzzyjoin
包,但是出现了Error: cannot allocate vector of size 120.6 Gb
的错误,所以我想我必须使用另一种适用于更大数据集的方法。
你有什么关于如何处理这个问题的最佳方法的想法吗?
df1 <- data.frame(
stringsAsFactors = FALSE,
Name = c("ACME Inc", "BETA LTD", "Giga Incorporated"),
Line.1 = c("63 Long Street", "91a", "344 Lorem Street"),
Line.2 = c(NA, "Main Drove", NA),
Town = c("Fakeington", "Cloud City", "Ipsom"),
County = c("Lincolnshire", "Something", "Dolor"),
Postcode = c("PA4 8QU", "BN1 6LD", "G2 8LY")
)
df2 <- data.frame(
stringsAsFactors = FALSE,
Name = c("ACME", "Random Company","BETA Limited","Giga Incorporated"),
AddressLine1 = c(NA, NA, "Business House", NA),
AddressLine2 = c("63 Long Street", "Rose Ave","91a Main Drove","344 Lorem Street"),
AddressLine3 = c("Fakeington", "Fakeington", "Something", "Ipsum"),
AddressLine4 = c("Lincolnshire", NA, NA, "Dolor"),
Postcode = c("PA4 8QU", "AB2 51GL", "BN1 6LD", "G2 8LY"),
RatingValue = c(1L, 5L, 3L, 5L)
)
df1
有50K行,df2
有500K行,50000*500000
等于2.5e+10
。 - Rui Barradas