我有两个数据框需要连接。其中,我总是有一个相互的“主”列进行连接,而我有时也可能有另一列数据要进行连接,除了主列之外。
如何指定一个可能的列进行连接?
示例
我用从mtcars
派生出来的两个数据集演示我的问题。两者都有一个“主”列(cars
)需要连接,有时可能会在一个或两个数据集中具有另一个相同的列(some_letters
)。
library(tidyverse)
create_df <- function(columns_to_include) {
mtcars %>%
rownames_to_column("cars") %>%
select(cars, {{ columns_to_include }}) %>%
slice_sample(n = 15) %>%
{if (sample(c(TRUE, FALSE), size = 1)) add_column(., some_letters = letters[1:15]) else .}
}
# both dataframes have "some_letters"
set.seed(123)
df_a1 <- create_df(carb)
df_a2 <- create_df(gear)
scenario_a <- inner_join(df_a1, df_a2, by = c("cars", "some_letters"))
scenario_a
#> cars carb some_letters gear
#> 1 Ford Pantera L 4 l 5
# neither dataframe has "some_letters"
set.seed(111)
df_b1 <- create_df(carb)
df_b2 <- create_df(gear)
scenario_b <- inner_join(df_b1, df_b2, by = c("cars", "some_letters"))
#> Error: Join columns must be present in data.
#> x Problem with `some_letters`.
# one dataframe has "some_letters" but the other doesn't
set.seed(737)
df_c1 <- create_df(carb)
df_c2 <- create_df(gear)
scenario_c <- inner_join(df_c1, df_c2, by = c("cars", "some_letters"))
#> Error: Join columns must be present in data.
#> x Problem with `some_letters`.
本文档由 reprex package (v0.3.0) 于2021年02月20日创建
我们可以看到在scenario_a
中,联接成功了,因为df_a1
和df_a2
都包含some_letters
。然而,在scenario_b
中我们发现联接失败了,因为some_letters
不存在(任何数据集中)。类似地,scenario_c
显示了some_letters
出现在一个数据集中但另一个数据集中没有,因此联接失败了。
在联接数据时,我是否可以指定some_letters
可能出现,但不保证出现,这样当它在两个数据中同时出现时,它将成为一个额外的“联接-by”列,否则它将从by
参数中忽略?
期望输出
inner_join(df_b1, df_b2, by = c("cars", "some_letters"))
# as if we joined by `cars` only:
## cars carb gear
## 1 Porsche 914-2 2 5
## 2 Cadillac Fleetwood 4 3
## 3 Pontiac Firebird 2 3
## 4 Datsun 710 1 4
## 5 Merc 240D 2 4
## 6 Chrysler Imperial 4 3
## 7 Hornet 4 Drive 1 3
## 8 Camaro Z28 4 3