How to merge two tables based on row and column in R?

3
df1 <- data.frame(MLID=c('992','992','BJR'),
              Position=c('N0','N1','N1'),
              Weight=c(0.125,0.58,0.69))


df2 <- data.frame(MLID=c('992','992','992','992',
                     'BJR','BJR','BJR','BJR'),
              Weight=c(0,0.251,0.501,1.001,
                       0,0.251,0.501,1.001),
              N0=c(2.80,4.05,4.05,4.05,
                   4.05,4.05,4.05,4.05),
              N1=c(3.47,4.73,4.95,5.15,
                   4.73,7.73,4.95,5.15) )

What I want is to merge these two tables according to the following rules:

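  1. Merge on MLID
  2. Look at the position (N0 or N1)
  3. Find the range the weight falls in, similar to an approximate-match VLOOKUP in Excel. For example, 2.8 is the charge for a weight in (0, 0.250) on 992 N0, 4.05 is the charge for a weight in (0.251, 0.500) on 992 N0, 3.47 is the charge for a weight in (0, 0.250) on 992 N1, and so on.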
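
To make rule 3 concrete, here is a minimal base R sketch of an approximate lookup for a single MLID/Position pair. It assumes the `Weight` column of `df2` holds the lower bound of each band; the names `breaks`, `charges`, `weights`, and `band` are illustrative and not part of the original data.

```r
# Lower bounds of the weight bands and their charges (992 / N1 column of df2)
breaks  <- c(0, 0.251, 0.501, 1.001)
charges <- c(3.47, 4.73, 4.95, 5.15)

# findInterval() returns, for each weight, the index of the last break <= weight
weights <- c(0.125, 0.58, 0.88)
band <- findInterval(weights, breaks)

charges[band]
# [1] 3.47 4.95 4.95
```

A weight below the first break would get index 0 and silently drop out of `charges[band]`, so a real implementation would need to handle that case explicitly.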

So the final output should be:

MLID  Position  Weight  Charge
992      N0     0.125    2.8
992      N1     0.580    4.95
BJR      N1     0.690    4.95

Can this be done in R, ideally with the dplyr package?

1
Please try: library(data.table);setnames(melt(setDT(df2), measure = c("NO", "N1"), variable.name = "Position", value.name = "Charge"), "Weight", "wt")[df1, on = .(MLID, Position, wt <= Weight), mult = "last"] - akrun
4 Answers

4

One can use the rolling join option of data.table. First, convert df2 to long format with melt, then join df1 and df2.

library(data.table)

setDT(df1, key = c("MLID", "Position","Weight") )

# setDT() first so that data.table's melt method is used rather than a data.frame fallback
df2 <- melt(setDT(df2), id.vars = c("MLID", "Weight"),
            variable.name = "Position", value.name = "Charge")

setDT(df2, key = c("MLID", "Position","Weight"))

df2[df1, roll = "nearest"]
#    MLID Weight Position Charge
# 1:  992  0.580       N1   4.95
# 2:  992  0.125       NO   2.80
# 3:  BJR  0.690       N1   4.95

Option #2: A tidyverse-based approach:

library(tidyverse)
df2 %>% gather(Position, Charge, -MLID, -Weight) %>%
  right_join(df1, by=c("MLID", "Position")) %>%
  filter(Weight.x <= Weight.y) %>%
  group_by(MLID, Position) %>%
  arrange(Weight.y-Weight.x) %>% 
  slice(1) %>%
  select(MLID, Weight = Weight.y, Position, Charge)

# # A tibble: 3 x 4
# # Groups: MLID, Position [3]
#   MLID  Weight Position Charge
#   <chr>  <dbl> <chr>     <dbl>
# 1 992    0.580 N1         4.95
# 2 992    0.125 NO         2.80
# 3 BJR    0.690 N1         4.95
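
As a possible alternative to the filter/arrange/slice steps above, newer dplyr versions can express "last band at or below the weight" directly in the join. The sketch below assumes dplyr 1.1.0 or later (for `join_by()` and `closest()`); the long-format table `long` and its `wt` column are built by hand here purely for illustration.

```r
library(dplyr)

df1 <- data.frame(MLID = c("992", "992", "BJR"),
                  Position = c("NO", "N1", "N1"),
                  Weight = c(0.125, 0.58, 0.69), stringsAsFactors = FALSE)

df2 <- data.frame(MLID = rep(c("992", "BJR"), each = 4),
                  Weight = rep(c(0, 0.251, 0.501, 1.001), times = 2),
                  NO = c(2.80, 4.05, 4.05, 4.05, 4.05, 4.05, 4.05, 4.05),
                  N1 = c(3.47, 4.73, 4.95, 5.15, 4.73, 7.73, 4.95, 5.15),
                  stringsAsFactors = FALSE)

# Long format: one row per MLID / Position / band lower bound (wt)
long <- data.frame(MLID = rep(df2$MLID, times = 2),
                   wt = rep(df2$Weight, times = 2),
                   Position = rep(c("NO", "N1"), each = nrow(df2)),
                   Charge = c(df2$NO, df2$N1), stringsAsFactors = FALSE)

# closest(Weight >= wt) keeps, per df1 row, the largest wt not exceeding Weight
res <- left_join(df1, long,
                 by = join_by(MLID, Position, closest(Weight >= wt)))
res <- select(res, MLID, Position, Weight, Charge)
res$Charge
# [1] 2.80 4.95 4.95
```

Unlike sorting by distance after the join, this never materialises the non-matching bands, which may matter for larger rate tables.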

Data:

The OP's data, slightly modified: stringsAsFactors = FALSE is added to data.frame to avoid unnecessary warnings.

df1 <- data.frame(MLID=c('992','992','BJR'),
                  Position=c('NO','N1','N1'),
                  Weight=c(0.125,0.58,0.69), stringsAsFactors = FALSE)


df2 <- data.frame(MLID=c('992','992','992','992',
                         'BJR','BJR','BJR','BJR'),
                  Weight=c(0,0.251,0.501,1.001,
                           0,0.251,0.501,1.001),
                  NO=c(2.80,4.05,4.05,4.05,
                       4.05,4.05,4.05,4.05),
                  N1=c(3.47,4.73,4.95,5.15,
                       4.73,7.73,4.95,5.15), stringsAsFactors = FALSE )

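Thank you for the answer. I have never used data.table before. The "rolling join" here seems to just pick the nearest value. Let me clarify: the charge on the same row as a weight, e.g. 4.95 for N1 BJR Weight 0.501, means that weights in the range (0.501, 1.001) on N1 BJR are charged 4.95. With your answer, if the last weight in df1 is changed from 0.69 to 0.88, your output jumps to the next charge band instead of staying in the (0.501, 1.001) band. - Frank Zhang
@DingruiZhang, there are many options for roll. I used "nearest", but you can try roll = TRUE, which carries the last value forward, or -Inf, which carries the next value backward. - MKR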
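
The difference raised in these comments can be seen on a single band table. This is a small sketch using only the BJR / N1 rows and the 0.88 weight mentioned above; the object names `bands` and `query` are made up for the illustration.

```r
library(data.table)

# Charge bands for BJR / N1, keyed so that Weight is the last (rolling) key column
bands <- data.table(MLID = "BJR", Position = "N1",
                    Weight = c(0, 0.251, 0.501, 1.001),
                    Charge = c(4.73, 7.73, 4.95, 5.15))
setkey(bands, MLID, Position, Weight)

query <- data.table(MLID = "BJR", Position = "N1", Weight = 0.88)

# roll = "nearest": 1.001 is closer to 0.88 than 0.501, so the next band is picked
nearest_charge <- bands[query, roll = "nearest"]$Charge
# roll = TRUE: the last lower bound <= 0.88 (0.501) is carried forward
locf_charge <- bands[query, roll = TRUE]$Charge

c(nearest = nearest_charge, locf = locf_charge)
# nearest    locf 
#    5.15    4.95 
```

With lower-bound bands like these, `roll = TRUE` appears to match the behaviour the question asks for.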

2
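We can use a non-equi join with data.table. melt the second dataset into 'long' format, then join it with the first dataset on 'MLID' and 'Position', with a non-equi comparison on the 'Weight' column, and assign the last matching value of 'Charge' to create that column in 'df1'.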
library(data.table)
setDT(df1)[setnames(melt(setDT(df2), measure = c("NO", "N1"), 
       variable.name = "Position", value.name = "Charge"), "Weight", "wt"), 
      Charge := Charge, on = .(MLID, Position, Weight > wt), mult = "last"] 

df1
#   MLID Position Weight Charge
#1:  992       NO  0.125   2.80
#2:  992       N1  0.580   4.95
#3:  BJR       N1  0.690   4.95

I couldn't reproduce this example's result with melt(), but I was able to convert to long format with gather(Position, Charge, -MLID, -Weight), after which the code worked: df2 <- df2 %>% gather(Position, Charge, -MLID, -Weight); setDT(df1)[setnames(df2,"Weight", "wt"), Charge := Charge, on = .(MLID, Position, Weight > wt), mult = "last"]; df1. I just don't understand why "wt" is used. - TheSciGuy
@TheSciGuy This was posted in 2018. Maybe something changed between versions. - akrun

1
Here is a base R version:
outdf <- merge(df1, df2, by = "MLID")
outdf$dist <- abs(outdf$Weight.x - outdf$Weight.y)
ting <- aggregate(dist ~ MLID + Position, FUN = function(x) min(x), data = outdf)
outdf2 <- merge(outdf, ting, by.x = c("MLID", "Position", "dist"))
outdf2$charge <- ifelse(outdf2$Position == "N1", outdf2$N1, outdf2$NO)
outdf2 <- outdf2[,c("MLID", "Position", "Weight.x", "charge")]
outdf2
# MLID Position Weight.x charge
# 1  992       N1    0.580   4.95
# 2  992       NO    0.125   2.80
# 3  BJR       N1    0.690   4.95

0
joined_df <- merge(data1, data2, by.x = "column key name in data1", 
         by.y = "column key name in data2", all.x = TRUE, all.y = FALSE)
