I have a sample dataset in which one column looks like this:
Candy
Sanitizer
Candy
Water
Cake
Candy
Ice Cream
Gum
Candy
Coffee
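What I want to do is replace it with just two factor levels, "Candy" and "Non-Candy". I can do this with Python/Pandas, but I can't seem to find a dplyr-based solution. Thanks!
In dplyr and tidyr,
dat %>%
  mutate(var = replace(var, var != "Candy", "Not Candy"))
is much faster than the ifelse approaches. The initial data frame can be created like this: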
library(dplyr)
# build the one-column example data; the column is named var
dat <- tibble(var = c("Candy", "Sanitizer", "Candy", "Water", "Cake", "Candy", "Ice Cream", "Gum", "Candy", "Coffee"))
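The question asks for a two-level factor, while replace() on a character column returns a character column. A minimal sketch of the full round trip could look like the following. It assumes var is stored as character (on a factor column, replacing with a value that is not an existing level would yield NA), and the name res is only illustrative.

```r
library(dplyr)

# Sample data from the question
dat <- tibble(var = c("Candy", "Sanitizer", "Candy", "Water", "Cake",
                      "Candy", "Ice Cream", "Gum", "Candy", "Coffee"))

# replace() keeps "Candy" and overwrites every other value;
# factor() then turns the character column into a two-level factor
res <- dat %>%
  mutate(var = factor(replace(var, var != "Candy", "Not Candy")))

levels(res$var)  # "Candy" "Not Candy"
table(res$var)   # Candy: 4, Not Candy: 6
```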
Another solution using dplyr and case_when:
dat %>%
mutate(var = case_when(var == 'Candy' ~ 'Candy',
TRUE ~ 'Non-Candy'))
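The syntax for case_when is condition ~ value to substitute. See ?case_when for the documentation.
It is probably less efficient than the replace solution, but it has the advantage that several replacements can be made in a single command while staying easy to read, e.g. replacing to produce three levels: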
dat %>%
mutate(var = case_when(var == 'Candy' ~ 'Candy',
var == 'Water' ~ 'Water',
TRUE ~ 'Neither-Water-Nor-Candy'))
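case_when returns a character vector here, so if actual factor levels are wanted the result still needs to be wrapped in factor(). The sketch below is one way to do that under the assumption that a fixed level order is desired; the names lvls and res are illustrative.

```r
library(dplyr)

dat <- tibble(var = c("Candy", "Sanitizer", "Candy", "Water", "Cake",
                      "Candy", "Ice Cream", "Gum", "Candy", "Coffee"))

# Desired level order (an assumption for this example)
lvls <- c("Candy", "Water", "Neither-Water-Nor-Candy")

res <- dat %>%
  mutate(var = case_when(var == "Candy" ~ "Candy",
                         var == "Water" ~ "Water",
                         TRUE ~ "Neither-Water-Nor-Candy"),
         # convert to a factor with the explicit level order
         var = factor(var, levels = lvls))

table(res$var)  # Candy: 4, Water: 1, Neither-Water-Nor-Candy: 5
```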
Assuming your data frame is dat and the column is named var:
dat = dat %>% mutate(candy.flag = factor(ifelse(var == "Candy", "Candy", "Non-Candy")))
No need for dplyr. Assuming var is already stored as a factor:
non_c <- setdiff(levels(dat$var), "Candy")
levels(dat$var) <- list(Candy = "Candy", "Non-Candy" = non_c)
See ?levels.
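To make the list form of the levels replacement concrete, here is a small base R sketch on a toy factor (the vector f is made up for illustration). Each list name becomes a new level, and the values under that name are the old levels merged into it.

```r
# Toy factor with three levels: "Cake", "Candy", "Water"
f <- factor(c("Candy", "Water", "Cake", "Candy"))

# Collapse "Cake" and "Water" into a single "Non-Candy" level
levels(f) <- list(Candy = "Candy", "Non-Candy" = c("Cake", "Water"))

levels(f)        # "Candy" "Non-Candy"
as.character(f)  # "Candy" "Non-Candy" "Non-Candy" "Candy"
```

Only the level labels are rewritten, not the underlying integer codes for every row, which is presumably why this scales well on long vectors.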
This is much more efficient than the ifelse approach, which is likely to be slow:
library(dplyr)
library(microbenchmark)
set.seed(01239)
# resample data to one million rows; var must be a factor for levels() to apply
smp <- data.frame(var = factor(sample(dat$var, 1e6, TRUE)))
timings <- replicate(50, {
  # copy data to facilitate reuse
  cop <- smp
  # approach 1: relabel the factor levels
  t0 <- get_nanotime()
  levs <- setdiff(levels(cop$var), "Candy")
  levels(cop$var) <- list(Candy = "Candy", "Non-Candy" = levs)
  t1 <- get_nanotime() - t0
  cop <- smp
  # approach 2: dplyr with ifelse
  t0 <- get_nanotime()
  cop = cop %>%
    mutate(candy.flag = factor(ifelse(var == "Candy", "Candy", "Non-Candy")))
  t2 <- get_nanotime() - t0
  cop <- smp
  # approach 3: build the factor directly from the logical comparison
  t0 <- get_nanotime()
  cop$var <-
    factor(cop$var == "Candy", labels = c("Non-Candy", "Candy"))
  t3 <- get_nanotime() - t0
  c(levels = t1, dplyr = t2, direct = t3)
})
# median time per approach, then relative to the levels approach
x <- apply(timings, 1, median)
x[2:3]/x[1]
#    dplyr   direct
# 8.894303 4.962791
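Before comparing timings, it may be worth confirming that the three approaches timed above give the same labels. The base R sketch below does this on the ten sample values; the names v, f1, f2 and f3 are illustrative. One difference it shows is the level order of the direct approach.

```r
v <- c("Candy", "Sanitizer", "Candy", "Water", "Cake",
       "Candy", "Ice Cream", "Gum", "Candy", "Coffee")

# 1. relabel the levels of an existing factor
f1 <- factor(v)
levels(f1) <- list(Candy = "Candy",
                   "Non-Candy" = setdiff(levels(f1), "Candy"))

# 2. ifelse on the raw values
f2 <- factor(ifelse(v == "Candy", "Candy", "Non-Candy"))

# 3. factor built directly from the logical comparison;
#    FALSE sorts before TRUE, so "Non-Candy" becomes the first level
f3 <- factor(v == "Candy", labels = c("Non-Candy", "Candy"))

# Same labels row by row
all(as.character(f1) == as.character(f2))  # TRUE
all(as.character(f2) == as.character(f3))  # TRUE

levels(f1)  # "Candy" "Non-Candy"
levels(f3)  # "Non-Candy" "Candy"
```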
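factor(dat$var == "Candy", labels = c("Non-Candy", "Candy")) would also work, although I think resetting the levels is a good option. - Rich Scriven
I haven't benchmarked this, but at least in some cases with multiple conditions, a combination of mutate and a list seems to provide a simple solution: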
library(dplyr)
# assuming that all sweet things fall in one category
dat <- data.frame(var = c("Candy", "Sanitizer", "Candy", "Water", "Cake", "Candy", "Ice Cream", "Gum", "Candy", "Coffee"))
conditions <- list("Candy" = TRUE, "Sanitizer" = FALSE, "Water" = FALSE,
                   "Cake" = TRUE, "Ice Cream" = TRUE, "Gum" = TRUE, "Coffee" = FALSE)
# index by name (as.character guards against factor columns) and flatten to a logical vector
dat %>% mutate(sweet = unlist(conditions[as.character(var)], use.names = FALSE))
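The same lookup idea can be sketched with a named vector instead of a list, which gives a plain logical column without any flattening step. This is only an illustration in base R; the name sweet_lookup is made up, and the sweet/not-sweet assignments are the ones assumed in the answer above.

```r
dat <- data.frame(var = c("Candy", "Sanitizer", "Candy", "Water", "Cake",
                          "Candy", "Ice Cream", "Gum", "Candy", "Coffee"),
                  stringsAsFactors = FALSE)

# Named logical vector: names are the original values, elements the new ones
sweet_lookup <- c("Candy" = TRUE, "Sanitizer" = FALSE, "Water" = FALSE,
                  "Cake" = TRUE, "Ice Cream" = TRUE, "Gum" = TRUE,
                  "Coffee" = FALSE)

# Index by name; unname() drops the names carried over from the lookup
dat$sweet <- unname(sweet_lookup[dat$var])

sum(dat$sweet)  # 7
```

A value missing from the lookup would come back as NA, which makes unmapped categories easy to spot.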
dat %>%
mutate(
var = ifelse(var == "Candy", "Candy", "Non-Candy")
)
With dplyr's case_match:
library(dplyr)
dat %>%
  mutate(var = case_match(var, "Candy" ~ var, .default = "Not Candy"))
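case_match can also express the three-level variant shown earlier with case_when. The sketch below assumes dplyr 1.1.0 or later, where case_match and its .default argument are available; res is an illustrative name.

```r
library(dplyr)

dat <- tibble(var = c("Candy", "Sanitizer", "Candy", "Water", "Cake",
                      "Candy", "Ice Cream", "Gum", "Candy", "Coffee"))

res <- dat %>%
  mutate(var = case_match(var,
                          "Candy" ~ "Candy",
                          "Water" ~ "Water",
                          # .default is a named argument, not a formula
                          .default = "Neither-Water-Nor-Candy"))

count(res, var)
```

Values are matched against the left-hand sides directly, so there is no need to repeat var == in every branch.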
function of var? - Julien