R: Convert to factor with the same level order as in case_when

Question

24

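When doing data analysis, I sometimes need to recode values into factors in order to carry out grouped analysis. I want the order of the factor levels to match the order of the conversions specified in case_when. In this case, the order should be "Excellent" "Good" "Fail". How can I achieve this without tediously spelling it out again, as in levels=c('Excellent', 'Good', 'Fail')?

Thank you very much.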

library(dplyr, warn.conflicts = FALSE)             
                                                   
set.seed(1234)                                     
score <- runif(100, min = 0, max = 100)     
   
Performance <- function(x) {                       
  case_when(                                         
    is.na(x) ~ NA_character_,                          
    x > 80   ~ 'Excellent',                            
    x > 50   ~ 'Good',                                 
    TRUE     ~ 'Fail'                                  
  ) %>% factor(levels=c('Excellent', 'Good', 'Fail'))
}                                                  
                                                   
performance <- Performance(score)                  
levels(performance)                                
#> [1] "Excellent" "Good"      "Fail"
table(performance)                                 
#> performance
#> Excellent      Good      Fail 
#>        15        30        55

- user5068121

1

This is exactly what he doesn't want to do (but is already doing). - De Novo

1

This is a nice solution! - Luke Hayden

Awesome, thank you! - jzadra

2

To allow expressions on the RHS, insert levels = sapply(levels, FUN = eval) in the second-to-last line. This makes it possible to write result = fct_case_when(x < 5 ~ my_vec[3]) without getting "my_vec[3]" back as the result. - Jonas Lindeløv

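Please do not edit solution announcements into the question. If an answer already exists, accept one (i.e. click the "tick" next to it). If your solution is not yet covered by an existing answer, you can also create your own answer and accept it. Compare https://stackoverflow.com/help/self-answer. - Yunnosch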

5 Answers

4

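By default, levels are set in lexicographic order. If you don't want to specify them, you can either set them up so that the lexicographic order is correct (Performance1), or create a levels vector once and use it both when generating the values and when setting the levels (Performance2). I don't know how much effort or tedium either of these saves you, but there they are. Take a look at my third suggestion, which I think is the least tedious way.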

Performance1 <- function(x) {                       
  case_when(
    is.na(x) ~ NA_character_,                          
    x > 80 ~ 'Excellent',  
    x <= 50 ~ 'Fail',
    TRUE ~ 'Good'
  ) %>% factor()
}

Performance2 <- function(x, levels = c("Excellent", "Good", "Fail")){
  case_when(
    is.na(x) ~ NA_character_,
    x > 80 ~ levels[1],
    x > 50 ~ levels[2],
    TRUE ~ levels[3]
  ) %>% factor(levels)
}
performance1 <- Performance1(score)
levels(performance1)
# [1] "Excellent" "Fail"     "Good"
table(performance1)
# performance1
# Excellent      Fail      Good 
#        15        55        30 

performance2 <- Performance2(score)
levels(performance2)
# [1] "Excellent" "Good"      "Fail"  
table(performance2)
# performance2
# Excellent      Good      Fail 
#        15        30        55

If I may suggest a less tedious way:

performance <- cut(score, breaks = c(0, 50, 80, 100), 
                   labels = c("Fail", "Good", "Excellent"))
levels(performance)
# [1] "Fail"      "Good"      "Excellent"
table(performance)
# performance
#      Fail      Good Excellent 
#        55        30        15
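
A small sketch of two details of the cut() approach, using a hand-picked `scores` vector (an assumption for illustration, not the data above). With the default `right = TRUE` the bins are (0,50], (50,80] and (80,100], which matches the `x > 80` / `x > 50` conditions, but an exact 0 falls outside the first bin unless `include.lowest = TRUE` is set. The level order can then be reversed from the existing levels, so the labels are not typed a second time; `forcats::fct_rev()` should give the same reversal if forcats is available.

```r
scores <- c(0, 12, 50, 50.1, 80, 80.5, 100)  # illustrative values only

# right = TRUE (the default) gives (0,50], (50,80], (80,100];
# include.lowest = TRUE keeps an exact 0 in the first bin instead of NA.
binned <- cut(scores, breaks = c(0, 50, 80, 100),
              labels = c("Fail", "Good", "Excellent"),
              include.lowest = TRUE)

# Reverse the level order without retyping the labels.
performance_desc <- factor(binned, levels = rev(levels(binned)))
levels(performance_desc)
#> [1] "Excellent" "Good"      "Fail"
```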

- De Novo

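I think Performance2 is close to what I need. Is there any function in dplyr or forcats that can do this in one step, that is, without having to save the levels first? Also, cut is handy for converting numeric values to factors, although in this case it reverses the order (which is easily corrected with forcats::fct_rev). Thanks. - user5068121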

1

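I think the drawback of Performance2 is that the corresponding conversion is not immediately visible. For example, on seeing x > 80 ~ levels[1], we have to find the levels vector and look at its first element to work out that x > 80 corresponds to "Excellent". So it is convenient for programming but, in my opinion, less readable. It would be great if someone could offer a solution that is both convenient to program and easy to read. - user5068121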

1

Although my solution replaces your pipe with a messy intermediate variable, this approach works:

library(dplyr, warn.conflicts = FALSE)

set.seed(1234)                                     
score <- runif(100, min = 0, max = 100)     

Performance <- function(x) {                       
  t <- case_when(                                         
    is.na(x) ~ NA_character_,                          
    x > 80   ~ 'Excellent',                            
    x > 50   ~ 'Good',                                 
    TRUE     ~ 'Fail'                                  
  ) 
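  # Unique labels, in order of first appearance
  to <- subset(t, !duplicated(t))
  # Order those labels by the score at which each first appears, highest first
  factor(t, levels = to[order(subset(x, !duplicated(t)), decreasing = TRUE)])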
}                                                  
performance <- Performance(score)                
levels(performance)

Edited to fix!

- Luke Hayden

This does not work. It produces the error: factor level [2] is duplicated. - user5068121

This works. But it seems complicated and does not save much typing. Thanks anyway! - user5068121

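I found that this method does not always work. For example, when score is rbinom(10, size = 9, prob = .5) and the conditions are changed to x %% 2 == 1 ~ 'Odd', x %% 2 == 0 ~ 'Even', the level order is sometimes Odd Even and sometimes Even Odd, so it does not always match the order specified in case_when. You are using order, so I guess this approach only works when the values have a meaningful ordering. - user5068121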

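Hmm, I think a better approach would be to create a list of two vectors, one holding the ordered thresholds and the other the factor describing the conditions, and then pass this list to the function as an argument. That would make it possible if you want to make the function fully general. - Luke Hayden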
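
A rough sketch of how that list-based idea might look. The names `recode_by_spec` and `spec` are hypothetical, and the sketch assumes strict `>` comparisons against thresholds given in descending order, with one more label than thresholds (the last label is the fallback), as in the question.

```r
# Hypothetical helper: `spec` pairs descending thresholds with labels.
recode_by_spec <- function(x, spec) {
  stopifnot(length(spec$labels) == length(spec$thresholds) + 1)
  # Start with the fallback (last) label everywhere.
  out <- rep(spec$labels[length(spec$labels)], length(x))
  # Walk from the lowest threshold to the highest so that
  # higher thresholds overwrite lower ones.
  for (i in rev(seq_along(spec$thresholds))) {
    out[!is.na(x) & x > spec$thresholds[i]] <- spec$labels[i]
  }
  out[is.na(x)] <- NA
  # Level order is taken directly from the label vector.
  factor(out, levels = spec$labels)
}

spec <- list(thresholds = c(80, 50),
             labels = c("Excellent", "Good", "Fail"))
performance <- recode_by_spec(c(95, 60, 10, NA), spec)
levels(performance)
#> [1] "Excellent" "Good"      "Fail"
```

The labels are written once, but the pairing between a threshold and its label is positional, so this shares the readability concern raised about Performance2.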

1

Here is the implementation I have been using:

library(dplyr)
library(purrr)
library(rlang)
library(forcats)

factored_case_when <- function(...) {
  args <- list2(...)
  rhs <- map(args, f_rhs)
  
  cases <- case_when(
    !!!args
  )
  
  exec(fct_relevel, cases, !!!rhs)
}


numbers <- c(2, 7, 4, 3, 8, 9, 3, 5, 2, 7, 5, 4, 1, 9, 8)

factored_case_when(
  numbers <= 2 ~ "Very small",
  numbers <= 3 ~ "Small",
  numbers <= 6 ~ "Medium",
  numbers <= 8 ~ "Large",
  TRUE    ~ "Huge!"
)
#>  [1] Very small Large      Medium     Small      Large      Huge!     
#>  [7] Small      Medium     Very small Large      Medium     Medium    
#> [13] Very small Huge!      Large     
#> Levels: Very small Small Medium Large Huge!

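This has the advantage that the factor levels do not have to be specified manually.

I have also filed a feature request with dplyr to get this functionality added: https://github.com/tidyverse/dplyr/issues/6029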

- snakeoilsales

1

Use case_when() to output numbers, and use the labels argument in factor():

library(dplyr, warn.conflicts = FALSE)
set.seed(1234)
score <- runif(100, min = 0, max = 100)

Performance <- function(x) {
  case_when(
    is.na(x) ~ NA_real_,
    x > 80   ~ 1,
    x > 50   ~ 2,
    TRUE     ~ 3
  ) %>% factor(labels=c('Excellent', 'Good', 'Fail'))
}

performance <- Performance(score)
levels(performance)
#> [1] "Excellent" "Good"      "Fail"
table(performance)
#> performance
#> Excellent      Good      Fail 
#>        15        30        55

Created on 2023-01-13 with reprex v2.0.2
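
One caveat that may be worth checking with this pattern: when only `labels` is given, factor() derives the levels from the values actually present, so a category that never occurs changes the number of levels and the labels no longer fit. The sketch below uses a made-up `codes` vector (an assumption for illustration) in which nobody scores above 80, and shows that passing `levels` alongside `labels` keeps the mapping fixed.

```r
codes <- c(2, 3, 3, 2)  # illustrative: code 1 ("Excellent") never occurs
labs  <- c("Excellent", "Good", "Fail")

# Without `levels`, factor() sees two distinct values and rejects three labels.
res <- tryCatch(factor(codes, labels = labs), error = function(e) "error")
res
#> [1] "error"

# Spelling out the full set of codes keeps codes and labels aligned.
f <- factor(codes, levels = seq_along(labs), labels = labs)
levels(f)
#> [1] "Excellent" "Good"      "Fail"
as.character(f)
#> [1] "Good" "Fail" "Fail" "Good"
```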

- its.me.adam

1

Thanks for this suggestion. I just think it is a pity that case_when has no option to order the factor levels the way the labels are defined.

1

In progress: https://github.com/tidyverse/forcats/issues/298

1

Nice! It may be included in a future release.


- user5068121 · Accepted Answer

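My Solution

In the end, I came up with a solution. For those who are interested, here it is. I wrote a function called fct_case_when (pretending it is a function in forcats). It is just a wrapper around case_when with factor output. The order of the levels is the same as the order of the arguments.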

fct_case_when <- function(...) {
  args <- as.list(match.call())
  levels <- sapply(args[-1], function(f) f[[3]])  # extract RHS of formula
  levels <- levels[!is.na(levels)]
  factor(dplyr::case_when(...), levels=levels)
}

Now I can use fct_case_when in place of case_when, and the result is the same as with the previous implementation (but less tedious).

Performance <- function(x) {                       
  fct_case_when(                                         
    is.na(x) ~ NA_character_,                          
    x > 80   ~ 'Excellent',                            
    x > 50   ~ 'Good',                                 
    TRUE     ~ 'Fail'                                  
  )
}      
performance <- Performance(score)                  
levels(performance)                       
#> [1] "Excellent" "Good"      "Fail"
table(performance)                
#> performance
#> Excellent      Good      Fail 
#>        15        30        55
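
For readers wondering why `f[[3]]` yields the labels in argument order, here is a minimal base R sketch. It only inspects quoted formulas and does not call case_when; `cases`, `my_vec` and `expr_rhs` are illustrative names. A two-sided formula call has three parts (the `~` operator, the left-hand side, the right-hand side), so element 3 is the RHS. The second half shows why an expression on the RHS stays unevaluated until eval() is applied to it.

```r
# Quoted case_when-style arguments (not evaluated, so `x` need not exist).
cases <- alist(
  x > 80 ~ "Excellent",
  x > 50 ~ "Good",
  TRUE   ~ "Fail"
)

# Element 3 of each formula call is its right-hand side.
rhs <- lapply(cases, function(f) f[[3]])
unlist(rhs)
#> [1] "Excellent" "Good"      "Fail"

# An expression on the RHS is captured as a call, not as its value.
my_vec <- c("a", "b", "c")
expr_rhs <- quote(x < 5 ~ my_vec[3])[[3]]
deparse(expr_rhs)
#> [1] "my_vec[3]"
eval(expr_rhs)
#> [1] "c"
```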