Create groups from a continuous variable using increments

Question

3

I am trying to create categorical groups from a continuous variable by increments.

score <- sample(1:100,20,replace=TRUE)
df <- data.frame(score)

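I want to create new categorical columns based on the score column, in increments of 20 (upper bound not included). It would look like the code below, and I would also like the new columns to be named in this format.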

library(dplyr)  # provides %>%, mutate() and case_when()

df <- df %>%
  mutate(G1_0_20 = case_when(score >= 0 & score <20 ~ 1),
         G2_20_40 = case_when(score >= 20 & score < 40 ~ 1),
         G3_40_60 = case_when(score >= 40 & score < 60 ~ 1),
         G4_60_80 = case_when(score >= 60 & score < 80 ~ 1),
         G5_80_100 = case_when(score >= 80 & score < 100 ~ 1))
df[is.na(df)] <- 0
df

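I am wondering whether there is a simpler, faster way to do this, because my real dataset needs a group for every 20 units across values from 0 to 4000.

Also, what if I want groups of 20 for values from 0 to 100, and then groups of 100 for values from 200 to 300?

Thank you very much for all your help!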

- Bruh

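This doesn't exactly answer your question, but it may be of value: you can easily create these categories with findInterval(df$score, c(seq(20, 100, 20), 200, 300)). In some cases this will be the preferred format, for example if you are running a model and want to include age categories rather than continuous age. Good luck! - jpsmith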
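As a minimal sketch of what the findInterval() call in the comment above returns, the snippet below applies it to a few made-up scores (the score values are assumptions for illustration, not data from the question). Each score is mapped to the number of break points that are less than or equal to it.

```r
# Hypothetical scores chosen to hit several bins and one boundary value
score <- c(5, 20, 39, 85, 100, 250)

# Break points from the comment: 20, 40, ..., 100, then 200 and 300
breaks <- c(seq(20, 100, 20), 200, 300)

# findInterval() counts how many break points are <= each score,
# so values below 20 get 0, values in [20, 40) get 1, and so on
bin <- findInterval(score, breaks)
data.frame(score, bin)
```

The result is a single integer code per row rather than one 0/1 column per group, which is often what a model formula wants once the code is wrapped in factor().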

3 Answers

2

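For this we can use dplyover::over() and create a sequence to loop over with seq(). Disclaimer: the package is not on CRAN, and I am the maintainer.

The .names argument lets us create nice names on the fly: {x_idx} gives the index of the current element and {x} gives the value being iterated over.

The example below uses a sequence from 20 to 100, but we can generate any sequence by changing the numbers.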

set.seed(123)
score <- sample(1:100,20,replace=TRUE)
df <- data.frame(score)

library(dplyr)
library(dplyover) # https://timteafan.github.io/dplyover/

df %>% 
  mutate(over(seq(20, 100, 20),
              # lower bound inclusive, upper bound exclusive, as in the question
              ~ if_else(score >= (.x - 20) & score < .x, 1, 0),
              .names = "G{x_idx}_{x - 20}_{x}"
  ))
#>    score G1_0_20 G2_20_40 G3_40_60 G4_60_80 G5_80_100
#> 1     31       0        1        0        0         0
#> 2     79       0        0        0        1         0
#> 3     51       0        0        1        0         0
#> 4     14       1        0        0        0         0
#> 5     67       0        0        0        1         0
#> 6     42       0        0        1        0         0
#> 7     50       0        0        1        0         0
#> 8     43       0        0        1        0         0
#> 9     14       1        0        0        0         0
#> 10    25       0        1        0        0         0
#> 11    90       0        0        0        0         1
#> 12    91       0        0        0        0         1
#> 13    69       0        0        0        1         0
#> 14    91       0        0        0        0         1
#> 15    57       0        0        1        0         0
#> 16    92       0        0        0        0         1
#> 17     9       1        0        0        0         0
#> 18    93       0        0        0        0         1
#> 19    99       0        0        0        0         1
#> 20    72       0        0        0        1         0

Created on 2023-02-27 by the reprex package (v2.0.1)

- TimTeaFan

1

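This also works really well! I have the same issue as with the answer @akrun provided: how do I make the prefix G1 for group 1, G2 for group 2, and so on? - Bruh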

1

@Bruh: We can use {x_idx} in the .names argument to increment the G. - TimTeaFan

2

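In base R:

# right = FALSE gives [0,20), [20,40), ... as in the question; dig.lab = 4
# keeps labels such as 1000 from being printed as 1e+03
a <- cut(df$score, seq(0, 4000, 20), right = FALSE, dig.lab = 4)
# "[0,20)" becomes "1_0_20"; model.matrix() prepends the variable name G
G <- paste0(as.integer(a), sub("\\[(\\d+),(\\d+)\\)", "_\\1_\\2", a))
data.frame(score = df$score, model.matrix(~G+0))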

  score G1_0_20 G2_20_40 G3_40_60 G4_60_80 G5_80_100
1     31       0        1        0        0         0
2     79       0        0        0        1         0
3     51       0        0        1        0         0
4     14       1        0        0        0         0
5     67       0        0        0        1         0
6     42       0        0        1        0         0
7     50       0        0        1        0         0
8     43       0        0        1        0         0
9     14       1        0        0        0         0
10    25       0        1        0        0         0
11    90       0        0        0        0         1
12    91       0        0        0        0         1
13    69       0        0        0        1         0
14    91       0        0        0        0         1
15    57       0        0        1        0         0
16    92       0        0        0        0         1
17     9       1        0        0        0         0
18    93       0        0        0        0         1
19    99       0        0        0        0         1
20    72       0        0        0        1         0

- Onyambu
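The second part of the question (steps of 20 from 0 to 100, then one group from 200 to 300) is not covered by a regular seq(). A minimal base R sketch under stated assumptions: the vectors lower and upper and the example scores are hypothetical names and values introduced here, and each group is taken to include its lower edge and exclude its upper edge, as in the question's case_when().

```r
# Hypothetical scores; 150 falls in the gap between 100 and 200
score <- c(5, 20, 99, 150, 200, 299)

# Lower and upper edges of each group: steps of 20 up to 100, then 200 to 300
lower <- c(seq(0, 80, 20), 200)
upper <- c(seq(20, 100, 20), 300)

# One 0/1 column per group; lower edge inclusive, upper edge exclusive
dummies <- sapply(seq_along(lower), function(i) {
  as.integer(score >= lower[i] & score < upper[i])
})
colnames(dummies) <- paste0("G", seq_along(lower), "_", lower, "_", upper)

out <- data.frame(score, dummies)
out
```

Scores that fall outside every group, such as 150 here, simply get 0 in all columns. Listing the edges explicitly keeps the column names tied to the actual bounds, whatever the spacing.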

- akrun · Accepted Answer

We can create the groups with cut and then create the dummy columns with dummy_cols from fastDummies.

library(stringr)
library(dplyr)
library(fastDummies)
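df %>%
   # cut() intervals are right-closed by default, e.g. (0,20]; dig.lab = 4
   # keeps labels such as 1000 from being printed as 1e+03
   mutate(grp = cut(score, breaks = c(-Inf, seq(0, 4000, by = 20), Inf),
                    dig.lab = 4),
      # as.integer(droplevels(grp)) numbers only the groups present in the data
      grp = str_c("G", as.integer(droplevels(grp)), '_',
                  str_replace(grp, '\\((\\d+),(\\d+)\\]', '\\1_\\2'))) %>%
   # one 0/1 column per group, dropping the helper column
   dummy_cols("grp", remove_selected_columns = TRUE) %>%
   rename_with(~ str_remove(.x, 'grp_'), starts_with('grp_'))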

-output

    score G1_0_20 G2_20_40 G3_40_60 G4_60_80 G5_80_100
1     20       1        0        0        0         0
2     99       0        0        0        0         1
3     44       0        0        1        0         0
4     28       0        1        0        0         0
5     63       0        0        0        1         0
6     88       0        0        0        0         1
7     44       0        0        1        0         0
8     59       0        0        1        0         0
9    100       0        0        0        0         1
10    55       0        0        1        0         0
11    37       0        1        0        0         0
12    54       0        0        1        0         0
13     6       1        0        0        0         0
14     7       1        0        0        0         0
15    48       0        0        1        0         0
16    88       0        0        0        0         1
17    97       0        0        0        0         1
18    10       1        0        0        0         0
19    65       0        0        0        1         0
20    18       1        0        0        0         0
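In the output above, a score of 20 lands in G1_0_20 and a score of 100 in G5_80_100, because cut() closes its intervals on the right by default. The question's case_when() instead puts 20 in the 20 to 40 group. A small sketch of the difference, using made-up values around one boundary:

```r
x <- c(19, 20, 21)
breaks <- seq(0, 40, 20)

# Default: intervals are closed on the right, so 20 falls in (0,20]
right_closed <- as.character(cut(x, breaks))

# right = FALSE closes intervals on the left, so 20 falls in [20,40)
left_closed <- as.character(cut(x, breaks, right = FALSE))

data.frame(x, right_closed, left_closed)
```

If right = FALSE is used in a pipeline like the one above, the pattern that parses the labels has to match "[a,b)" rather than "(a,b]".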