如何在R中基于两列数据创建虚拟变量

5
假设我有一个数据框: 性别可以是 F 表示女性,M 表示男性 种族可以是 A 表示亚洲人,W 表示白人,B 表示黑人和 H 表示西班牙裔。
| id | Gender | Race |
| --- | ----- | ---- |
| 1   | F    | W |
| 2   | F    | B |
| 3   | M    | A |
| 4   | F    | B |
| 5   | M    | W |
| 6   | M    | B |
| 7   | F    | H |

我希望能够根据性别和种族创建一组虚拟列,数据框应该类似于:

| id | Gender | Race | F_W | F_B | F_A | F_H | M_W | M_B | M_A | M_H |
| --- | ----- | ---- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1   | F    | W   |  1  |  0  |  0  |  0  |  0  |  0  |  0  |  0  |
| 2   | F    | B   |  0  |  1  |  0  |  0  |  0  |  0  |  0  |  0  |
| 3   | M    | A   |  0  |  0  |  0  |  0  |  0  |  0  |  1  |  0  |
| 4   | F    | B   |  0  |  1  |  0  |  0  |  0  |  0  |  0  |  0  |
| 5   | M    | W   |  0  |  0  |  0  |  0  |  1  |  0  |  0  |  0  |
| 6   | M    | B   |  0  |  0  |  0  |  0  |  0  |  1  |  0  |  0  |
| 7   | F    | H   |  0  |  0  |  0  |  1  |  0  |  0  |  0  |  0  |

我的实际数据包含的类别比这个例子要多得多,因此我非常感谢您能以更整洁的方式呈现它。 语言是R。 感谢您的帮助。

5个回答

4
除了列名之外,您可以使用model.matrix函数和仅表达交互项的公式,并减去截距来获得此内容:
> dm = cbind(d,model.matrix(~Gender:Race-1, data=d))
> dm
   id Gender Race GenderF:RaceA GenderM:RaceA GenderF:RaceB GenderM:RaceB
1   1      F    H             0             0             0             0
2   2      M    H             0             0             0             0
3   3      M    W             0             0             0             0
4   4      F    H             0             0             0             0
5   5      M    H             0             0             0             0
[etc]

如果你关心确切的名称,通过一些字符串处理很容易解决。

> names(dm)[-(1:3)] = sub("Gender","",sub("Race","",sub(":","_",names(dm)[-(1:3)])))
> dm
   id Gender Race F_A M_A F_B M_B F_H M_H F_W M_W
1   1      F    H   0   0   0   0   1   0   0   0
2   2      M    H   0   0   0   0   0   1   0   0
3   3      M    W   0   0   0   0   0   0   0   1
4   4      F    H   0   0   0   0   1   0   0   0
5   5      M    H   0   0   0   0   0   1   0   0
6   6      F    H   0   0   0   0   1   0   0   0
7   7      F    H   0   0   0   0   1   0   0   0
8   8      M    A   0   1   0   0   0   0   0   0
9   9      M    W   0   0   0   0   0   0   0   1
10 10      F    B   0   0   1   0   0   0   0   0

如果您关心列顺序...


3

我想你可以使用以下解决方案。它只比你要求的输出少2个变量,但无论如何输出都将为零。由于 pivot_wider 将扩展数据集中可以找到的所有组合。

library(dplyr)
library(tidyr)

df %>%
  mutate(grp = 1) %>%
  pivot_wider(names_from = c(Gender, Race), values_from = grp, 
              values_fill = 0, names_glue = "{Gender}_{Race}") %>%
  right_join(df, by = "id") %>%
  relocate(id, Gender, Race)

# A tibble: 7 x 9
     id Gender Race    F_W   F_B   M_A   M_W   M_B   F_H
  <int> <chr>  <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1     1 F      W         1     0     0     0     0     0
2     2 F      B         0     1     0     0     0     0
3     3 M      A         0     0     1     0     0     0
4     4 F      B         0     1     0     0     0     0
5     5 M      W         0     0     0     1     0     0
6     6 M      B         0     0     0     0     1     0
7     7 F      H         0     0     0     0     0     1

1
很好,你恢复了你的解决方案! - TarJae
1
是的,我对此有不好的预感。这并不是错误的答案。 - Anoushiravan R
1
谢谢同样的支持! - xxx

3
除了 Anoushiravan R. 提供的 tidyverse 解决方案外,这里还有另一种选项,使用 unitepivot_wideracrosscase_when
library(tidyverse)
  df %>% 
    unite(comb, Gender:Race, remove = FALSE) %>% 
    pivot_wider(
      names_from = comb,
      values_from = comb
    ) %>% 
    mutate(across(c(F_W, F_B, M_A, M_W, M_B, F_H), 
                  ~ case_when(is.na(.) ~ 0, 
                              TRUE ~ 1)))

输出:

  id    Gender Race    F_W   F_B   M_A   M_W   M_B   F_H
  <chr> <chr>  <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1     F      W         1     0     0     0     0     0
2 2     F      B         0     1     0     0     0     0
3 3     M      A         0     0     1     0     0     0
4 4     F      B         0     1     0     0     0     0
5 5     M      W         0     0     0     1     0     0
6 6     M      B         0     0     0     0     1     0
7 7     F      H         0     0     0     0     0     1

3

使用tablebase R选项

 cbind(df1, as.data.frame.matrix(table(transform(df1, 
    GenderRace = paste(Gender, Race, sep = "_"))[c("id", "GenderRace")])))
  id Gender Race F_B F_H F_W M_A M_B M_W
1  1      F    W   0   0   1   0   0   0
2  2      F    B   1   0   0   0   0   0
3  3      M    A   0   0   0   1   0   0
4  4      F    B   1   0   0   0   0   0
5  5      M    W   0   0   0   0   0   1
6  6      M    B   0   0   0   0   1   0
7  7      F    H   0   1   0   0   0   0

数据

df1 <- structure(list(id = 1:7, Gender = c("F", "F", "M", "F", "M", 
"M", "F"), Race = c("W", "B", "A", "B", "W", "B", "H")), 
class = "data.frame", row.names = c(NA, 
-7L))

3

另一个使用基本R选项的方法是xtabs

cbind(
    df,
    as.data.frame.matrix(
        xtabs(
            ~ id + q,
            transform(
                df,
                q = paste0(Gender, "_", Race)
            )
        )
    )
)

提供

  id Gender Race F_B F_H F_W M_A M_B M_W
1  1      F    W   0   0   1   0   0   0
2  2      F    B   1   0   0   0   0   0
3  3      M    A   0   0   0   1   0   0
4  4      F    B   1   0   0   0   0   0
5  5      M    W   0   0   0   0   0   1
6  6      M    B   0   0   0   0   1   0
7  7      F    H   0   1   0   0   0   0

网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接