Convert a data frame to a binary matrix where the row names are the cell values of the original data frame and the column names are its headers

Question

3

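I have a data frame whose columns are groups, and the cells of each column are the species belonging to that group. I need to convert it into a binary matrix in which the columns are still the headers (groups) but the rows are the species, with 1 if a species was originally in that column's group and 0 otherwise.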

# Load the dplyr package
library(dplyr)

# Create a list of vectors with different lengths
list_of_vectors <- list(
  Z1 = c("E","F","G"),
  Z2 = c("A", "B", "C", "D"),
  Z3 = c("H","I","J","K","L")
)

# Find the maximum length
max_length <- max(sapply(list_of_vectors, length))

# Pad the vectors with NA to make them the same length
padded_vectors <- lapply(list_of_vectors, function(x) c(x, rep(NA, max_length - length(x))))

# Create the data frame using dplyr
df <- as.data.frame(bind_cols(padded_vectors))

I want to go from this:

# data frame
   Z1   Z2    Z3
1   E    A     H
2   F    B     I
3   G    C     J
4   NA   D     K
5   NA   NA    L

to this:

# binary matrix
   Z1   Z2  Z3
E  1    0    0
F  1    0    ...
G  1    0
A  0    1
B  0    1
C  0    1
D  ..   1
H       0    1
I            1
J            ...
K
L

Thanks!

- Pam

1

Note that data.frame(lapply(list_of_vectors, `length<-`, max(lengths(list_of_vectors)))) does the same padding: it uses lengths(list_of_vectors) in place of sapply(..), and `length<-` to extend each vector to the given length with `NA` values.
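A minimal sketch of the padding idea from the comment above, spelled out step by step. The intermediate names `max_len` and `padded` are only illustrative, and `stringsAsFactors = FALSE` is added here as an assumption so the columns stay character on older R versions.

```r
# Same input as in the question
list_of_vectors <- list(
  Z1 = c("E", "F", "G"),
  Z2 = c("A", "B", "C", "D"),
  Z3 = c("H", "I", "J", "K", "L")
)

# lengths() returns the length of every list element in one call
max_len <- max(lengths(list_of_vectors))

# `length<-`(x, value) extends x to `value` elements, filling with NA
padded <- lapply(list_of_vectors, `length<-`, max_len)

# All elements now have equal length, so base data.frame() is enough
df <- data.frame(padded, stringsAsFactors = FALSE)
```

This builds the same shape of data frame as the question's code without needing dplyr.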

2 Answers

1

Perhaps you can use table as shown below

> table(stack(df))[na.omit(unlist(df)), ]
      ind
values Z1 Z2 Z3
     E  1  0  0
     F  1  0  0
     G  1  0  0
     A  0  1  0
     B  0  1  0
     C  0  1  0
     D  0  1  0
     H  0  0  1
     I  0  0  1
     J  0  0  1
     K  0  0  1
     L  0  0  1

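Here the table is indexed directly by na.omit(unlist(df)), which serves as the row names and puts the rows back in the order in which the species appear in the data frame.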
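To make the `table(stack(df))` approach easier to follow, here is a rough breakdown into steps. The names `long`, `tab`, `species` and `m` are illustrative only, and the final `unclass()` is an optional extra (an assumption about what the asker might want) that turns the table into a plain integer matrix.

```r
df <- data.frame(
  Z1 = c("E", "F", "G", NA, NA),
  Z2 = c("A", "B", "C", "D", NA),
  Z3 = c("H", "I", "J", "K", "L"),
  stringsAsFactors = FALSE
)

# stack() reshapes to long form: one row per cell, with columns
# `values` (the species) and `ind` (the column it came from)
long <- stack(df)

# table() cross-tabulates values against ind; NA values are dropped by default
tab <- table(long)

# Species in their original order of appearance, NAs removed
species <- as.character(na.omit(unlist(df)))

# Index rows by name to restore that order, then drop the "table" class
m <- unclass(tab[species, ])
```

Without the row indexing, `table()` would list the species alphabetically instead.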

- ThomasIsCoding

Thanks! This works too. I had tried table(stack()... before, but without success. Great!

- r2evans · Accepted Answer

out <- +sapply(df, `%in%`, x = sort(unique(na.omit(unlist(df)))))
rownames(out) <- sort(unique(na.omit(unlist(df))))
out
#   Z1 Z2 Z3
# A  0  1  0
# B  0  1  0
# C  0  1  0
# D  0  1  0
# E  1  0  0
# F  1  0  0
# G  1  0  0
# H  0  0  1
# I  0  0  1
# J  0  0  1
# K  0  0  1
# L  0  0  1

Or as a one-liner:

with(list(r = sort(unique(na.omit(unlist(df))))), 
     `rownames<-`(+sapply(df, `%in%`, x = r), r))

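Notes:

- I added `na.omit` because I wasn't sure whether you want to know where the `NA`s are. Keep it or drop it, depending on whether you find that useful.
- I added `sort` because I think the result makes more sense visually that way, but it is entirely optional.
- `unique` is not strictly required, but without it you would get rows with duplicate names.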
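The pieces of the one-liner can be unpacked as in the sketch below. The intermediate names `is_member` and `out` are illustrative, and the anonymous function is just a more explicit spelling of passing `` `%in%` `` with `x = r`.

```r
df <- data.frame(
  Z1 = c("E", "F", "G", NA, NA),
  Z2 = c("A", "B", "C", "D", NA),
  Z3 = c("H", "I", "J", "K", "L"),
  stringsAsFactors = FALSE
)

# All distinct, non-NA species, sorted alphabetically
r <- sort(unique(na.omit(unlist(df))))

# For each column, test which species in r occur in it.
# Equivalent to sapply(df, `%in%`, x = r): r is `x`, the column is `table`.
is_member <- sapply(df, function(col) r %in% col)  # logical matrix

# Unary plus coerces TRUE/FALSE to 1/0 while keeping the matrix shape
out <- +is_member
rownames(out) <- r
```

The result has one row per species in `r` and one column per group.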

Finally, this is a "presence" indicator, meaning that if a letter is repeated within a column we still only see `1`.

df$Z2[1] <- "B"
with(list(r = sort(unique(na.omit(unlist(df))))), `rownames<-`(+sapply(df, `%in%`, x = r), r))
#   Z1 Z2 Z3
# B  0  1  0
# C  0  1  0
# D  0  1  0
# E  1  0  0
# F  1  0  0
# G  1  0  0
# H  0  0  1
# I  0  0  1
# J  0  0  1
# K  0  0  1
# L  0  0  1

If you need it to be a "count" instead, then we need

with(list(r = sort(unique(na.omit(unlist(df))))), 
     `rownames<-`(sapply(df, function(col) colSums(outer(col, r, `==`), na.rm = TRUE)), r))
#   Z1 Z2 Z3
# B  0  2  0
# C  0  1  0
# D  0  1  0
# E  1  0  0
# F  1  0  0
# G  1  0  0
# H  0  0  1
# I  0  0  1
# J  0  0  1
# K  0  0  1
# L  0  0  1
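As a rough illustration of how the counting version works, the sketch below applies the `outer` step to a single column and compares it with a `table(stack(df))` cross-tabulation, which also counts duplicates. The names `eq`, `z2_counts` and `counts` are illustrative, and using `table` for counts is an alternative suggested here, not part of the answer above.

```r
# Data with "B" appearing twice in Z2, as in the modified example
df <- data.frame(
  Z1 = c("E", "F", "G", NA, NA),
  Z2 = c("B", "B", "C", "D", NA),
  Z3 = c("H", "I", "J", "K", "L"),
  stringsAsFactors = FALSE
)
r <- sort(unique(na.omit(unlist(df))))

# outer() compares every cell of the column with every species:
# rows correspond to cells of Z2, columns to the species in r
eq <- outer(df$Z2, r, `==`)

# Column sums count the matches per species; na.rm skips the NA cells
z2_counts <- colSums(eq, na.rm = TRUE)

# Alternative: cross-tabulate the long form, which counts occurrences directly
counts <- unclass(table(stack(df)))
```

Both routes give a count of 2 for `B` in `Z2`; the `table` version orders rows alphabetically, which happens to match the sorted `r` here.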

Data

df <- structure(list(Z1 = c("E", "F", "G", NA, NA), Z2 = c("A", "B", "C", "D", NA), Z3 = c("H", "I", "J", "K", "L")), row.names = c(NA, -5L), class = "data.frame")