How do I create a combination variable in an R data frame?

Question

3

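I have a data frame in which several variables contain zeros. I need to build an additional variable that returns, for each observation, the combination of variables that are non-zero. For example: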

df <- data.frame(firm = c("firm1", "firm2", "firm3", "firm4", "firm5"),
                 A = c(0, 0, 0, 1, 2),
                 B = c(0, 1, 0, 42, 0),
                 C = c(1, 1, 0, 0, 0))

Now I would like to generate the new variable:

df$varCombination <- c("C", "B-C", NA, "A-B", "A")

I came up with something like the following, but it obviously did not work:

for (i in 1:nrow(df)){
    df$varCombination[i] <- paste(names(df[i,2:ncol(df) & > 0]), collapse = "-")
}

- Antti

3 Answers

5

Using apply:

# paste column names
df$varCombination <- 
  apply(df[,2:ncol(df)]>0, 1,
        function(i)paste(colnames(df[, 2:ncol(df)])[i], collapse = "-"))

# convert blank to NA
df$varCombination[df$varCombination == ""] <- NA

# result
df
#    firm A  B C varCombination
# 1 firm1 0  0 1              C
# 2 firm2 0  1 1            B-C
# 3 firm3 0  0 0           <NA>
# 4 firm4 1 42 0            A-B
# 5 firm5 2  0 0              A

- zx8754

1

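You had the right idea, but the logical comparison in your loop was not correct: 2:ncol(df) & > 0 is not valid R syntax, and the comparison has to be applied to the values in the row rather than to the column indices.

I have tried to keep the code close to what you had; this should work: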

var_names <- names(df)[-1]

df$varCombination <- character(nrow(df))

for (i in 1:nrow(df)){

  # Select only the original variables, so the new character column is not compared
  non_zero_names <- var_names[df[i, var_names] > 0]

  df$varCombination[i] <- paste(non_zero_names, collapse  = '-')

}

> df
   firm A  B C varCombination
1 firm1 0  0 1              C
2 firm2 0  1 1            B-C
3 firm3 0  0 0               
4 firm4 1 42 0            A-B
5 firm5 2  0 0              A
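
The loop leaves an empty string for rows where every variable is zero (firm3 in the output above). If NA is wanted there, as in the desired output in the question, one extra line should be enough. A minimal sketch, reusing the example data from the question:

```r
df <- data.frame(firm = c("firm1", "firm2", "firm3", "firm4", "firm5"),
                 A = c(0, 0, 0, 1, 2),
                 B = c(0, 1, 0, 42, 0),
                 C = c(1, 1, 0, 0, 0))

# Names of the value columns, taken before the new column is added
var_names <- names(df)[-1]
df$varCombination <- character(nrow(df))

for (i in seq_len(nrow(df))) {
  # Logical test on the value columns of row i only
  non_zero_names <- var_names[df[i, var_names] > 0]
  df$varCombination[i] <- paste(non_zero_names, collapse = "-")
}

# All-zero rows produced "", so turn those into NA
df$varCombination[df$varCombination == ""] <- NA
```

Indexing with var_names instead of -1 keeps the comparison restricted to A, B and C once varCombination exists in the data frame.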

- Mhairi McNeill

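Thanks! All of the solutions suggested so far work very well. So it just comes down to my personal taste, and I'm picking your version as the neatest one. It doesn't include the NA replacement, but that's not a blocker. - Antti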

1

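@Antti, this is not just a matter of taste. In R, row-wise operations are counterintuitive because it is a vectorized language. You have chosen by far the slowest solution. Please see some benchmarks in my answer, and take that into account when you define "neatest". - David Arenburg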

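@DavidArenburg I completely agree that looping row by row is not a fast solution in R. But I think the loop makes the code clearer, and I was trying to keep it close to the original code so that the logic would be easier for the asker to follow. - Mhairi McNeill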

1

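I used a loop too, just over columns instead of rows. My loop is very simple and readable, so using a loop is not the point of contention. Anyway, I just wanted to say that the OP's comment didn't make much sense to me. I'm not trying to persuade him to accept any particular answer, though. Which answer to pick is up to him, and I don't care whether mine is accepted. - David Arenburg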

3

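@DavidArenburg, I think you are right. The benchmarks you ran convinced me. In my real application the time saved will not be trivial. I am only just starting to appreciate the trade-off between intuitiveness and efficiency. So I will accept your answer after all. Cheers! - Antti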

- David Arenburg · Accepted Answer

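This could probably be solved easily with apply(df, 1, fun), but for performance reasons I tried to solve it column-wise rather than row-wise (I once saw @alexis_laz do something similar, but I can't find it now).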

## Create a logical matrix
tmp <- df[-1] != 0
## or tmp <- sapply(df[-1], `!=`, 0)

## Preallocate result
res <- rep(NA, nrow(tmp))

## Run per column instead of per row
for(j in colnames(tmp)){
  res[tmp[, j]] <- paste(res[tmp[, j]], j, sep = "-")
}

## Remove the pre-allocated `NA` values from non-NA entries
gsub("NA-", "", res, fixed = TRUE)
# [1] "C"   "B-C" NA    "A-B" "A"
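
The gsub() call above only prints the cleaned vector, so the result still has to be assigned to keep it. A minimal sketch that wraps the same column-wise steps in a helper and stores the result in the data frame; the function name add_var_combination and its id_cols argument are hypothetical and not part of the original answer, and the sketch assumes the value columns contain no NA:

```r
# Hypothetical helper: name and arguments are illustrative assumptions
add_var_combination <- function(df, id_cols = 1) {
  # Logical matrix: TRUE where a value column is non-zero
  tmp <- df[-id_cols] != 0
  res <- rep(NA_character_, nrow(tmp))
  # One pass per column: append the column name to every row where it is non-zero
  for (j in colnames(tmp)) {
    res[tmp[, j]] <- paste(res[tmp[, j]], j, sep = "-")
  }
  # Strip the "NA-" prefix left over from the preallocated NA values
  df$varCombination <- gsub("NA-", "", res, fixed = TRUE)
  df
}

df <- data.frame(firm = c("firm1", "firm2", "firm3", "firm4", "firm5"),
                 A = c(0, 0, 0, 1, 2),
                 B = c(0, 1, 0, 42, 0),
                 C = c(1, 1, 0, 0, 0))
result <- add_var_combination(df)
```

One caveat: gsub() with fixed = TRUE removes every occurrence of "NA-", so a column whose name ends in NA (DNA, for instance) would be mangled when it is followed by another name. sub("^NA-", "", res) restricts the removal to the leading prefix.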

Some benchmarks on a bigger data set

set.seed(123)
BigDF <- as.data.frame(matrix(sample(0:1, 1e4, replace = TRUE), ncol = 10))

library(microbenchmark)

MM <- function(df) {
  var_names <- names(df)[-1]
  res <- character(nrow(df))
  for (i in 1:nrow(df)){
    non_zero_names <- var_names[df[i, -1] > 0]
    res[i] <- paste(non_zero_names, collapse  = '-')
  }
  res
}

ZX <- function(df) {
  res <- 
    apply(df[,2:ncol(df)]>0, 1,
          function(i)paste(colnames(df[, 2:ncol(df)])[i], collapse = "-"))
  res[res == ""] <- NA
  res
}

DA <- function(df) {
  tmp <- df[-1] != 0
  res <- rep(NA, nrow(tmp))

  for(j in colnames(tmp)){
    res[tmp[, j]] <- paste(res[tmp[, j]], j, sep = "-")
  }
  gsub("NA-", "", res, fixed = TRUE)
}


microbenchmark(MM(BigDF), ZX(BigDF), DA(BigDF))
# Unit: milliseconds
#      expr       min         lq       mean     median         uq        max neval cld
# MM(BigDF) 239.36704 248.737408 253.159460 252.177439 255.144048 289.340528   100   c
# ZX(BigDF)  35.83482  37.617473  38.295425  38.022897  38.357285  76.619853   100  b 
# DA(BigDF)   1.62682   1.662979   1.734723   1.735296   1.761695   2.725659   100 a
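
Timings are only comparable if the three functions compute the same thing. A quick sanity check, sketched with MM(), ZX() and DA() restated from the benchmark above so that the snippet runs on its own. Two differences are normalized before comparing: MM() returns "" where the others return NA, and the apply() result may carry row names.

```r
set.seed(123)
BigDF <- as.data.frame(matrix(sample(0:1, 1e4, replace = TRUE), ncol = 10))

# Row-wise loop (returns "" for all-zero rows)
MM <- function(df) {
  var_names <- names(df)[-1]
  res <- character(nrow(df))
  for (i in 1:nrow(df)) {
    res[i] <- paste(var_names[df[i, -1] > 0], collapse = "-")
  }
  res
}

# Row-wise apply (returns NA for all-zero rows)
ZX <- function(df) {
  res <- apply(df[, 2:ncol(df)] > 0, 1,
               function(i) paste(colnames(df[, 2:ncol(df)])[i], collapse = "-"))
  res[res == ""] <- NA
  res
}

# Column-wise loop (returns NA for all-zero rows)
DA <- function(df) {
  tmp <- df[-1] != 0
  res <- rep(NA, nrow(tmp))
  for (j in colnames(tmp)) {
    res[tmp[, j]] <- paste(res[tmp[, j]], j, sep = "-")
  }
  gsub("NA-", "", res, fixed = TRUE)
}

mm <- MM(BigDF)
mm[mm == ""] <- NA          # align MM's empty strings with NA
zx <- unname(ZX(BigDF))     # drop any names added by apply()
da <- DA(BigDF)

# Stops with an error if the approaches disagree
stopifnot(identical(mm, da), identical(zx, da))
```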