Combining variables into a list

Question


7

Hi all,

I'm stuck on the following challenge. I have a dataset that looks like this:

BuyerID    Fruit.1     Fruit.2    Fruit.3    Amount.1    Amount.2    Amount.3
879        Banana      Apple                 4           3
765        Strawberry  Apple      Orange     1           2           4
123        Orange      Banana                1           1           1
 11        Strawberry                        3
773        Kiwi        Banana                1           2

What I'd like to do is simplify the data (if possible) and combine the "Fruit" and "Amount" variables:

BuyerID    Fruit                             Amount      Total    Count
879        "Banana" "Apple"                  4  3            7        2
765        "Strawberry" "Apple" "Orange"     1  2  4         7        3
123        "Orange" "Banana"                 1  1  1         3        2
 11        "Strawberry"                      3               3        1
773        "Kiwi" "Banana"                   1  2            3        2

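I have tried c() and rbind(), but they don't produce the result I want. I have also tried the tip from here: data.frame rows to a list, but I'm not sure that is the best way to simplify my data.

The goal is to make it easier to count occurrences of certain items using fewer variables (for example, "60% of buyers purchase a banana").

I hope this is doable, and I'm open to any suggestions. Thank you.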

- jacatra

You'll probably need to use the data.table package: a data.frame can only hold one value per cell. - Señor O

2

This looks like a good candidate for a classic wide-to-long "reshape" solution. @AnandaMahto - where are you? ;-) - thelatemail

5

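data.frames can contain lists too; this is not limited to data.table. You just need to be a little creative when defining them. For example: z <- data.frame(x = 1:5, y = I(lapply(seq_len(5), seq_len))). - mnel

There is also an SO reference on list columns: https://dev59.com/-2kw5IYBdhLWcg3w8e9J#13115651 - mnel

@SeñorO My answer demonstrates how to store vectors in a data.frame and why that is possible (note, though, that it is a bad idea). - Tyler Rinker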
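
As an illustration of the list-column idea from the comments above, here is a minimal sketch in base R. It reuses the `z` example from mnel's comment; the checks underneath are only there to show that each cell really holds a whole vector.

```r
# I() marks the list "as is", so data.frame() keeps it as a single
# list column instead of trying to split it into several columns.
z <- data.frame(x = 1:5, y = I(lapply(seq_len(5), seq_len)))

# Each cell of y holds a complete integer vector.
third_cell   <- z$y[[3]]
cell_lengths <- sapply(z$y, length)
cell_sums    <- sapply(z$y, sum)

print(third_cell)
print(cell_lengths)
print(cell_sums)
```

Because the cells stay numeric, functions such as sum or max can be applied per row without any parsing.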

5 Answers

6

Here is a base-package solution. It is similar to Tyler's solution, but it needs only a single call to apply.

res <- apply(DT,1,function(x){
  data.frame(Fruit= paste(na.omit(x[2:4]),collapse=' '),
             Amount = paste(na.omit(x[5:7]),collapse =','),
             Total = sum(as.numeric(na.omit(x[5:7]))),
             Count = length(na.omit(x[2:4])))
})
do.call(rbind,res)
                    Fruit  Amount Total Count
1            Banana Apple    4, 3     7     2
2 Strawberry Apple Orange 1, 2, 4     7     3
3           Orange Banana 1, 1, 1     3     2
4              Strawberry       3     3     1
5             Kiwi Banana    1, 2     3     2

I would use grepl to replace the hard-coded index numbers, something like this:

 Fruit  <- grepl('Fruit[.][0-9]', colnames(DT))
 Amount <- grepl('Amount[.][0-9]', colnames(DT))

 # then replace x[2:4] with x[Fruit] and x[5:7] with x[Amount]
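
To make the name-based column selection concrete, here is a hedged sketch in base R. The names `fruit_cols`, `amount_cols` and `res` are made up for this illustration, and the use of rowSums for Total and Count is one possible variation rather than the code from the answer above.

```r
DT <- data.frame(
  BuyerID = c(879, 765, 123, 11, 773),
  Fruit.1 = c("Banana", "Strawberry", "Orange", "Strawberry", "Kiwi"),
  Fruit.2 = c("Apple", "Apple", "Banana", NA, "Banana"),
  Fruit.3 = c(NA, "Orange", NA, NA, NA),
  Amount.1 = c(4, 1, 1, 3, 1),
  Amount.2 = c(3, 2, 1, NA, 2),
  Amount.3 = c(NA, 4, 1, NA, NA),
  stringsAsFactors = FALSE)

# Logical masks picked by column name instead of by position
fruit_cols  <- grepl("^Fruit[.][0-9]+$",  colnames(DT))
amount_cols <- grepl("^Amount[.][0-9]+$", colnames(DT))

res <- data.frame(
  BuyerID = DT$BuyerID,
  Fruit   = apply(DT[, fruit_cols], 1,
                  function(x) paste(na.omit(x), collapse = " ")),
  Amount  = apply(DT[, amount_cols], 1,
                  function(x) paste(na.omit(x), collapse = ",")),
  Total   = rowSums(DT[, amount_cols], na.rm = TRUE),
  Count   = rowSums(!is.na(DT[, fruit_cols])),
  stringsAsFactors = FALSE)

print(res)
```

Selecting by name keeps the code working if columns are reordered or a Fruit.4/Amount.4 pair is added later.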

Edit: adding some benchmarks.

library(microbenchmark)
library(data.table)
microbenchmark(ag(),mn(), am(), tr())
Unit: milliseconds
  expr       min        lq    median        uq       max
1 ag() 11.584522 12.268140 12.671484 13.317934 109.13419
2 am()  9.776206 10.515576 10.798504 11.437938 137.44867
3 mn()  6.470190  6.805646  6.974797  7.290722  48.68571
4 tr()  1.759771  1.929870  2.026960  2.142066   7.06032

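For small data frames, Tyler Rinker is the winner!! My guess at the reasons:

The data.table solution (mn) is held back by its use of reshape; data.table generally shows its speed advantage on large data.
The agstudy solution (ag) is slower because it subsets inside every row, whereas Tyler's solution subsets once before calling apply.
The am solution is slower because it uses reshape and merge.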

- agstudy

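Right, reshape is not a data.table function, so mn() is not a pure data.table solution. - Matt Dowle

@agstudy Don't forget that you also chose paste, which is slower than unlist. - Tyler Rinker

I don't think this technique lets you put vectors into cells. What you have now are strings in the cells, which is fairly easy to do, but you lose the ability to work with them easily as numeric vectors; for example, you can no longer use sapply(Amount, max), because what you now have are character vectors. - Tyler Rinker

@TylerRinker Yes, I see your point. Using paste here is a workaround. The OP did not make clear whether he wants an atomic vector or a single string, either. - agstudy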
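
The trade-off discussed in these comments can be sketched as follows. The vectors are typed in by hand from the question's data, and the names `amount_list` and `amount_str` are only illustrative.

```r
# Amounts per buyer stored as numeric vectors (list-column style)
amount_list <- list(c(4, 3), c(1, 2, 4), c(1, 1, 1), 3, c(1, 2))

# The same amounts collapsed into one string per buyer (paste style)
amount_str <- sapply(amount_list, paste, collapse = ",")

# With numeric vectors, per-buyer statistics are a one-liner
max_from_list <- sapply(amount_list, max)

# With strings, each cell has to be split and converted back first
max_from_str <- sapply(strsplit(amount_str, ","),
                       function(s) max(as.numeric(s)))

print(amount_str)
print(max_from_list)
print(max_from_str)
```

Both routes give the same maxima, but the string version needs an extra parsing step for every numeric operation.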

5

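This really isn't a good idea, but here it is done with a base data.frame. It works because a data.frame is really a list of equal-length vectors. You can force a data.frame to store vectors in its cells, but it takes a little finagling. I'd suggest other formats instead, including Marius's suggestion or a list.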

DT <- data.frame(
  BuyerID = c(879,765,123,11,773), 
  Fruit.1 = c('Banana','Strawberry','Orange','Strawberry','Kiwi'),
  Fruit.2 = c('Apple','Apple','Banana',NA,'Banana'),
  Fruit.3 = c( NA, 'Orange',NA,NA,NA),
  Amount.1 = c(4,1,1,3,1), Amount.2 = c(3,2,1,NA,2), Amount.3 = c(NA,4,1,NA,NA),
  stringsAsFactors = FALSE)

DT2 <- DT[, 1, drop=FALSE]
DT2$Fruit <- apply(DT[, 2:4], 1, function(x) unlist(na.omit(x)))
DT2$Amount <- apply(DT[, 5:7], 1, function(x) unlist(na.omit(x)))
DT2$Total <- sapply(DT2$Amount, sum)
DT2$Count <- sapply(DT2$Fruit, length)

Output:

> DT2
  BuyerID                     Fruit  Amount Total Count
1     879             Banana, Apple    4, 3     7     2
2     765 Strawberry, Apple, Orange 1, 2, 4     7     3
3     123            Orange, Banana 1, 1, 1     3     2
4      11                Strawberry       3     3     1
5     773              Kiwi, Banana    1, 2     3     2
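
For the original goal of counting how many buyers purchased a given item, a list of per-buyer fruit vectors is enough. The following is a minimal sketch, assuming the same DT as above; `fruit_cols`, `fruits` and `bought_banana` are names invented for the example.

```r
DT <- data.frame(
  BuyerID = c(879, 765, 123, 11, 773),
  Fruit.1 = c("Banana", "Strawberry", "Orange", "Strawberry", "Kiwi"),
  Fruit.2 = c("Apple", "Apple", "Banana", NA, "Banana"),
  Fruit.3 = c(NA, "Orange", NA, NA, NA),
  stringsAsFactors = FALSE)

fruit_cols <- c("Fruit.1", "Fruit.2", "Fruit.3")

# One character vector per buyer, with the NA entries dropped
fruits <- lapply(seq_len(nrow(DT)), function(i) {
  x <- unlist(DT[i, fruit_cols], use.names = FALSE)
  x[!is.na(x)]
})

# TRUE for every buyer whose vector contains "Banana"
bought_banana <- sapply(fruits, function(f) "Banana" %in% f)

# Share of buyers who bought at least one banana
banana_share <- mean(bought_banana)
print(banana_share)
```

With this sample data, three of the five buyers have Banana in their vector.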

- Tyler Rinker

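Possibly, but the coercion takes some care. Not sure. - Tyler Rinker

I think list(1:3, 1:3, 1:2) is a vector of length 3, so that is fine. - mnel

length(list(1:3, 1:3, 1:2)) is 3, so the concern about equal-length vectors does not apply. I agree that list columns can be painful to work with, so perhaps I don't really have a point, other than being a pedant! - mnel

@TylerRinker +1, because my solution is a bit bloated compared to yours :) I guess subsetting row by row inside apply is just too slow. - agstudy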

4

In addition to the great answers already here, here is another one (using base R only):

with(DT, {
  # Convert to long format
  DTlong <- reshape(DT, direction = "long", 
                    idvar = "BuyerID", varying = 2:ncol(DT))
  # aggregate your fruit columns 
  # You need the `do.call(data.frame, ...)` to convert
  #   the resulting matrix-as-a-column into separate columns
  Agg1 <- do.call(data.frame, 
                  aggregate(Fruit ~ BuyerID, DTlong,
                            function(x) c(Fruit = paste0(x, collapse = " "),
                                          Count = length(x))))
  # aggregate the amount columns
  Agg2 <- aggregate(Amount ~ BuyerID, DTlong, sum)
  # merge the results
  merge(Agg1, Agg2)
})
#   BuyerID             Fruit.Fruit Fruit.Count Amount
# 1      11              Strawberry           1      3
# 2     123           Orange Banana           2      3
# 3     765 Strawberry Apple Orange           3      7
# 4     773             Kiwi Banana           2      3
# 5     879            Banana Apple           2      7

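The basic idea is:

Use reshape to get the data into long format (I think you could actually stop right here).
Use two separate aggregate commands, one for the fruit columns and one for the amount columns. The formula method of aggregate takes care of dropping NAs, but you can use the na.action argument to specify the behavior you want.
Use merge to combine the two.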
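
To illustrate why stopping at the long format can already be enough, here is a small sketch in base R that computes the share of buyers per fruit directly from the reshaped data. The variable `share` and the tapply-based counting are assumptions made for this example, not part of the answer's code.

```r
DT <- data.frame(
  BuyerID = c(879, 765, 123, 11, 773),
  Fruit.1 = c("Banana", "Strawberry", "Orange", "Strawberry", "Kiwi"),
  Fruit.2 = c("Apple", "Apple", "Banana", NA, "Banana"),
  Fruit.3 = c(NA, "Orange", NA, NA, NA),
  Amount.1 = c(4, 1, 1, 3, 1),
  Amount.2 = c(3, 2, 1, NA, 2),
  Amount.3 = c(NA, 4, 1, NA, NA),
  stringsAsFactors = FALSE)

# Wide to long: one row per buyer and fruit slot
DTlong <- reshape(DT, direction = "long",
                  idvar = "BuyerID", varying = 2:ncol(DT))

# Drop the empty slots
DTlong <- DTlong[!is.na(DTlong$Fruit), ]

# Number of distinct buyers per fruit, divided by the number of buyers
share <- tapply(DTlong$BuyerID, DTlong$Fruit,
                function(id) length(unique(id))) / nrow(DT)
print(share)
```

In long format each fruit is a plain value in a single column, so the usual grouping tools work without any list handling.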


- A5C1D2H2I1M1N2O1R2T1

0

This didn't exist when the question was asked, but tidyr is a good fit for this problem.

Reusing the data from @mnel's answer,

library(tidyr)
separator <- ' '
DT %>%
  unite(Fruit, grep("Fruit", names(.)), sep = separator) %>%
  unite(Amount, grep("Amount", names(.)), sep = separator)

#   BuyerID                   Fruit  Amount Total Count
# 1     879         Banana Apple NA  4 3 NA     7     2
# 2     765 Strawberry Apple Orange   1 2 4     7     3
# 3     123        Orange Banana NA   1 1 1     3     2
# 4      11        Strawberry NA NA 3 NA NA     3     1
# 5     773          Kiwi Banana NA  1 2 NA     3     2
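
The NA strings in the output above can be avoided in newer tidyr releases. The sketch below assumes tidyr 1.0.0 or later, where unite() accepts an na.rm argument; the data is rebuilt here without the Total and Count columns, and the intermediate names `step1` and `out` are made up for the example.

```r
library(tidyr)

DT <- data.frame(
  BuyerID = c(879, 765, 123, 11, 773),
  Fruit.1 = c("Banana", "Strawberry", "Orange", "Strawberry", "Kiwi"),
  Fruit.2 = c("Apple", "Apple", "Banana", NA, "Banana"),
  Fruit.3 = c(NA, "Orange", NA, NA, NA),
  Amount.1 = c(4, 1, 1, 3, 1),
  Amount.2 = c(3, 2, 1, NA, 2),
  Amount.3 = c(NA, 4, 1, NA, NA),
  stringsAsFactors = FALSE)

# na.rm = TRUE drops missing values before pasting the columns together
step1 <- unite(DT, "Fruit", starts_with("Fruit"), sep = " ", na.rm = TRUE)
out   <- unite(step1, "Amount", starts_with("Amount"), sep = " ", na.rm = TRUE)

print(out)
```

The result is still one string per cell, so the earlier caveat about losing numeric vectors applies here as well.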

- jaimedash

- mnel · Accepted Answer

Attempting to reproduce your data, and using data.table:

DT  <- data.frame(
  BuyerID = c(879,765,123,11,773), 
  Fruit.1 = c('Banana','Strawberry','Orange','Strawberry','Kiwi'),
  Fruit.2 = c('Apple','Apple','Banana',NA,'Banana'),
  Fruit.3 = c( NA, 'Orange',NA,NA,NA),
  Amount.1 = c(4,1,1,3,1), Amount.2 = c(3,2,1,NA,2), Amount.3 = c(NA,4,1,NA,NA),
  Total = c(7,7,3,3,3), 
  Count = c(2,3,2,1,2), 
  stringsAsFactors = FALSE)

# reshaping to long form and data.table

library(data.table)
DTlong <- data.table(reshape(DT, varying = list(Fruit = 2:4, Amount = 5:7), 
  direction = 'long'))

# create lists (without NA values)
# also adding count and total columns 
# by using <- to save Fruit and Amount for later use

DTlist <- DTlong[, list(Fruit <- list(as.vector(na.omit(Fruit.1))), 
                        Amount <- list(as.vector(na.omit(Amount.1))), 
                        Count  = length(unlist(Fruit)),
                        Total = sum(unlist(Amount))), 
                 by = BuyerID]

  BuyerID                      V1    V2 Count Total
1:     879            Banana,Apple   4,3     2     7
2:     765 Strawberry,Apple,Orange 1,2,4     3     7
3:     123           Orange,Banana 1,1,1     2     3
4:      11              Strawberry     3     1     3
5:     773             Kiwi,Banana   1,2     2     3

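Edit by @RicardoSaporta:

If you prefer, you can skip the reshaping step and use list(list(c(....))). This may save a fair amount of execution time (the downside is that it adds NAs rather than blanks). However, as @Marius pointed out, DTlong above is probably easier to work with.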

DT <- data.table(DT)
DT[,  Fruit := list(list(c( Fruit.1,  Fruit.2,  Fruit.3))), by=BuyerID]
DT[, Amount := list(list(c(Amount.1, Amount.2, Amount.3))), by=BuyerID]

# Or as a single line
DT[,   list( Fruit = list(c( Fruit.1,  Fruit.2,  Fruit.3)), 
            Amount = list(c(Amount.1, Amount.2, Amount.3)), 
            Total, Count),  # other columns used
            by = BuyerID]
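
If the NAs in the list columns are unwanted, they can be dropped while the lists are built. This is a hedged sketch that combines the no-reshape idea with na.omit; it assumes the data.table package is installed, and `DTlist` plus the way Count and Total are derived are choices made for this illustration.

```r
library(data.table)

DT <- data.table(
  BuyerID = c(879, 765, 123, 11, 773),
  Fruit.1 = c("Banana", "Strawberry", "Orange", "Strawberry", "Kiwi"),
  Fruit.2 = c("Apple", "Apple", "Banana", NA, "Banana"),
  Fruit.3 = c(NA, "Orange", NA, NA, NA),
  Amount.1 = c(4, 1, 1, 3, 1),
  Amount.2 = c(3, 2, 1, NA, 2),
  Amount.3 = c(NA, 4, 1, NA, NA))

# Build one list cell per buyer; as.vector() strips the attributes
# that na.omit() attaches to its result.
DTlist <- DT[, list(
  Fruit  = list(as.vector(na.omit(c(Fruit.1, Fruit.2, Fruit.3)))),
  Amount = list(as.vector(na.omit(c(Amount.1, Amount.2, Amount.3))))
), by = BuyerID]

# Derive the summary columns from the list columns
DTlist[, Count := lengths(Fruit)]
DTlist[, Total := sapply(Amount, sum)]

print(DTlist)
```

Here Count and Total are computed from the cleaned lists rather than copied from the input, so they stay consistent with what each cell actually contains.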