How to split a data frame -> apply merge on the subsets -> combine back into a data frame

Question

3

I can't figure out how to do this without using a for loop:

x <- c('a', 'b', 'c', 'd')

> x
[1] "a" "b" "c" "d"

data <- data.frame(
   x=c('a', 'b', 'a', 'b', 'c', 'a', 'a', 'b', 'c', 'd'),
   name=c('one','one', 'two','two','two', 'three', 'four','four','four','four'),
   other=c(1, 4, 5, 3, 2, 4, 5, 6, 3, 2)
)

> data
   x  name other
1  a   one     1
2  b   one     4
3  a   two     5
4  b   two     3
5  c   two     2
6  a three     4
7  a  four     5
8  b  four     6
9  c  four     3
10 d  four     2

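I would like to split data into subgroups by the value of name and merge each subgroup with x, so that the "missing rows" are filled in, giving this result: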

> data
   x  name other
1  a   one     1
2  b   one     4
   c   one     0 <- missing row added
   d   one     0 <- missing row added
3  a   two     5
4  b   two     3
5  c   two     2
   d   two     0 <- missing row added
6  a three     4
   b three     0 <- missing row added
   c three     0 <- missing row added
   d three     0 <- missing row added
7  a  four     5
8  b  four     6
9  c  four     3
10 d  four     2

Finally, reshape the data.frame into the following form:

> data
   x  one  two  three  four
1  a    1    5      4     5
2  b    4    3      0     6
3  c    0    2      0     3
4  d    0    0      0     2

I can do it with a for loop, but I'm sure there must be a better way, such as *apply, by, split or something similar. Any suggestions?

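**Update**

In the end I used a slight modification of the accepted answer (thanks again, mate!), because I don't like using levels and I don't care about the column order.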

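library(reshape2)                                 # provides dcast

grid <- expand.grid(x, unique(data$name))         # every x/name combination
colnames(grid) <- c("x", "name")
data <- merge(grid, data, all.x = TRUE)           # left join keeps all combinations
data[is.na(data)] <- 0                            # missing rows get other = 0
dcast(data, x ~ name, value.var = 'other')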

- thelawnmowerman

1

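Regarding your "finally", see also the xtabs function: xtabs(other ~ x + name, data) gives the wide table, and for the first part you can use as.data.frame(xtabs(other ~ x + name, data)), which gives the long form with the missing rows filled in. - alexis_laz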

Awesome! I hadn't thought of a solution like yours! Now I need to understand how xtabs works. Thanks a lot, mate! - thelawnmowerman

3 Answers

1

More directly:

All you need is reshape2::dcast:

# clean up factor levels for prettier results
data$name <- factor(data$name, levels = c('one', 'two', 'three', 'four'))

library(reshape2)
dcast(data, x ~ name, value.var = 'other', fill = 0)

#   x one two three four
# 1 a   1   5     4    5
# 2 b   4   3     0    6
# 3 c   0   2     0    3
# 4 d   0   0     0    2

As requested:

Following the steps you laid out: first use expand.grid to get all the combinations, then merge with all = TRUE, and finally use reshape2::dcast to rearrange:

df <- merge(data, expand.grid(x, levels(data$name)), 
            by.x = c('x', 'name'), by.y = c('Var1', 'Var2'), all = TRUE)

df[is.na(df)] <- 0         # replace `NA`s with 0
df$name <- factor(df$name, levels = c('one', 'two', 'three', 'four')) # fix order of levels

library(reshape2)
dcast(df, x ~ name, value.var = 'other')

#   x one two three four
# 1 a   1   5     4    5
# 2 b   4   3     0    6
# 3 c   0   2     0    3
# 4 d   0   0     0    2
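
For comparison, the literal split -> merge -> combine route from the question can also be sketched in base R. This is only an illustrative sketch under the question's sample data; the names groups, filled and result are made up for the example.

```r
# Sample data from the question
x <- c('a', 'b', 'c', 'd')
data <- data.frame(
  x = c('a', 'b', 'a', 'b', 'c', 'a', 'a', 'b', 'c', 'd'),
  name = c('one', 'one', 'two', 'two', 'two', 'three', 'four', 'four', 'four', 'four'),
  other = c(1, 4, 5, 3, 2, 4, 5, 6, 3, 2),
  stringsAsFactors = FALSE
)

# Split by name; explicit levels keep the groups in order of first appearance
groups <- split(data, factor(data$name, levels = unique(data$name)))

# Merge each group with the full set of x values (left join), then fill the gaps
filled <- lapply(names(groups), function(nm) {
  g <- merge(data.frame(x = x, stringsAsFactors = FALSE), groups[[nm]],
             by = "x", all.x = TRUE)
  g$name <- nm                    # added rows have NA in name
  g$other[is.na(g$other)] <- 0    # added rows have NA in other
  g
})

# Combine the pieces back into one data frame
result <- do.call(rbind, filled)
rownames(result) <- NULL
```

The result has one row per x/name pair (16 rows here) and can be passed to dcast(result, x ~ name, value.var = 'other') as above. The expand.grid version does the same thing in one merge, which is why it is usually the shorter choice.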

- alistaire

This is almost the solution! But the output data frame has its columns swapped, can you see that? - thelawnmowerman

1

Oh man, you saved my life!!! Thank you so much for your time. I love StackOverflow and its whole community ;-) - thelawnmowerman

0

To answer your first question, you can use expand.grid. The logic applied here is as follows.
Your data:

x=c('a', 'b', 'a', 'b', 'c', 'a', 'a', 'b', 'c', 'd')
name=c('one','one', 'two','two','two', 'three', 'four','four','four','four')
other=c(1, 4, 5, 3, 2, 4, 5, 6, 3, 2)

Convert this into a data frame:

ee<-data.frame(x,name,other)

Now use expand.grid to generate every combination of x and name:

dd<-expand.grid(unique(x), unique(name))

This looks like:

    Var1  Var2
1     a   one
2     b   one
3     c   one
4     d   one
5     a   two
6     b   two
7     c   two
8     d   two
9     a three
10    b three
11    c three
12    d three
13    a  four
14    b  four
15    c  four
16    d  four

All the combinations have now been created. You can now use sqldf or any merge package:

library(sqldf)
ff<-sqldf("select Var1, Var2, ifnull(c.other,0) as other from dd left join ee c on x=Var1 and name=Var2")

So your output is:

    Var1  Var2 other
1     a   one     1
2     b   one     4
3     c   one     0
4     d   one     0
5     a   two     5
6     b   two     3
7     c   two     2
8     d   two     0
9     a three     4
10    b three     0
11    c three     0
12    d three     0
13    a  four     5
14    b  four     6
15    c  four     3
16    d  four     2
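
If you would rather not depend on sqldf, the same left join can be sketched with base merge. This is a minimal sketch, assuming the ee and dd objects defined above; naming the expand.grid columns explicitly and setting stringsAsFactors = FALSE are my additions for the example.

```r
x <- c('a', 'b', 'a', 'b', 'c', 'a', 'a', 'b', 'c', 'd')
name <- c('one', 'one', 'two', 'two', 'two', 'three', 'four', 'four', 'four', 'four')
other <- c(1, 4, 5, 3, 2, 4, 5, 6, 3, 2)

ee <- data.frame(x, name, other, stringsAsFactors = FALSE)
dd <- expand.grid(Var1 = unique(x), Var2 = unique(name), stringsAsFactors = FALSE)

# all.x = TRUE keeps every row of dd, like the SQL left join
ff <- merge(dd, ee, by.x = c("Var1", "Var2"), by.y = c("x", "name"), all.x = TRUE)

# Same role as ifnull(c.other, 0) in the query
ff$other[is.na(ff$other)] <- 0
```

Note that merge sorts the result by the key columns by default, so the rows come out in a different order than in the sqldf output; the contents are the same.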

- CuriousBeing

Thanks! That's a very nice solution. I prefer merge to sqldf, but this is exactly what I had in mind! I'll update my original question with your comment :-) - thelawnmowerman

I really appreciate your help! But @alistaire gave the complete solution, sorry :-( - thelawnmowerman

No problem. Have a nice day. :) - CuriousBeing

- G. Grothendieck · Accepted Answer

Try xtabs. No packages need to be installed.

First, put the levels of name in order so that the columns come out sorted properly:

data$name <- factor(data$name, levels = c("one", "two", "three", "four"))
tab <- xtabs(other ~., data)

This gives the following output, of class c("xtabs", "table"):

> tab
   name
x   one two three four
  a   1   5     4    5
  b   4   3     0    6
  c   0   2     0    3
  d   0   0     0    2

Or use as.data.frame.matrix(tab) if you need the result to have class "data.frame".
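
As a rough sketch of both conversions, assuming the question's sample data: as.data.frame.matrix keeps the x values as row names, so the example moves them into a column, and as.data.frame on the table gives the long form with the missing rows already filled. The names wide and long are just for illustration.

```r
data <- data.frame(
  x = c('a', 'b', 'a', 'b', 'c', 'a', 'a', 'b', 'c', 'd'),
  name = c('one', 'one', 'two', 'two', 'two', 'three', 'four', 'four', 'four', 'four'),
  other = c(1, 4, 5, 3, 2, 4, 5, 6, 3, 2)
)
data$name <- factor(data$name, levels = c("one", "two", "three", "four"))
tab <- xtabs(other ~ ., data)

# Wide form: one column per name, with x moved from the row names into a column
wide <- as.data.frame.matrix(tab)
wide <- cbind(x = rownames(wide), wide)
rownames(wide) <- NULL

# Long form: one row per x/name pair, absent pairs get 0
long <- as.data.frame(tab, responseName = "other")
```

Here wide matches the final layout asked for in the question, and long matches the intermediate one with the "missing rows" added.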