Long-to-wide reshaping of multiple columns in data.table

Question

3

I have a question about using melt and dcast from data.table on multiple columns. I browsed a number of posts on Stack Overflow, but the similar ones are not quite what I need. I explain below.

First, the data is about the causes of problems and their value amounts. Here is part of my data:

ID   Type    Problem1    Value1     Problem2    Value2    Problem3    Value3
1    A       X           500        Y           1000      Z           400
2    A       X           600        Z           700       
3    B       Y           700        Z           100
4    B       W           200        V           200
5    C       Z           500        V           500       
6    C       X           1000       W           100       V           900

Second, ID is unique. Type takes three values (A, B, and C). There are 5 problems.

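Take ID == 1 as an example. It is Type A and has 3 problems (X, Y, and Z). Its Problem X value is 500, its Problem Y value is 1000, and its Problem Z value is 400. Take ID == 5 as another example. It is Type C and has 2 problems (Z and V). Its Problem Z value is 500 and its Problem V value is 500.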

Third, the columns ID, Type, Problem1, Problem2, and Problem3 are character. Value1, Value2, and Value3 are numeric.

The result I want is:

Type    X     Y     Z     W     V
A       1100  1000  1100  0     0   
B       0     700   100   200   200
C       1000  0     500   100   1400

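I am not sure how to explain this properly. I want to group by Type and then sum the values of each problem. I think this is a long-to-wide reshaping task. I found references here and here. The second one may be useful, but I do not know where to start. Any suggestions?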

# data
library(data.table)
dt <- fread("
ID   Type    Problem1    Value1     Problem2    Value2    Problem3    Value3
1    A       X           500        Y           1000      Z           400
2    A       X           600        Z           700       
3    B       Y           700        Z           100
4    B       W           200        V           200
5    C       Z           500        V           500       
6    C       X           1000       W           100       V           900", fill = TRUE)

- Peter Chen

4 Answers

3

This is a simple, brute-force approach, but it works (I hope someone can help improve my solution).

library(magrittr)
rbind(
    dt[, .(Type, P = Problem1, V = Value1)],
    dt[, .(Type, P = Problem2, V = Value2)],
    dt[, .(Type, P = Problem3, V = Value3)]) %>%
    .[P != ""] %>%
    dcast(Type ~ P, value.var = "V", fun.aggregate = sum)

Edit: improved using akrun's code (passing the function to dcast).
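One possible way to tidy up the answer above is to build the stacked table in a loop instead of writing one line per Problem/Value pair. The following is only a sketch: it assumes the column pairs are named Problem1..Problem3 and Value1..Value3 as in the question, and it rebuilds the question's data by hand (with NA for the missing values) so that it runs on its own.

```r
library(data.table)

# The question's data, constructed explicitly (ID omitted, it is not needed here)
dt <- data.table(
  Type     = c("A", "A", "B", "B", "C", "C"),
  Problem1 = c("X", "X", "Y", "W", "Z", "X"),
  Value1   = c(500, 600, 700, 200, 500, 1000),
  Problem2 = c("Y", "Z", "Z", "V", "V", "W"),
  Value2   = c(1000, 700, 100, 200, 500, 100),
  Problem3 = c("Z", "", "", "", "", "V"),
  Value3   = c(400, NA, NA, NA, NA, 900)
)

# One small table per Problem/Value pair, stacked with rbindlist
long <- rbindlist(lapply(1:3, function(k) {
  data.table(Type = dt$Type,
             P    = dt[[paste0("Problem", k)]],
             V    = dt[[paste0("Value", k)]])
}))

# Drop the blank problems, then cast to wide, summing within each Type/P cell
wide <- dcast(long[P != ""], Type ~ P, value.var = "V", fun.aggregate = sum)
```

Changing `1:3` is then enough if more Problem/Value pairs are added later.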

- pogibas

2

If you are using the dplyr/tidyr/pipe tools, I would follow the same logic as the data.table answer and use the gather and spread functions. - thelatemail

dplyr/tidyr/pipe work well. However, data.table is faster on large data. - Peter Chen

3

This can be done easily with dplyr / tidyr:

library("dplyr")
library("tidyr")

# assume x is your data frame; missing problems may be NA or "" depending on how it was read
bind_rows(
  select(x, ID, Type, Problem = Problem1, Value = Value1),
  select(x, ID, Type, Problem = Problem2, Value = Value2),
  select(x, ID, Type, Problem = Problem3, Value = Value3)
) %>%
  filter(!is.na(Problem), Problem != "") %>%
  group_by(Type, Problem) %>%
  summarise(Value = sum(Value)) %>%
  spread(Problem, Value, fill = 0)

Output

# A tibble: 3 x 6
# Groups:   Type [3]
   Type     V     W     X     Y     Z
* <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1     A     0     0  1100  1000  1100
2     B   200   200     0   700   100
3     C  1400   100  1000     0   500

If the order of the columns V - Z matters, this is easily solved by adding a final select statement.
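As a small illustration of that final select step, the sketch below reorders the columns into the order the question asked for. The tibble `res` is a hypothetical stand-in for the pipeline result, typed in by hand from the output shown above.

```r
library(dplyr)

# Hand-built copy of the result shown above (hypothetical name `res`)
res <- tibble(
  Type = c("A", "B", "C"),
  V = c(0, 200, 1400),
  W = c(0, 200, 100),
  X = c(1100, 0, 1000),
  Y = c(1000, 700, 0),
  Z = c(1100, 100, 500)
)

# Reorder the columns to X, Y, Z, W, V as in the question
res_ordered <- select(res, Type, X, Y, Z, W, V)
```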

- Josh Gilfillan

There is also a way to do multiple gathers in dplyr/tidyr - https://dev59.com/r18e5IYBdhLWcg3wRIlQ - thelatemail
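For readers on a recent tidyr, a similar multi-column reshape can probably be written with pivot_longer and its special ".value" name. This is not the method from the linked post, just a sketch under the assumption that tidyr 1.1 or later and dplyr 1.0 or later are installed. The tibble `x` and the helper column name `slot` are made up for illustration.

```r
library(dplyr)
library(tidyr)

# The question's data, with NA for the missing problems and values
x <- tibble(
  ID       = as.character(1:6),
  Type     = c("A", "A", "B", "B", "C", "C"),
  Problem1 = c("X", "X", "Y", "W", "Z", "X"),
  Value1   = c(500, 600, 700, 200, 500, 1000),
  Problem2 = c("Y", "Z", "Z", "V", "V", "W"),
  Value2   = c(1000, 700, 100, 200, 500, 100),
  Problem3 = c("Z", NA, NA, NA, NA, "V"),
  Value3   = c(400, NA, NA, NA, NA, 900)
)

# ".value" keeps Problem and Value as separate columns; the digit goes to `slot`
long <- pivot_longer(
  x,
  cols = -c(ID, Type),
  names_to = c(".value", "slot"),
  names_pattern = "(Problem|Value)(\\d)"
)

# Sum per Type/Problem, then spread the problems back out into columns
wide <- long %>%
  filter(!is.na(Problem)) %>%
  group_by(Type, Problem) %>%
  summarise(Value = sum(Value), .groups = "drop") %>%
  pivot_wider(names_from = Problem, values_from = Value, values_fill = 0)
```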

3

Here is an approach that uses melt as in akrun's answer, and then uses matrix subsetting to return the desired result.

# melt and aggregate the data
temp <- melt(dt, measure = patterns("^Value", "^Problem"),
             value.name = c("Value", "Problem"))[
        !is.na(Value), .(Value=sum(Value)), by=.(Type, Problem)]

# set up the storage matrix
dimNames <- list(sort(unique(temp$Type)), unique(temp$Problem))
myMat <- matrix(0, length(dimNames[[1]]), length(dimNames[[2]]), dimnames=dimNames)

# fill in the matrix with the desired values
myMat[cbind(temp$Type, temp$Problem)] <- temp$Value

This returns the matrix.

myMat
     X    Y   W    Z    V
A 1100 1000   0 1100    0
B    0  700 200  100  200
C 1000    0 100  500 1400

To return a data.table, you can do:

data.table(myMat, keep.rownames=TRUE)
   rn    X    Y   W    Z    V
1:  A 1100 1000   0 1100    0
2:  B    0  700 200  100  200
3:  C 1000    0 100  500 1400
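The fill step in this answer relies on a base R feature that is easy to miss: a two-column character matrix used as an index is matched against the row and column dimnames, one (row, column) pair per index row. A toy sketch, with made-up values, of that mechanism in isolation:

```r
# A 2 x 3 matrix of zeros with named rows and columns
m <- matrix(0, nrow = 2, ncol = 3,
            dimnames = list(c("A", "B"), c("X", "Y", "Z")))

# Each row of idx names one cell: (A, X), (B, Y), (B, Z)
idx <- cbind(c("A", "B", "B"), c("X", "Y", "Z"))

# Assign one value per named cell; all other cells keep their 0
m[idx] <- c(10, 20, 30)
```

Because the aggregated table has exactly one row per Type/Problem pair, each cell is written at most once and the remaining cells stay at 0.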

- lmo


- akrun · Accepted Answer

We can first melt the data into 'long' format by specifying patterns in measure, and then dcast with sum as the fun.aggregate.

dcast(melt(dt, measure = patterns("^Value", "^Problem"), 
    value.name = c("Value", "Problem"))[Problem != ""
     ][, Problem := factor(Problem, levels = c("X", "Y", "Z", "W", "V"))], 
     Type ~Problem, value.var = "Value", sum, na.rm = TRUE)
#   Type    X    Y    Z   W    V
#1:    A 1100 1000 1100   0    0
#2:    B    0  700  100 200  200
#3:    C 1000    0  500 100 1400

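melt from data.table can take multiple patterns in the measure argument. So "^Value" matches all columns whose names start (^) with "Value", and likewise for "Problem", and two 'value' columns are created. In the code above, we name those columns 'Value' and 'Problem' with the value.name argument. Because the dataset has some blanks, the long format also contains blank elements, which we remove with Problem != "".

The next step only matters if the columns need to be in a specific order: we change 'Problem' to a factor and specify the levels in that order. That completes the melt part. dcast then converts the long format to 'wide' by specifying the formula, the value.var column, and the fun.aggregate (here it is sum).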
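To make the intermediate long format concrete, here is a cut-down sketch that uses only the first two rows and the first two Problem/Value pairs of the question's data. The object names `small`, `long` and `wide` are illustrative, and the reversed level order is chosen only to show that the factor levels drive the column order.

```r
library(data.table)

# First two rows and first two Problem/Value pairs of the question's data
small <- data.table(
  ID       = c("1", "2"),
  Type     = c("A", "A"),
  Problem1 = c("X", "X"),
  Value1   = c(500, 600),
  Problem2 = c("Y", "Z"),
  Value2   = c(1000, 700)
)

# Two patterns give two value columns, named through value.name in the same order
long <- melt(small,
             measure.vars = patterns("^Value", "^Problem"),
             value.name = c("Value", "Problem"))
# long has columns ID, Type, variable, Value, Problem: one row per ID and pair

# Factor levels set the column order of the wide result
long[, Problem := factor(Problem, levels = c("Z", "Y", "X"))]

wide <- dcast(long, Type ~ Problem, value.var = "Value", fun.aggregate = sum)
```

The `variable` column records which pair (1 or 2) each row came from. It is not needed for the sum, so it is simply left out of the dcast formula.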