Multiple summary statistics on grouped columns in Julia

Question

I am trying to run the following code in Julia (1.5.3). It is only a simplified representation of what I am trying to do.

using DataFrames
using DataFramesMeta
using RDatasets
using Statistics  # provides mean and median

## setup
iris = dataset("datasets", "iris")
gdf = groupby(iris, :Species)

## Applying the split combine
## This code works fine
combine(gdf, nrow, (valuecols(gdf) .=> mean))

However, it fails when I try to apply more than one summary function.

 combine(gdf, nrow, (valuecols(gdf) .=> [mean, sum]))

Error:

ERROR: DimensionMismatch("arrays could not be broadcast to a common size; got a dimension with lengths 4 and 2")

After some simple debugging, the error suggested changing the code to the following:

combine(gdf, nrow, ([:SepalLength, :PetalLength] .=> [mean,sum]))
## This code works, but it is still not what I want: it gives the mean of SepalLength and the sum of PetalLength,
## not the mean and sum of both columns, which is consistent with the previous error

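After further research I realized we can do it like this. The result is correct, but the output is in long format rather than wide format. I expected this to give me the answer, but it does not work as I hoped.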

 combine(gdf, ([:SepalWidth, :PetalWidth] .=>  x -> ([sum(x), mean(x)])))

 ## The code above works but output is 6x3 DataFrame, I was expecting 3x6 DataFrame

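My question is:

Is there a way to use split-combine so that I get a wide table like the one below (which I produced using "do end" and "combine")? I can live with this solution, but it requires me to type out every column. Is there a way to get all the summary statistics (sum, median, mean, etc.) as columns, for all the columns passed to combine? I hope my question is clear; please point out if it is a duplicate or poorly communicated. Thanks.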

combine(gdf) do x
    return(sw_sum = sum(x.SepalWidth), 
           sw_mean = mean(x.SepalWidth), 
           sp_mean = mean(x.PetalWidth), 
           sp_sum = sum(x.PetalWidth)
          )
end



## My expected answer should be similar to this
#3×5 DataFrame
# Row │ Species     sw_sum   sw_mean  sp_mean  sp_sum
#     │ Cat…        Float64  Float64  Float64  Float64
#─────┼────────────────────────────────────────────────
#   1 │ setosa        171.4    3.428    0.246     12.3
#   2 │ versicolor    138.5    2.77     1.326     66.3
#   3 │ virginica     148.7    2.974    2.026    101.3

Also, this works fine:

 combine(gdf, [:1] .=> [mean, sum, minimum, maximum,median])

But this still throws a dimension error like the one above. I am still thinking this through:

combine(gdf, [:1, :2] .=> [mean, sum, minimum, maximum,median])

- PKumar

1 Answer

- Bogumił Kamiński · Accepted Answer

Do:

 combine(gdf, nrow, vec(valuecols(gdf) .=> [mean sum]))

or

 combine(gdf, nrow, (valuecols(gdf) .=> [mean sum])...)

or

 combine(gdf, nrow, [n => f for n in valuecols(gdf) for f in [mean sum]])

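(Note that there is no comma between mean and sum.)

The reason is that you need to add an extra dimension to broadcast .=> over, so that you get all combinations of inputs.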
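To make the shape argument concrete, here is a minimal sketch that uses only Base and the Statistics standard library. The column names are typed out by hand as a stand-in for valuecols(gdf), and the variable names (cols, mismatch, pairs_matrix, pairs_vector) are illustrative only.

```julia
using Statistics  # provides mean

# Stand-in for valuecols(gdf) on the iris data (typed by hand here)
cols = [:SepalLength, :SepalWidth, :PetalLength, :PetalWidth]

# A 4-element vector against a 2-element vector: lengths 4 and 2 cannot be broadcast
mismatch = try
    cols .=> [mean, sum]
    false
catch err
    err isa DimensionMismatch
end

# A 4-element vector against a 1×2 row (no comma): broadcasts to a 4×2 matrix of pairs
pairs_matrix = cols .=> [mean sum]

# vec flattens the matrix column by column into a vector of 8 pairs
pairs_vector = vec(pairs_matrix)
```

Because the row [mean sum] lies along the second dimension, every column is paired with every function. With the comma, both collections lie along the first dimension and must have matching lengths.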

Edit: ... just iterates over a collection and passes its elements as consecutive positional arguments to the function, e.g.:

julia> f(x...) = x
f (generic function with 1 method)

julia> f(1, [2,3,4]...)
(1, 2, 3, 4)
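Putting the pieces together, here is a small sketch on a hypothetical toy table (not the iris data). It assumes DataFrames.jl is installed and that combine accepts a vector of pairs, as in the answer above. The names df, wide and wide_splat are illustrative only.

```julia
using DataFrames
using Statistics

# Hypothetical toy data standing in for the iris dataset
df = DataFrame(g = ["a", "a", "b", "b"],
               x = [1.0, 2.0, 3.0, 4.0],
               y = [10.0, 20.0, 30.0, 40.0])
gdf = groupby(df, :g)

# valuecols(gdf) is [:x, :y]; broadcasting against the row [mean sum] gives a 2×2
# matrix of pairs, which vec flattens into a vector that combine accepts
wide = combine(gdf, nrow, vec(valuecols(gdf) .=> [mean sum]))

# Splatting the same matrix passes each pair as a separate positional argument
wide_splat = combine(gdf, nrow, (valuecols(gdf) .=> [mean sum])...)
```

Both calls should give one row per group, with one column per column/function combination, named automatically from the source column and the function (for example x_mean and y_sum).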