Julia: Randomly sample N groups from a grouped DataFrame

Question

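I have a DataFrame of time-series records with item sales that I want to chart, but there is a lot of data, so I want to take a random sample of N items (keeping every record of each sampled item).

Here is simplified example data with three items, from which I want to randomly pick two: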

using DataFrames

df = DataFrame(time = [0, 1, 0, 1, 0, 1]
    , amt = [19.00, 11.00, 35.50, 32.50, 5.99, 5.99]
    , item = ["B001", "B001", "B020", "B020", "BX00", "BX00"])

6×3 DataFrame
│ Row │ time  │ amt     │ item   │
│     │ Int64 │ Float64 │ String │
├─────┼───────┼─────────┼────────┤
│ 1   │ 0     │ 19.0    │ B001   │
│ 2   │ 1     │ 11.0    │ B001   │
│ 3   │ 0     │ 35.5    │ B020   │
│ 4   │ 1     │ 32.5    │ B020   │
│ 5   │ 0     │ 5.99    │ BX00   │
│ 6   │ 1     │ 5.99    │ BX00   │

After some research I found a solution, but it does not seem like a simple way to express the problem.

# this attaches a random number to each group, sorts it, and then ranks each group:

using Pipe, StatsBase  # @pipe comes from Pipe.jl, denserank from StatsBase.jl

@pipe df |> groupby(_, :item) |>
     combine(_, :time, :amt, :item, :item => (x -> rand()) => :rando) |>
     sort(_, :rando) |>
     transform(_, :rando => denserank => :rnk_rnd)

6×5 DataFrame
│ Row │ item   │ time  │ amt     │ rando    │ rnk_rnd │
│     │ String │ Int64 │ Float64 │ Float64  │ Int64   │
├─────┼────────┼───────┼─────────┼──────────┼─────────┤
│ 1   │ B001   │ 0     │ 19.0    │ 0.449577 │ 1       │
│ 2   │ B001   │ 1     │ 11.0    │ 0.449577 │ 1       │
│ 3   │ BX00   │ 0     │ 5.99    │ 0.482569 │ 2       │
│ 4   │ BX00   │ 1     │ 5.99    │ 0.482569 │ 2       │
│ 5   │ B020   │ 0     │ 35.5    │ 0.612401 │ 3       │
│ 6   │ B020   │ 1     │ 32.5    │ 0.612401 │ 3       │


# I only need the original columns, and I'll filter for the first N=2 items from the re-constituted dataframe

@pipe ans |> filter(:rnk_rnd => <=(2), _)  |>
     select(_, :item, :time, :amt)

4×3 DataFrame
│ Row │ item   │ time  │ amt     │
│     │ String │ Int64 │ Float64 │
├─────┼────────┼───────┼─────────┤
│ 1   │ BX00   │ 0     │ 5.99    │
│ 2   │ BX00   │ 1     │ 5.99    │
│ 3   │ B001   │ 0     │ 19.0    │
│ 4   │ B001   │ 1     │ 11.0    │

# this is exactly what I'm looking for

Is there a more compact way to take a random sample from a grouped DataFrame?

- Merlin

2 Answers

I learned a more compact expression from an issue on the DataFrames.jl repository.

using Random  # provides shuffle

@pipe df |>
    groupby(_, :item) |>
    _[shuffle(1:end)] |>
    combine(_[1:2], :)

This yields the same kind of random selection of groups that I was after, returned as a DataFrame:

4×3 DataFrame
│ Row │ item   │ time  │ amt     │
│     │ String │ Int64 │ Float64 │
├─────┼────────┼───────┼─────────┤
│ 1   │ BX00   │ 0     │ 5.99    │
│ 2   │ BX00   │ 1     │ 5.99    │
│ 3   │ B020   │ 0     │ 35.5    │
│ 4   │ B020   │ 1     │ 32.5    │

I think there will eventually be a shuffle function for grouped DataFrames, if we all upvote the issue!

- Merlin

https://github.com/JuliaLang/Statistics.jl/pull/52 needs to be merged first. - Bogumił Kamiński

- Bogumił Kamiński · Accepted Answer

Another option is to use sample from StatsBase.jl:

@pipe df |>
      groupby(_, :item) |>
      _[sample(1:length(_), 2, replace=false)] |>
      DataFrame
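
If this is needed in several places, the same idea could be wrapped in a small helper. The following is only a sketch: `sample_groups` is a hypothetical name (it is not part of DataFrames.jl or StatsBase.jl), and the `rng` keyword with a seeded `MersenneTwister` is an assumption added here so that a run can be repeated.

```julia
using DataFrames, Random, StatsBase

# Hypothetical helper, not a library function:
# keep all rows belonging to `n` randomly chosen groups of `col`.
function sample_groups(df::AbstractDataFrame, col, n::Integer;
                       rng::AbstractRNG = Random.default_rng())
    gdf = groupby(df, col)
    # Draw n distinct group indices; this errors if n exceeds the number of groups
    idx = sample(rng, 1:length(gdf), n; replace = false)
    return DataFrame(gdf[idx])
end

df = DataFrame(time = [0, 1, 0, 1, 0, 1],
               amt  = [19.00, 11.00, 35.50, 32.50, 5.99, 5.99],
               item = ["B001", "B001", "B020", "B020", "BX00", "BX00"])

# The seed is arbitrary; it only makes the draw repeatable
picked = sample_groups(df, :item, 2; rng = MersenneTwister(1))
```

Passing the generator explicitly means a chart built from the sample can be regenerated later with the same groups, as long as the seed and package versions stay the same.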

If you are fine with keeping a random fraction q of the groups in your DataFrame (rather than a fixed number of them), it gets even easier:

@pipe df |>
      groupby(_, :item) |>
      combine(sdf -> rand() < q ? sdf : DataFrame(), _)
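
Note that with this approach each group is kept independently with probability q, so the number of groups in the result varies from run to run (anywhere from none to all of them, about q times the number of groups on average). A rough equivalent that avoids the grouped DataFrame altogether is sketched below; the value of `q` and the seed are assumptions chosen for illustration.

```julia
using DataFrames, Random

df = DataFrame(time = [0, 1, 0, 1, 0, 1],
               amt  = [19.00, 11.00, 35.50, 32.50, 5.99, 5.99],
               item = ["B001", "B001", "B020", "B020", "BX00", "BX00"])

q   = 0.5                     # assumed probability of keeping each item
rng = MersenneTwister(2021)   # assumed seed, only for repeatable runs

# Flip one biased coin per distinct item, not per row
kept = Set{String}(item for item in unique(df.item) if rand(rng) < q)

# Keep every row whose item was selected; the original row order is preserved
sampled = filter(:item => in(kept), df)
```

Because rows are filtered in place rather than reassembled from groups, the result keeps the row order of df, which may or may not matter for plotting.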