Sampling n random rows within groups that have different numbers of rows

Question


How do I draw n rows from each group when each group has a different number of rows?

df <- data.frame(matrix(rnorm(80), nrow=40))
df$color <-  rep(c("blue", "red", "yellow", "pink"), each=10)

I have tried:

library(dplyr)
outdat <- df %>% 
  group_by(color) %>% 
  sample_n(nrow(.), replace = TRUE)
outdat

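But this returns a data frame in which nrow(.) is the number of rows of df, not the number of rows of each group's subset. This SO post is close, but it defines a specific number of rows to draw. I need the size to be determined per group in dplyr.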

- Vedda

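It's not clear to me how many rows you want to sample from each group, or how that relates to each group's original row count. - Stuart Allen

How many rows do you want to sample from your df? If you want 10 rows, you can use sample_n(df, 10). - myincas

@Snubian I want to sample as many rows as each group of the grouped data contains. - Vedda

@mt1022 I tried n(), but it can't be used directly. Error message: This function should not be called directly. - Vedda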


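@RonakShah The output data should have the same dimensions as the original data, but because sampling is done with replacement, the observations may differ. - Vedda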


3 Answers

Another solution uses slice and sample.int. Reusing the data from www's answer:

outdat <- df %>%
  group_by(color) %>%
  slice(sample.int(n(), replace = TRUE))
outdat

            X1          X2  color
1   1.71506499 -1.12310858   blue
2   0.07050839  2.16895597   blue
3   0.46091621 -0.40288484   blue
4   0.07050839  2.16895597   blue
5   0.07050839  2.16895597   blue
6   1.71506499 -1.12310858   blue
7  -1.26506123 -0.46665535   blue
8   1.55870831 -1.26539635   blue
9   0.12928774  1.20796200   blue
10  1.55870831 -1.26539635   blue
11  0.55391765 -0.28477301   pink
12 -0.29507148 -2.30916888   pink
13 -0.30596266  0.18130348   pink
14 -0.06191171 -1.22071771   pink
15  0.55391765 -0.28477301   pink
16  0.55391765 -0.28477301   pink
17  0.87813349 -0.70920076   pink
18  0.68864025  1.02557137   pink
19 -0.30596266  0.18130348   pink
20  0.68864025  1.02557137   pink
21  0.70135590  0.12385424    red
22  0.11068272  1.36860228    red
23 -1.96661716  0.58461375    red
24  0.40077145 -0.04287046    red
25  1.78691314  1.51647060    red
26 -0.55584113 -0.22577099    red
27  0.40077145 -0.04287046    red
28  1.78691314  1.51647060    red
29 -0.47279141  0.21594157    red
30 -0.47279141  0.21594157    red
31 -1.02600445 -0.33320738 yellow
32 -0.72889123 -1.01857538 yellow
33  1.25381492  2.05008469 yellow
34  0.83778704  0.44820978 yellow
35  1.25381492  2.05008469 yellow
36 -0.62503927 -1.07179123 yellow
37 -0.62503927 -1.07179123 yellow
38  0.83778704  0.44820978 yellow
39 -0.21797491 -0.50232345 yellow
40 -1.68669331  0.30352864 yellow

- Lamia
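
The example data has 10 rows in every group, so it does not show what happens when group sizes differ. Below is a minimal sketch of the same slice / sample.int idea on data with unequal groups. The data frame df_uneven, its column x, and the group sizes are hypothetical and chosen only for illustration.

```r
library(dplyr)

set.seed(42)

# Hypothetical data with unequal group sizes: 3 blue, 5 red, 8 yellow rows
df_uneven <- data.frame(
  x = rnorm(16),
  color = rep(c("blue", "red", "yellow"), times = c(3, 5, 8))
)

# Within each group, n() is that group's row count, so sample.int()
# draws n() row positions with replacement from 1..n()
boot <- df_uneven %>%
  group_by(color) %>%
  slice(sample.int(n(), replace = TRUE)) %>%
  ungroup()

# Group sizes before and after resampling should be identical
count(df_uneven, color)
count(boot, color)
```

Because n() is evaluated inside slice() for each group separately, every group is resampled to its own size and the total row count stays the same.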


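A solution using the purrr package. It seems that sample_n cannot take n() as its size argument, possibly because that argument does not accept vectorized input. However, if we split the data frame into one data frame per color, we can apply sample_n together with nrow() to each piece.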

# Set seed for reproducibility
set.seed(123)

# Create example data frame
df <- data.frame(matrix(rnorm(80), nrow=40))
df$color <-  rep(c("blue", "red", "yellow", "pink"), each=10)

# Load packages
library(dplyr)
library(purrr)

outdat <- df %>%
  # Split the data frame by color
  split(.$color) %>%
  # Apply the sample_n function to all data frames
  map_dfr(~sample_n(., size = nrow(.), replace = TRUE))

outdat
#             X1          X2  color
# 1   1.71506499 -1.12310858   blue
# 2   0.07050839  2.16895597   blue
# 3   0.46091621 -0.40288484   blue
# 4   0.07050839  2.16895597   blue
# 5   0.07050839  2.16895597   blue
# 6   1.71506499 -1.12310858   blue
# 7  -1.26506123 -0.46665535   blue
# 8   1.55870831 -1.26539635   blue
# 9   0.12928774  1.20796200   blue
# 10  1.55870831 -1.26539635   blue
# 11  0.55391765 -0.28477301   pink
# 12 -0.29507148 -2.30916888   pink
# 13 -0.30596266  0.18130348   pink
# 14 -0.06191171 -1.22071771   pink
# 15  0.55391765 -0.28477301   pink
# 16  0.55391765 -0.28477301   pink
# 17  0.87813349 -0.70920076   pink
# 18  0.68864025  1.02557137   pink
# 19 -0.30596266  0.18130348   pink
# 20  0.68864025  1.02557137   pink
# 21  0.70135590  0.12385424    red
# 22  0.11068272  1.36860228    red
# 23 -1.96661716  0.58461375    red
# 24  0.40077145 -0.04287046    red
# 25  1.78691314  1.51647060    red
# 26 -0.55584113 -0.22577099    red
# 27  0.40077145 -0.04287046    red
# 28  1.78691314  1.51647060    red
# 29 -0.47279141  0.21594157    red
# 30 -0.47279141  0.21594157    red
# 31 -1.02600445 -0.33320738 yellow
# 32 -0.72889123 -1.01857538 yellow
# 33  1.25381492  2.05008469 yellow
# 34  0.83778704  0.44820978 yellow
# 35  1.25381492  2.05008469 yellow
# 36 -0.62503927 -1.07179123 yellow
# 37 -0.62503927 -1.07179123 yellow
# 38  0.83778704  0.44820978 yellow
# 39 -0.21797491 -0.50232345 yellow
# 40 -1.68669331  0.30352864 yellow

- www

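That was easy. Thanks! I'm surprised dplyr can't handle this. Is there a way to incorporate multiple factors? - Vedda

Thanks. I feel the same way. For this task, my first thought was also to pass n() to the size argument, but it just returns Error: This function should not be called directly. - www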


- mt1022 · Accepted Answer

Another option is to use sample_frac:

outdat <- df %>%
    group_by(color) %>%
    sample_frac(1, replace = TRUE)
outdat
# # A tibble: 40 x 3
# # Groups:   color [4]
#             X1          X2 color
#          <dbl>       <dbl> <chr>
#  1  0.69256186  0.97180252  blue
#  2  1.54384827 -0.20268802  blue
#  3 -1.20068240 -0.45402013  blue
#  4  2.63407877 -0.31644247  blue
#  5  1.20716737 -0.91380874  blue
#  6  0.01067475  1.02004679  blue
#  7  0.01067475  1.02004679  blue
#  8  1.79732108 -0.04072946  blue
#  9  0.01067475  1.02004679  blue
# 10  1.79732108 -0.04072946  blue
# # ... with 30 more rows

Also, use outdat %>% ungroup() to remove the grouping afterwards.
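
The same fraction-based idea can be sketched with slice_sample(), which superseded sample_n() and sample_frac() as of dplyr 1.0.0. The sketch below assumes that version or later. It also groups by two columns, which is one way to handle the "multiple factors" follow-up from the comments. The data frame df2 and its columns value and shape are hypothetical.

```r
library(dplyr)

set.seed(1)

# Hypothetical data: two grouping factors with unequal cell sizes
# (blue/a = 2, blue/b = 3, red/a = 4, red/b = 3)
df2 <- data.frame(
  value = rnorm(12),
  color = rep(c("blue", "red"), times = c(5, 7)),
  shape = c("a", "a", "b", "b", "b", "a", "a", "a", "a", "b", "b", "b")
)

# prop = 1 draws as many rows as each group contains;
# replace = TRUE makes it a within-group bootstrap resample
resampled <- df2 %>%
  group_by(color, shape) %>%
  slice_sample(prop = 1, replace = TRUE) %>%
  ungroup()

# Each color/shape cell keeps its original size
count(resampled, color, shape)
```

Using prop = 1 avoids having to compute each group's size explicitly, so the call stays the same no matter how many grouping columns are passed to group_by().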