Splitting a continuous variable into equal-sized groups

Question

71

I need to split a continuous variable into three groups of equal size.

Example data frame:

das <- data.frame(anim = 1:15,
                  wt = c(181,179,180.5,201,201.5,245,246.4,
                         189.3,301,354,369,205,199,394,231.3))

After splitting by the value of wt, I need these 3 classes under a new variable wt2:

> das 
   anim    wt wt2
1     1 181.0   1
2     2 179.0   1
3     3 180.5   1
4     4 201.0   2
5     5 201.5   2
6     6 245.0   2
7     7 246.4   3
8     8 189.3   1
9     9 301.0   3
10   10 354.0   3
11   11 369.0   3
12   12 205.0   2
13   13 199.0   1
14   14 394.0   3
15   15 231.3   2

This will be applied to a large data set.

- baz

4

See e.g. https://dev59.com/z2025IYBdhLWcg3wl3Em, https://dev59.com/5XE85IYBdhLWcg3wtV1T, https://dev59.com/0G035IYBdhLWcg3wMM8G, https://dev59.com/0VTTa4cB1Zd3GeqPvs25, https://dev59.com/S1bTa4cB1Zd3GeqP-Wie, http://stackoverflow.com/questions/3288361/create-size-categories-without-nested-ifelse-in-r, ... - Joris Meys

1

Are you sure @Ben Bolker's answer isn't the correct one? You specified that you want groups of equal size. - pir

11 Answers

69

Try this:

split(das, cut(das$anim, 3))

If you want to split based on the value of wt, then:

library(Hmisc) # cut2
split(das, cut2(das$wt, g=3))

Either way, you can do this by combining cut, cut2 and split.

UPDATED

If you want a group index as an additional column, then:

das$group <- cut(das$anim, 3)

If the column should be indexed as 1, 2, ..., then:

das$group <- as.numeric(cut(das$anim, 3))

Updated again

Try this:

> das$wt2 <- as.numeric(cut2(das$wt, g=3))
> das
   anim    wt wt2
1     1 181.0   1
2     2 179.0   1
3     3 180.5   1
4     4 201.0   2
5     5 201.5   2
6     6 245.0   2
7     7 246.4   3
8     8 189.3   1
9     9 301.0   3
10   10 354.0   3
11   11 369.0   3
12   12 205.0   2
13   13 199.0   1
14   14 394.0   3
15   15 231.3   2

- kohske

3

You can drop as.numeric and use cut(das$anim, 3, labels=FALSE). That call splits the anim column of das into 3 equal-width intervals and returns the number of the interval each value falls in. - Ben

2

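This should be updated to make clear that it differs from @Ben's answer below. I used this code by mistake, thinking it would distribute the observations evenly. - pir

Are you sure the Hmisc::cut2() solution doesn't work? Can you give a small example where it fails? - Ben Bolker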

3

I'm puzzled as to why this answer was accepted, since the question explicitly asks for "equal-sized groups", which cut() cannot deliver. - ForceLeft415

11

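If you want to split the data into three evenly populated groups, the answer is the same as Ben Bolker's: use ggplot2::cut_number(). For completeness, here are three ways of converting a continuous variable to a categorical one (binning).

cut_number(): makes n groups with (approximately) equal numbers of observations
cut_interval(): makes n groups with equal range
cut_width(): makes groups of width w

My go-to is cut_number(), because it uses evenly spaced quantiles to bin the observations. Below is an example with skewed data.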

library(tidyverse)

skewed_tbl <- tibble(
    counts = c(1:100, 1:50, 1:20, rep(1:10, 3), 
               rep(1:5, 5), rep(1:2, 10), rep(1, 20))
    ) %>%
    mutate(
        counts_cut_number   = cut_number(counts, n = 4),
        counts_cut_interval = cut_interval(counts, n = 4),
        counts_cut_width    = cut_width(counts, width = 25)
        ) 

# Data
skewed_tbl
#> # A tibble: 265 x 4
#>    counts counts_cut_number counts_cut_interval counts_cut_width
#>     <dbl> <fct>             <fct>               <fct>           
#>  1      1 [1,3]             [1,25.8]            [-12.5,12.5]    
#>  2      2 [1,3]             [1,25.8]            [-12.5,12.5]    
#>  3      3 [1,3]             [1,25.8]            [-12.5,12.5]    
#>  4      4 (3,13]            [1,25.8]            [-12.5,12.5]    
#>  5      5 (3,13]            [1,25.8]            [-12.5,12.5]    
#>  6      6 (3,13]            [1,25.8]            [-12.5,12.5]    
#>  7      7 (3,13]            [1,25.8]            [-12.5,12.5]    
#>  8      8 (3,13]            [1,25.8]            [-12.5,12.5]    
#>  9      9 (3,13]            [1,25.8]            [-12.5,12.5]    
#> 10     10 (3,13]            [1,25.8]            [-12.5,12.5]    
#> # ... with 255 more rows

summary(skewed_tbl$counts)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>    1.00    3.00   13.00   25.75   42.00  100.00

# Histogram showing skew
skewed_tbl %>%
    ggplot(aes(counts)) +
    geom_histogram(bins = 30)

# cut_number() evenly distributes observations into bins by quantile
skewed_tbl %>%
    ggplot(aes(counts_cut_number)) +
    geom_bar()

# cut_interval() evenly splits the interval across the range
skewed_tbl %>%
    ggplot(aes(counts_cut_interval)) +
    geom_bar()

# cut_width() uses the width = 25 to create bins that are 25 in width
skewed_tbl %>%
    ggplot(aes(counts_cut_width)) +
    geom_bar()

Created on 2018-11-01 by the reprex package (v0.2.1)

- Matt Dancho

10

Here is another solution, using the bin_data() function from the mltools package.

library(mltools)

# Resulting bins have an equal number of observations in each group
das[, "wt2"] <- bin_data(das$wt, bins=3, binType = "quantile")

# Resulting bins are equally spaced from min to max
das[, "wt3"] <- bin_data(das$wt, bins=3, binType = "explicit")

# Or if you'd rather define the bins yourself
das[, "wt4"] <- bin_data(das$wt, bins=c(-Inf, 250, 322, Inf), binType = "explicit")

das
   anim    wt                                  wt2                                  wt3         wt4
1     1 181.0              [179, 200.333333333333)              [179, 250.666666666667) [-Inf, 250)
2     2 179.0              [179, 200.333333333333)              [179, 250.666666666667) [-Inf, 250)
3     3 180.5              [179, 200.333333333333)              [179, 250.666666666667) [-Inf, 250)
4     4 201.0 [200.333333333333, 245.466666666667)              [179, 250.666666666667) [-Inf, 250)
5     5 201.5 [200.333333333333, 245.466666666667)              [179, 250.666666666667) [-Inf, 250)
6     6 245.0 [200.333333333333, 245.466666666667)              [179, 250.666666666667) [-Inf, 250)
7     7 246.4              [245.466666666667, 394]              [179, 250.666666666667) [-Inf, 250)
8     8 189.3              [179, 200.333333333333)              [179, 250.666666666667) [-Inf, 250)
9     9 301.0              [245.466666666667, 394] [250.666666666667, 322.333333333333)  [250, 322)
10   10 354.0              [245.466666666667, 394]              [322.333333333333, 394]  [322, Inf]
11   11 369.0              [245.466666666667, 394]              [322.333333333333, 394]  [322, Inf]
12   12 205.0 [200.333333333333, 245.466666666667)              [179, 250.666666666667) [-Inf, 250)
13   13 199.0              [179, 200.333333333333)              [179, 250.666666666667) [-Inf, 250)
14   14 394.0              [245.466666666667, 394]              [322.333333333333, 394]  [322, Inf]
15   15 231.3 [200.333333333333, 245.466666666667)              [179, 250.666666666667) [-Inf, 250)

- Ben

8

An alternative that does not use cut2:

das$wt2 <- as.factor( as.numeric( cut(das$wt,3)))

Or:

das$wt2 <- as.factor( cut(das$wt,3, labels=F))

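As @ben-bolker points out, this splits into bins of equal width rather than equal counts. I think equal occupancy can be approximated using quantiles.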

x = rnorm(10)
x
 [1] -0.1074316  0.6690681 -1.7168853  0.5144931  1.6460280  0.7014368
 [7]  1.1170587 -0.8503069  0.4462932 -0.1089427
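bin <- 3  # 3 for thirds, 4 for quarters, 100 for percentiles, etc.
# quantile() takes `probs` (not `breaks`); bin + 1 cut points give `bin` groups
xx <- cut(x, quantile(x, probs = seq(0, 1, length.out = bin + 1)), labels = FALSE, include.lowest = TRUE)
table(xx)
1 2 3
4 3 3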

- pedrosaurio

7

I think this splits the data into bins of equal width rather than bins with equal counts? - Ben Bolker

7

ntile from dplyr now does this, but it behaves oddly with NAs.

I have used similar code in the function below, which works in base R and is equivalent to the cut2 solution above:

ntile_ <- function(x, n) {
    b <- x[!is.na(x)]
    q <- floor((n * (rank(b, ties.method = "first") - 1)/length(b)) + 1)
    d <- rep(NA, length(x))
    d[!is.na(x)] <- q
    return(d)
}
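As a quick sanity check, the helper can be run on the wt values from the question, plus one appended NA to show that missing values stay NA. This is only a sketch: the definition is repeated so the snippet runs on its own, and the wt_na vector is made up for illustration.

```r
# Same helper as above, repeated so this snippet is self-contained
ntile_ <- function(x, n) {
    b <- x[!is.na(x)]
    q <- floor((n * (rank(b, ties.method = "first") - 1) / length(b)) + 1)
    d <- rep(NA, length(x))
    d[!is.na(x)] <- q
    d
}

# wt values from the question
wt <- c(181, 179, 180.5, 201, 201.5, 245, 246.4,
        189.3, 301, 354, 369, 205, 199, 394, 231.3)

groups <- ntile_(wt, 3)
groups
table(groups)   # five observations in each of the three groups

# Illustrative only: the NA is passed through and the other ranks are unchanged
wt_na <- c(wt, NA)
groups_na <- ntile_(wt_na, 3)
groups_na
```

Because ties are broken with ties.method = "first", equal values can land in different groups when they straddle a boundary; that is the price of getting group sizes that differ by at most one.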

- Dan Lewer

5

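When it is not given explicit break points, cut divides the values into bins of equal width, which in general will not contain equal numbers of elements: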

x <- c(1:4,10)
lengths(split(x, cut(x, 2)))
# (0.991,5.5]    (5.5,10] 
#           4           1

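Hmisc::cut2 and ggplot2::cut_number use quantiles, and will usually create groups of the same size (in terms of number of elements) if the data is well spread and of decent size, but this is not always the case. mltools::bin_data can give different results but is also based on quantiles.

These functions do not always give neat results when the data contains only a small number of distinct values: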

x <- rep(c(1:20),c(15, 7, 10, 3, 9, 3, 4, 9, 3, 2,
                   23, 2, 4, 1, 1, 7, 18, 37, 6, 2))

table(x)
# x
#  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 
# 15  7 10  3  9  3  4  9  3  2 23  2  4  1  1  7 18 37  6  2   

table(Hmisc::cut2(x, g=4))
# [ 1, 6) [ 6,12) [12,19) [19,20] 
#      44      44      70       8

table(ggplot2::cut_number(x, 4))
# [1,5]  (5,11] (11,18] (18,20] 
#    44      44      70       8

table(mltools::bin_data(x, bins=4, binType = "quantile"))
# [1, 5)  [5, 11) [11, 18) [18, 20] 
#     35       30       56       45

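It is not clear that the optimal solution has been found here.

Which binning approach is best is a subjective matter, but one reasonable way to approach it is to look for the bins that minimize the variance around the expected bin size.

The function `smart_cut` from (my) package cutr offers this feature. It is computationally heavy, though, and should be reserved for cases where the cut points and unique values are few (which is usually when it matters).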

# devtools::install_github("moodymudskipper/cutr")
table(cutr::smart_cut(x, list(4, "balanced"), "g"))
# [1,6)  [6,12) [12,18) [18,20] 
# 44      44      33      45

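We can see that the groups are much better balanced.

In the call, "balanced" can actually be replaced by a custom function to optimize or restrict the bins as desired, if the variance-based method is not enough.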
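To put a number on "better balanced", one option is to sum the squared deviations of the bin counts from the ideal count (total observations divided by number of bins). The sketch below is only one reading of "variance around the expected bin size", not cutr's actual implementation; the helper name imbalance is hypothetical, and the break points are read off the outputs shown above and fed to base R's cut so that no extra package is needed.

```r
# Same test vector as in the answer
x <- rep(1:20, c(15, 7, 10, 3, 9, 3, 4, 9, 3, 2,
                 23, 2, 4, 1, 1, 7, 18, 37, 6, 2))

# Hypothetical helper: sum of squared deviations of the bin counts
# from the ideal count length(x) / number_of_bins
imbalance <- function(x, breaks) {
    bins <- cut(x, breaks, right = FALSE, include.lowest = TRUE)
    counts <- as.vector(table(bins))
    sum((counts - length(x) / length(counts))^2)
}

# Break points taken from the printed bins of each function
imbalance(x, c(1, 6, 12, 19, 20))   # cut2 / cut_number bins: 1947
imbalance(x, c(1, 5, 11, 18, 20))   # bin_data bins:           397
imbalance(x, c(1, 6, 12, 18, 20))   # smart_cut bins:           97
```

With 166 observations and 4 bins the ideal count is 41.5, so lower scores mean the bins sit closer to that target.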

- moodymudskipper

1

The equal_freq function from the funModeling package takes a vector and the number of bins, and bins by equal frequency:

das <- data.frame(anim=1:15,
                  wt=c(181,179,180.5,201,201.5,245,246.4,
                       189.3,301,354,369,205,199,394,231.3))

das$wt_bin=funModeling::equal_freq(das$wt, 3)

table(das$wt_bin)

#[179,201) [201,246) [246,394] 
#        5         5         5

- Pablo Casas

1

You can also use the bin function from the OneR package with method = "content" for this:

library(OneR)
das$wt_2 <- as.numeric(bin(das$wt, nbins = 3, method = "content"))
das
##    anim    wt wt_2
## 1     1 181.0    1
## 2     2 179.0    1
## 3     3 180.5    1
## 4     4 201.0    2
## 5     5 201.5    2
## 6     6 245.0    2
## 7     7 246.4    3
## 8     8 189.3    1
## 9     9 301.0    3
## 10   10 354.0    3
## 11   11 369.0    3
## 12   12 205.0    2
## 13   13 199.0    1
## 14   14 394.0    3
## 15   15 231.3    2

- vonjd

0

Without any extra package, where 3 is the number of groups:

> findInterval(das$wt, unique(quantile(das$wt, seq(0, 1, length.out = 3 + 1))), rightmost.closed = TRUE)
 [1] 1 1 1 2 2 2 3 1 3 3 3 2 1 3 2

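You can speed up the quantile computation by using a representative sample of the values of interest. Double check the documentation of the findInterval function.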
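A minimal sketch of the sampling idea, under stated assumptions: the data (big), the sample size and the seed are all made up for illustration, and the outer break points are widened to -Inf and Inf because a subsample will rarely contain the true minimum and maximum.

```r
set.seed(42)        # only so the illustration is repeatable
big <- rnorm(1e6)   # stand-in for a large column
n_groups <- 3

# Estimate the cut points from a random subsample instead of the full vector
smp <- sample(big, 1e4)
breaks <- quantile(smp, seq(0, 1, length.out = n_groups + 1), names = FALSE)

# Open up the outer breaks so values beyond the sample range still get a group
breaks[1] <- -Inf
breaks[length(breaks)] <- Inf

grp <- findInterval(big, breaks, rightmost.closed = TRUE)
prop <- as.vector(table(grp)) / length(big)
prop                # close to 1/3 each, but only approximately
```

The group sizes are then only approximately equal, since the cut points are estimates; whether that is acceptable depends on how the groups are used.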

- SamGG

- Ben Bolker · Accepted Answer

Or see cut_number from the ggplot2 package, e.g.:

das$wt_2 <- as.numeric(cut_number(das$wt,3))

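Note that cut(..., 3) divides the range of the original data into three ranges of equal length; it does not necessarily result in the same number of observations per group if the data are unevenly distributed (you can replicate what cut_number does by using quantile appropriately, but it is a nice convenience function). On the other hand, Hmisc::cut2() with the g= argument does split by quantiles, so it is more or less equivalent to ggplot2::cut_number. I would have thought that something like cut_number would have made its way into dplyr by now, but as far as I can tell it hasn't.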
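The difference can be seen on the question's data using base R only. The following is a sketch of the "use quantile appropriately" idea, not the internals of cut_number; the variable names equal_width and brks are just for illustration.

```r
das <- data.frame(anim = 1:15,
                  wt = c(181, 179, 180.5, 201, 201.5, 245, 246.4,
                         189.3, 301, 354, 369, 205, 199, 394, 231.3))

# Equal-width bins: the range of wt is cut into three intervals of equal length
equal_width <- cut(das$wt, 3, labels = FALSE)
table(equal_width)   # 11, 1 and 3 observations

# Equal-count bins: use the tertiles of wt as break points
brks <- quantile(das$wt, probs = seq(0, 1, length.out = 4))
das$wt2 <- cut(das$wt, breaks = brks, include.lowest = TRUE, labels = FALSE)
table(das$wt2)       # 5, 5 and 5 observations
```

One caveat: if the data have many ties, two quantiles can coincide and cut will stop with an error about non-unique breaks, so heavily tied data need extra handling.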