Inline group-by for column mutations in dplyr

3

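I am trying to compute columns under different groupings, but I am not sure that the way I am using group_by is the best approach. Is there a way to do the grouping inline?

I know this can be done with the data.table package, whose syntax has the form DT[i, j, by].

But this is only a small part of a larger piece of code that works well with the tidyverse, so I would rather not move away from it.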
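For readers unfamiliar with it, the data.table idiom mentioned above looks roughly like the sketch below. The toy table and its values are made up for illustration; only the column names follow the sample data in this question.

```r
library(data.table)

# Hypothetical toy data, just to show the DT[i, j, by] shape
DT <- data.table(
  state     = c("OH", "OH", "IL", "IL"),
  county    = c("A-OH", "B-OH", "A-IL", "A-IL"),
  customers = c(10, 20, 30, 40),
  sales     = c(100, 200, 300, 400)
)

# `:=` adds a column by reference; `by` applies only to this one call
DT[, customerInState  := sum(customers), by = state]
DT[, customerInCounty := sum(customers), by = county]
```

Each call carries its own `by`, which is the "inline" grouping being asked about.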

## Creating Sample Data Frame
library(tidyverse)  # provides %>%, str_c() and bind_cols()

state <- rep(c("OH", "IL", "IN", "PA", "KY"), 10)
county <- sample(LETTERS[1:5], 50, replace = TRUE) %>% str_c(state, sep = "-")
# sample() draws from the given range; sample.int() expects a single number n
customers <- sample(50:100, 50)
sales <- sample(500:5000, 50)

df <- bind_cols(data.frame(state, county, customers, sales))

## workflow

df2 <- df %>%
  group_by(state) %>% 
  mutate(customerInState = sum(customers),
         saleInState = sum(sales)) %>% 
  ungroup %>% 
  group_by(county) %>% 
  mutate(customerInCounty = sum(customers),
         saleInCounty = sum(sales)) %>% 
  ungroup %>% 
  mutate(salePerCountyPercent  = saleInCounty/saleInState,
         customerPerCountyPercent = customerInCounty/customerInState) %>% 
  group_by(state) %>% 
  mutate(minSale = min(salePerCountyPercent)) %>%
  ungroup

I would like my code to look something like this:

df3 <- df %>%
  mutate(customerInState = sum(customers, by = state),
         saleInState = sum(sales, by = state),
         customerInCounty = sum(customers, by = county),
         saleInCounty = sum(sales, by = county),
         salePerCountyPercent  = saleInCounty/saleInState,
         customerPerCountyPercent = customerInCounty/customerInState,
         minSale = min(salePerCountyPercent, by = state))

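The code runs without error, but I know the output is incorrect.

I understand that by rearranging the mutate calls I could probably get what I need with fewer group_by calls. But the question is whether there is a way to do an inline group by in dplyr.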


3
More readable: df <- data.frame(state, county, customers, sales). - Rui Barradas
1
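What do you mean by "group_by in line"? Please describe what you want to do. Your code seems to work fine, except that the ungroup calls are not needed: when you group, the previous grouping is dropped automatically. - January
Edited the question to make it more specific. - ok1more
This looks like roll_up/cube as implemented in data.table, but according to the GitHub issue here, it may not be available in dplyr. - akrun
3 Answers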

5

You can write a wrapper that does what you want. This particular solution works if you have a single grouping variable. Good luck!

library(tidyverse)

mutate_by <- function(.data, group, ...) {

  group_by(.data, !!enquo(group)) %>%
    mutate(...) %>%
    ungroup

}

df1 <- df %>%
  mutate_by(state, 
            customerInState = sum(customers),
            saleInState = sum(sales)) %>%
  mutate_by(county,
            customerInCounty = sum(customers),
            saleInCounty = sum(sales)) %>%
  mutate(salePerCountyPercent  = saleInCounty/saleInState,
         customerPerCountyPercent = customerInCounty/customerInState) %>% 
  mutate_by(state,
            minSale = min(salePerCountyPercent))

identical(df2, df1)
[1] TRUE

Edit: or, more concisely and closer to your code:

df %>%
  mutate_by(customerInState = sum(customers),
            saleInState = sum(sales), group = state) %>%
  mutate_by(customerInCounty = sum(customers),
            saleInCounty = sum(sales), group = county) %>%
  mutate(salePerCountyPercent  = saleInCounty/saleInState,
         customerPerCountyPercent = customerInCounty/customerInState) %>% 
  mutate_by(minSale = min(salePerCountyPercent), group = state)
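If dplyr 1.1.0 or later is available, the `.by` argument of mutate() gives a similar per-call grouping without a wrapper. The sketch below assumes that version; the toy data is hypothetical and only mirrors the column names from the question. Note that `.by` applies to a whole mutate() call, not to individual expressions inside it.

```r
library(dplyr)  # assumes dplyr >= 1.1.0, where mutate() gained `.by`

# Hypothetical toy data standing in for the question's df
toy <- tibble(
  state     = c("OH", "OH", "OH", "IL"),
  county    = c("A-OH", "A-OH", "B-OH", "A-IL"),
  customers = c(10, 20, 30, 40),
  sales     = c(100, 200, 300, 400)
)

res <- toy %>%
  mutate(customerInState = sum(customers),
         saleInState = sum(sales),
         .by = state) %>%
  mutate(customerInCounty = sum(customers),
         saleInCounty = sum(sales),
         .by = county) %>%
  mutate(salePerCountyPercent = saleInCounty / saleInState) %>%
  # grouping lasts for this call only; the result comes back ungrouped
  mutate(minSale = min(salePerCountyPercent), .by = state)
```

Several grouping columns can be passed as `.by = c(state, county)`, which also covers the multi-variable case that the wrapper above does not handle.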

1
You can rearrange the code to make it more concise. Group by county first, then by state. - January

3
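Ah, you mean the syntax style. Sorry, that is not how the tidyverse works. If you want to use the tidyverse, you are better off using pipes. However: (i) once you have grouped something, it stays grouped until you group again by another column; (ii) you do not need to ungroup if you are going to regroup. So we can simplify your code: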
df3 <- df %>% 
  group_by(county) %>% 
  mutate(customerInCounty = sum(customers), 
         saleInCounty = sum(sales)) %>% 
  group_by(state) %>% 
  mutate(customerInState = sum(customers),
         saleInState = sum(sales),
         salePerCountyPercent  = saleInCounty/saleInState,
         customerPerCountyPercent = customerInCounty/customerInState) %>% 
  mutate(minSale = min(salePerCountyPercent)) %>%
  ungroup

Two mutates and two group_bys.

Now: the column order is different, but we can easily test that the data are the same:

identical((df3 %>% select(colnames(df2))), (df2)) # TRUE
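Points (i) and (ii) can be checked directly with group_vars(). This is a minimal sketch on a made-up three-row tibble; the `.add` argument assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(state  = c("OH", "OH", "IL"),
              county = c("A-OH", "B-OH", "A-IL"))

# A second group_by() replaces the first grouping by default (.add = FALSE)
replaced <- toy %>% group_by(state) %>% group_by(county) %>% group_vars()

# Passing .add = TRUE keeps the earlier grouping as well
added <- toy %>% group_by(state) %>% group_by(county, .add = TRUE) %>% group_vars()
```

Here `replaced` holds only "county", while `added` holds both "state" and "county".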

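(iii) I know nothing about the administrative structure of the US, but I assume counties are nested within states, right? In that case, how about using summarise? Do you need to keep the data for every individual sale, or is it enough to produce statistics per county and/or per state?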
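If per-county statistics are enough, the summarise route might look like the sketch below. It assumes counties are nested within states and dplyr 1.0.0 or later (for the `.groups` argument); the toy data is hypothetical.

```r
library(dplyr)

# Hypothetical toy data standing in for the question's df
toy <- tibble(
  state  = c("OH", "OH", "OH", "IL"),
  county = c("A-OH", "A-OH", "B-OH", "A-IL"),
  sales  = c(100, 200, 300, 400)
)

county_stats <- toy %>%
  group_by(state, county) %>%
  # "drop_last" peels off county, so the result is still grouped by state
  summarise(saleInCounty = sum(sales), .groups = "drop_last") %>%
  mutate(saleInState = sum(saleInCounty),
         salePerCountyPercent = saleInCounty / saleInState,
         minSale = min(salePerCountyPercent)) %>%
  ungroup()
```

The result has one row per state and county pair instead of one row per sale, so it only fits when the individual rows are no longer needed.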

3
You can do it in two steps: create two data sets, then join them together with left_join.
library(dplyr)

df2 <- df %>%
  group_by(state) %>% 
  summarise(customerInState = sum(customers),
            saleInState = sum(sales))

df3 <- df %>%
  group_by(state, county) %>%
  summarise(customerInCounty = sum(customers),
            saleInCounty = sum(sales))

df2 <- left_join(df2, df3, by = "state") %>%  # state is the only shared column
  mutate(salePerCountyPercent  = saleInCounty/saleInState,
         customerPerCountyPercent = customerInCounty/customerInState) %>% 
  group_by(state) %>% 
  mutate(minSale = min(salePerCountyPercent))

Final cleanup.
rm(df3)
