Using inline group_by for column mutation in dplyr

Question

3

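I'm trying to create groupings over different columns, and I'm not sure that the way I'm using group_by is the best approach. Is there a way to do the grouping inline?

I know this can be done with the data.table package, whose syntax has the form DT[i, j, by].

But this is only a small part of a larger piece of code that works well with the tidyverse, so I'd rather not stray from it.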

## Creating Sample Data Frame
library(tidyverse)  # provides %>%, str_c() and the dplyr verbs used below

state <- rep(c("OH", "IL", "IN", "PA", "KY"), 10)
county <- sample(LETTERS[1:5], 50, replace = TRUE) %>% str_c(state, sep = "-")
# sample(), not sample.int(): draw 50 values from the given range
customers <- sample(50:100, 50)
sales <- sample(500:5000, 50)

df <- data.frame(state, county, customers, sales)

## workflow

df2 <- df %>%
  group_by(state) %>% 
  mutate(customerInState = sum(customers),
         saleInState = sum(sales)) %>% 
  ungroup %>% 
  group_by(county) %>% 
  mutate(customerInCounty = sum(customers),
         saleInCounty = sum(sales)) %>% 
  ungroup %>% 
  mutate(salePerCountyPercent  = saleInCounty/saleInState,
         customerPerCountyPercent = customerInCounty/customerInState) %>% 
  group_by(state) %>% 
  mutate(minSale = min(salePerCountyPercent)) %>%
  ungroup

I'd like my code to look like this:

df3 <- df %>%
  mutate(customerInState = sum(customers, by = state),
         saleInState = sum(sales, by = state),
         customerInCounty = sum(customers, by = county),
         saleInCounty = sum(sales, by = county),
         salePerCountyPercent  = saleInCounty/saleInState,
         customerPerCountyPercent = customerInCounty/customerInState,
         minSale = min(salePerCountyPercent, by = state))

The program doesn't throw an error, but I know the output is incorrect.
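A note on why an attempt like the one above can run without doing any grouping: base R's sum() and min() have no `by` argument. Their signature is `(..., na.rm = FALSE)`, so any other argument, named or not, is simply treated as more values to combine. A minimal sketch with made-up vectors (`x` and `g` are illustrative names only):

```r
# sum(..., na.rm = FALSE): `by = g` is not a grouping instruction,
# it just lands in `...` and gets added to the total.
x <- c(1, 2, 3)
g <- c(10, 20, 30)   # hypothetical stand-in for a numeric "grouping" column

total_plain <- sum(x)          # 1 + 2 + 3 = 6
total_by    <- sum(x, by = g)  # 6 + 10 + 20 + 30 = 66, no grouping happens
```

So every row receives the same overall total rather than a per-group one.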

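I understand that by rearranging the mutates I could probably get what I need with fewer group_bys. But the question is whether there is a way to do the grouping (group by) inline in dplyr.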

- ok1more

3

More readable: df <- data.frame(state, county, customers, sales) puts state, county, customers and sales into one data frame. - Rui Barradas

1

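What do you mean by "group_by in line"? Please describe what you want to do. Your code seems to work fine, except that the ungroup calls are not needed: when you group, the previous grouping is dropped automatically. - January

Edited the question to make it more specific. - ok1more

This looks like the rollup/cube implemented in data.table, but according to the GitHub issue here, it may not be available in dplyr. - akrun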

3 Answers

3

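Ah, you mean the syntax style. Sorry, that's not how the tidyverse does it. If you want to use the tidyverse, you're better off using pipes. However: (i) once you group by something, it stays grouped until you group again by another column; (ii) there is no need to ungroup if you are going to regroup anyway. So we can simplify your code: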

df3 <- df %>% 
  group_by(county) %>% 
  mutate(customerInCounty = sum(customers), 
         saleInCounty = sum(sales)) %>% 
  group_by(state) %>% 
  mutate(customerInState = sum(customers),
         saleInState = sum(sales),
         salePerCountyPercent  = saleInCounty/saleInState,
         customerPerCountyPercent = customerInCounty/customerInState) %>% 
  mutate(minSale = min(salePerCountyPercent)) %>%
  ungroup

Two mutates and two group_bys.

Now: the column order is different, but we can easily check that the data are the same:

identical((df3 %>% select(colnames(df2))), (df2)) # TRUE

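(iii) I know nothing about US administrative structure, but I assume counties are nested within states, right? In that case, how about using summarise? Do you need to keep the data for every individual sale, or is it enough to produce statistics per county and/or per state?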

- January

3

You can do it in two steps, creating two data sets and then joining them with left_join.

library(dplyr)

df2 <- df %>%
  group_by(state) %>% 
  summarise(customerInState = sum(customers),
         saleInState = sum(sales))

df3 <- df %>%
  group_by(state, county) %>%
  summarise(customerInCounty = sum(customers),
            saleInCounty = sum(sales))

df2 <- left_join(df2, df3, by = "state") %>%  # state is the only shared column
  mutate(salePerCountyPercent  = saleInCounty/saleInState,
         customerPerCountyPercent = customerInCounty/customerInState) %>% 
  group_by(state) %>% 
  mutate(minSale = min(salePerCountyPercent))

Final clean up.

rm(df3)

- Rui Barradas

- Andrew · Accepted Answer

You can create a wrapper that does what you want. This particular solution works if you have only one grouping variable. Good luck!

library(tidyverse)

mutate_by <- function(.data, group, ...) {

  group_by(.data, !!enquo(group)) %>%
    mutate(...) %>%
    ungroup

}

df1 <- df %>%
  mutate_by(state, 
            customerInState = sum(customers),
            saleInState = sum(sales)) %>%
  mutate_by(county,
            customerInCounty = sum(customers),
            saleInCounty = sum(sales)) %>%
  mutate(salePerCountyPercent  = saleInCounty/saleInState,
         customerPerCountyPercent = customerInCounty/customerInState) %>% 
  mutate_by(state,
            minSale = min(salePerCountyPercent))

identical(df2, df1)
[1] TRUE
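If more than one grouping column is needed, one possible variation is to pass the grouping columns as a character vector. This is only a sketch and not part of the original answer: `mutate_by_cols`, its `groups` parameter and the `toy` data are hypothetical names, and it assumes dplyr >= 1.0.0 for across() inside group_by().

```r
library(dplyr)

# Hypothetical variant of the wrapper: `groups` is a character vector
# naming one or more columns to group by.
mutate_by_cols <- function(.data, groups, ...) {
  .data %>%
    group_by(across(all_of(groups))) %>%
    mutate(...) %>%
    ungroup()
}

# Small made-up data set, for illustration only
toy <- data.frame(
  state  = c("OH", "OH", "OH", "IL"),
  county = c("A", "A", "B", "A"),
  sales  = c(1, 2, 3, 4)
)

# Total sales per (state, county) pair, repeated on every row of the pair
out <- toy %>%
  mutate_by_cols(c("state", "county"), saleInCounty = sum(sales))
```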

Edit: or, more concise and closer to your code:

df %>%
  mutate_by(customerInState = sum(customers),
            saleInState = sum(sales), group = state) %>%
  mutate_by(customerInCounty = sum(customers),
            saleInCounty = sum(sales), group = county) %>%
  mutate(salePerCountyPercent  = saleInCounty/saleInState,
         customerPerCountyPercent = customerInCounty/customerInState) %>% 
  mutate_by(minSale = min(salePerCountyPercent), group = state)
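For readers on a newer dplyr: since version 1.1.0, mutate() (along with summarise() and some other verbs) accepts a `.by` argument that groups for that one call only, which comes close to the inline syntax asked about. A minimal sketch, assuming dplyr >= 1.1.0 and a small made-up data frame rather than the random sample from the question:

```r
library(dplyr)  # assumes dplyr >= 1.1.0 for the `.by` argument

# Small made-up data set, for illustration only
df <- data.frame(
  state     = c("OH", "OH", "OH", "IL", "IL"),
  county    = c("A-OH", "A-OH", "B-OH", "A-IL", "B-IL"),
  customers = c(10, 20, 30, 40, 50),
  sales     = c(100, 200, 300, 400, 500)
)

res <- df %>%
  # per-state totals
  mutate(customerInState = sum(customers),
         saleInState = sum(sales),
         .by = state) %>%
  # per-county totals
  mutate(customerInCounty = sum(customers),
         saleInCounty = sum(sales),
         .by = county) %>%
  # row-wise ratios, no grouping needed
  mutate(salePerCountyPercent = saleInCounty / saleInState,
         customerPerCountyPercent = customerInCounty / customerInState) %>%
  # smallest county share within each state
  mutate(minSale = min(salePerCountyPercent), .by = state)
```

Because `.by` applies only to the call it appears in, the result comes back ungrouped and no ungroup step is needed.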