How to normalize subgroups in a grouped data frame in R

Question

8

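I have a data frame with two numeric variables, fatcontent and saltcontent, and two factor variables, cond and spice, which describe the different treatments. Each measurement of the numeric variables was taken twice.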

a <- data.frame(cond = rep(c("uncooked", "fried", "steamed", "baked", "grilled"),
                       each = 2, times = 3),
                spice = rep(c("none", "chilli", "basil"), each = 10),
                fatcontent = c(4, 5, 6828, 7530, 6910, 7132, 5885, 613, 2845, 2867,
                               25, 18, 2385, 33227, 4233, 4023, 953, 1025, 4465, 5016,
                               5, 5, 10235, 12545, 5511, 5111, 596, 585, 4012, 3633),
                saltcontent = c(2, 5, 4733, 5500, 5724, 15885, 14885, 217, 193, 148,
                                6, 4, 26738, 24738, 22738, 23738, 267, 256, 1121, 1558,
                                1, 1, 21738, 20738, 26738, 27738, 195, 202, 129, 131)
                )

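Now I would like to normalize the numeric variables within each spice group by the mean of the uncooked condition for that spice (normalizing here means dividing by that mean).
For example, for a$spice == "none":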

       cond  spice fatcontent saltcontent  
1  uncooked   none          4           2  
2  uncooked   none          5           5  
3     fried   none       6828        4733  
4     fried   none       7530        5500  
5   steamed   none       6910        5724  
6   steamed   none       7132       15885  
7     baked   none       5885       14885  
8     baked   none        613         217  
9   grilled   none       2845         193  
10  grilled   none       2867         148

After normalization:

       cond spice   fatcontent  saltcontent
1  uncooked  none    0.8888889    0.5714286
2  uncooked  none    1.1111111    1.4285714
3     fried  none 1517.3333333 1352.2857143
4     fried  none 1673.3333333 1571.4285714
5   steamed  none 1535.5555556 1635.4285714
6   steamed  none 1584.8888889 4538.5714286
7     baked  none 1307.7777778 4252.8571429
8     baked  none  136.2222222   62.0000000
9   grilled  none  632.2222222   55.1428571
10  grilled  none  637.1111111   42.2857143

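My question is how to do this for all groups and variables in the data frame. I assume I could use the dplyr package, but I am not sure what the best approach is. Any help is appreciated!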

- karnowski

3 Answers

4

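I think this is what you want. You want to find the mean for each spice condition using the uncooked data points, so that is what I did in the first step. Then I wanted to add fatmean and saltmean from ana to your data frame a, so I used left_join to merge a and ana. This may not be memory-efficient if your data is really large. After that, I did the division in mutate for each spice condition. Finally, I dropped the two helper columns with select to tidy up the result.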

library(dplyr)

### Find mean for each spice condition using uncooked data points
ana <- group_by(filter(a, cond == "uncooked"), spice) %>%
       summarise(fatmean = mean(fatcontent), saltmean = mean(saltcontent))

 #   spice fatmean saltmean
 #1  basil     5.0      1.0
 #2 chilli    21.5      5.0
 #3   none     4.5      3.5

left_join(a, ana, by = "spice") %>%
group_by(spice) %>%
mutate(fatcontent = fatcontent / fatmean,
       saltcontent = saltcontent / saltmean) %>%
select(-c(fatmean, saltmean))

# A part of the results
#       cond spice   fatcontent  saltcontent
#1  uncooked  none    0.8888889    0.5714286
#2  uncooked  none    1.1111111    1.4285714
#3     fried  none 1517.3333333 1352.2857143
#4     fried  none 1673.3333333 1571.4285714
#5   steamed  none 1535.5555556 1635.4285714
#6   steamed  none 1584.8888889 4538.5714286
#7     baked  none 1307.7777778 4252.8571429
#8     baked  none  136.2222222   62.0000000
#9   grilled  none  632.2222222   55.1428571
#10  grilled  none  637.1111111   42.2857143

If you put everything into a single pipeline, it would look something like this:

group_by(filter(a, cond == "uncooked"), spice) %>%
    summarise(fatmean = mean(fatcontent), saltmean = mean(saltcontent)) %>%
    left_join(a, ., by = "spice") %>% #right_join is possible with the dev dplyr
    group_by(spice) %>%
    mutate(fatcontent = fatcontent / fatmean,
           saltcontent = saltcontent / saltmean) %>%
    select(-c(fatmean, saltmean))
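
For readers who cannot use dplyr, the same two steps (compute the uncooked mean per spice, then divide every row by its spice's mean) can be sketched in base R. This is only an illustration and not part of the original answer: `toy` is a small subset of the question's data, and `num_cols`, `uncooked`, `ref`, `idx` and `toy_norm` are names made up for this sketch.

```r
# Toy subset of the question's data: two spices, two conditions (assumption for brevity)
toy <- data.frame(
  cond        = rep(c("uncooked", "fried"), each = 2, times = 2),
  spice       = rep(c("none", "chilli"), each = 4),
  fatcontent  = c(4, 5, 6828, 7530, 25, 18, 2385, 33227),
  saltcontent = c(2, 5, 4733, 5500, 6, 4, 26738, 24738)
)
num_cols <- c("fatcontent", "saltcontent")

# Step 1: mean of the uncooked rows for each spice
uncooked <- toy[toy$cond == "uncooked", ]
ref <- aggregate(uncooked[num_cols],
                 by = list(spice = uncooked$spice),
                 FUN = mean)

# Step 2: look up each row's reference mean via its spice, then divide
idx <- match(toy$spice, ref$spice)
toy_norm <- toy
for (col in num_cols) {
  toy_norm[[col]] <- toy[[col]] / ref[[col]][idx]
}

toy_norm
```

The `match()` lookup plays the role of the `left_join` above: it repeats each spice's reference mean for every row of that spice without adding helper columns to the data frame.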

- jazzurro

Thanks jazzurro, this is exactly what I wanted to do. Cheers, Alex - karnowski

@karnowski No problem. You wrote your question up very well! :) - jazzurro

2

All you need to do is group by both spice and cond, like this:

library(dplyr)
a %>% group_by(spice, cond) %>%
  mutate(fat.norm = fatcontent / mean(fatcontent),
         salt.norm = saltcontent / mean(saltcontent))

# Source: local data frame [30 x 6]
# Groups: spice, cond
# 
#        cond  spice fatcontent saltcontent  fat.norm  salt.norm
# 1  uncooked   none          4           2 0.8888889 0.57142857
# 2  uncooked   none          5           5 1.1111111 1.42857143
# 3     fried   none       6828        4733 0.9511074 0.92504642
# 4     fried   none       7530        5500 1.0488926 1.07495358
# 5   steamed   none       6910        5724 0.9841903 0.52977926
# 6   steamed   none       7132       15885 1.0158097 1.47022074
# 7     baked   none       5885       14885 1.8113266 1.97126208
# 8     baked   none        613         217 0.1886734 0.02873792
# 9   grilled   none       2845         193 0.9961485 1.13196481
# 10  grilled   none       2867         148 1.0038515 0.86803519

Or, if you don't want to specify each column, you can use mutate_each or summarise_each:

group.norm <- function(x) {
  x / mean(x)
}

a %>% group_by(spice, cond) %>%
  mutate_each(funs(group.norm))

Within mutate_each() you can also exclude certain columns or select only specific ones, as in mutate_each(funs(group.norm), -notthisone) or mutate_each(funs(group.norm), onlythisone).
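
Since this answer was written, `mutate_each()` and `funs()` have been deprecated in dplyr. A rough equivalent with `across()` (assuming dplyr 1.0.0 or later) might look like the sketch below. `toy` is a small subset of the question's data and `res` is a name invented for the example. Like the code above, it divides by the mean of each spice/cond group, not by the uncooked mean.

```r
library(dplyr)

# Toy subset of the question's data: two spices, two conditions (assumption for brevity)
toy <- data.frame(
  cond        = rep(c("uncooked", "fried"), each = 2, times = 2),
  spice       = rep(c("none", "chilli"), each = 4),
  fatcontent  = c(4, 5, 6828, 7530, 25, 18, 2385, 33227),
  saltcontent = c(2, 5, 4733, 5500, 6, 4, 26738, 24738)
)

res <- toy %>%
  group_by(spice, cond) %>%
  # Divide each selected column by its own group mean;
  # .names keeps the originals and adds fatcontent.norm / saltcontent.norm
  mutate(across(c(fatcontent, saltcontent),
                ~ .x / mean(.x),
                .names = "{.col}.norm")) %>%
  ungroup()

res
```

Dropping the `.names` argument would overwrite the original columns instead, which matches what `mutate_each()` did.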

- Andrew

Oops, I misread the original post. This only normalizes by each group's own mean, not by the uncooked mean. @jazzurro's answer is the correct one. - Andrew


- talat · Accepted Answer

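A concise way to normalize the data is to reference the "uncooked" condition directly inside the mean calculation, so that you don't need to filter, summarise, join and recalculate. Doing this with mutate_each means you only have to type it once.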

group_by(a, spice) %>%
  mutate_each(funs(./mean(.[cond == "uncooked"])), -cond)

#Source: local data frame [30 x 4]
#Groups: spice
#
#       cond  spice   fatcontent  saltcontent
#1  uncooked   none    0.8888889 5.714286e-01
#2  uncooked   none    1.1111111 1.428571e+00
#3     fried   none 1517.3333333 1.352286e+03
#4     fried   none 1673.3333333 1.571429e+03
#5   steamed   none 1535.5555556 1.635429e+03
#6   steamed   none 1584.8888889 4.538571e+03
#7     baked   none 1307.7777778 4.252857e+03
#8     baked   none  136.2222222 6.200000e+01
#9   grilled   none  632.2222222 5.514286e+01
#10  grilled   none  637.1111111 4.228571e+01
# ... etc
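
With current dplyr releases, where `mutate_each()` and `funs()` are deprecated, the same idea can probably be written with `across()`. The following is a minimal sketch assuming dplyr 1.0.0 or later; `toy` is a small subset of the question's data and `res` is a name chosen for illustration.

```r
library(dplyr)

# Toy subset of the question's data: two spices, two conditions (assumption for brevity)
toy <- data.frame(
  cond        = rep(c("uncooked", "fried"), each = 2, times = 2),
  spice       = rep(c("none", "chilli"), each = 4),
  fatcontent  = c(4, 5, 6828, 7530, 25, 18, 2385, 33227),
  saltcontent = c(2, 5, 4733, 5500, 6, 4, 26738, 24738)
)

res <- toy %>%
  group_by(spice) %>%
  # Within each spice, divide every value by the mean of that spice's uncooked rows
  mutate(across(c(fatcontent, saltcontent),
                ~ .x / mean(.x[cond == "uncooked"]))) %>%
  ungroup()

res
```

A quick sanity check is that the uncooked rows of each spice should average to 1 after normalization, for example `res %>% filter(cond == "uncooked") %>% group_by(spice) %>% summarise(mean(fatcontent))`.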