如何在R中迭代地将列传递给函数

3

我对R和Stack Overflow都不太熟悉,而且我在编程方面也没有经验,希望能得到一些帮助。我有一个数据框,想对多个变量执行相同的操作。我写了一个函数来执行所需的操作,但我不知道如何更改列名以便该函数可以单独地作用于每个变量。

#Fake Data

#index for a list of traits, and the current food type for each pet

shelterpets <- base::data.frame(
    ID                  = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10"),
    index_agility       = round(runif(10, min=-0.4, max=0.4), digits = 2),
    index_boldness      = round(runif(10, min=-0.4, max=0.4), digits = 2),
    index_curiousity    = round(runif(10, min=-0.4, max=0.4), digits = 2),
    index_dexterity     = round(runif(10, min=-0.4, max=0.4), digits = 2),
    index_empathy       = round(runif(10, min=-0.4, max=0.4), digits = 2),
    food_type           = c("diet_food", "diet_food", "regular_food", "diet_food", "regular_food", "regular_food", "regular_food", "diet_food", "diet_food", "regular_food")
                                )


 
# function to look at index for each trait, current food type, and suggest changes to food type
function(petfood) {
 
# variable to capture predicted food type: diet_food, regular_food
shelterpets$food10_trait  <- NA

 
#pet previously on diet_food and above 0.10 then confirm diet_food, else predict regular_food
shelterpets$food10_trait  <- ifelse(shelterpets$food_type == "diet_food",
                                        ifelse(shelterpets$index_trait >= 0.10, "diet_food",  "regular_food"),
                                    shelterpets$food10_trait)

 
#pet previously on regular_food and below -0.10 then confirm regular_food, else predict diet_food
shelterpets$food10_trait  <- ifelse(shelterpets$food_type == "regular_food",
                                        ifelse(shelterpets$index_trait <=  -0.10, "regular_food",  "diet_food" ),
                                    shelterpets$food10_trait)

 
#typecast
shelterpets$food10_trait  <- as.factor(shelterpets$food10_trait)

 
#update trait so replace "trait" with "agility", then "boldness", etc.
       }

我想要它看起来像什么:

 ID index_agility index_boldness index_curiousity index_dexterity index_empathy    food_type food10_agility food10_boldness
1  1          0.26          -0.28             0.17            0.17          0.28    diet_food      diet_food    regular_food
2  2          0.17          -0.12            -0.25            0.06          0.06    diet_food      diet_food    regular_food
3  3          0.24           0.14            -0.13            0.25          0.28 regular_food      diet_food       diet_food
4  4         -0.07           0.30            -0.32            0.06          0.23    diet_food   regular_food       diet_food
5  5          0.33           0.00             0.13            0.23         -0.18 regular_food      diet_food       diet_food
6  6          0.17          -0.20             0.01            0.25          0.17 regular_food      diet_food    regular_food

  food10_curiousity food10_dexterity food10_empathy
1         diet_food        diet_food      diet_food
2      regular_food     regular_food   regular_food
3      regular_food        diet_food      diet_food
4      regular_food     regular_food      diet_food
5         diet_food        diet_food   regular_food
6         diet_food        diet_food      diet_food

我创建了这个来开始


#get names in array to hopefully pass to the function, so drop ID and food_type
pet <- as.matrix(colnames(shelterpets))
pet <- pet[-c(1,7),,drop=F]

我看到了这些问题,但我还不太明白它们的工作原理,无法适应它们:

  1. 如何将数据框的列名传递给函数
  2. 如何迭代地将参数传递给R函数

感谢您能给出的任何指导。


1
理想情况下,您的数据应该采用长格式,具有单个“指标”和单个指示列,例如敏捷性大胆度等。然后,就不需要循环了!只需两个ifelse调用即可计算food_trait列。 - Parfait
谢谢你的想法!当数据集变得非常大时(真实数据集有超过100k个观测值),这种方法是否有效,还是最适合较小的数据集? - plover
1
在处理和存储数据时,即使是数百万行的长数据通常比宽数据更好。 - Parfait
3个回答

3

你的尝试已经接近了解决方案,但是你可以循环遍历每个trait并向你的数据框中分配一个新列,该列包含函数的结果。我做了一些小修改:

ifelse(shelterpets$index_trait

to

ifelse(shelterpets[, paste0('index_', trait)]

输入可以是每个trait作为一个字符字符串,返回值可以只是as.factor(...)

# function to look at index for each trait, current food type, and suggest changes to food type
f <- function(trait, data = shelterpets) {
  
  # variable to capture predicted food type: diet_food, regular_food
  data$food10_trait  <- NA
  
  
  #pet previously on diet_food and above 0.10 then confirm diet_food, else predict regular_food
  data$food10_trait  <- ifelse(data$food_type == "diet_food",
                                      ifelse(data[, paste0('index_', trait)] >= 0.10, "diet_food",  "regular_food"),
                               data$food10_trait)
  
  
  #pet previously on regular_food and below -0.10 then confirm regular_food, else predict diet_food
  data$food10_trait  <- ifelse(data$food_type == "regular_food",
                                      ifelse(data[, paste0('index_', trait)] <=  -0.10, "regular_food",  "diet_food" ),
                               data$food10_trait)
  
  
  #typecast
  as.factor(data$food10_trait)
  
  
  #update trait so replace "trait" with "agility", then "boldness", etc.
}

## test
f('agility')
# [1] diet_food    diet_food    regular_food regular_food diet_food    regular_food regular_food diet_food    regular_food diet_food   
# Levels: diet_food regular_food

将其应用于每个特征

traits <- gsub('.*_', '', grep('index', names(shelterpets), value = TRUE))
shelterpets[, paste0('food10_', traits)] <- lapply(traits, f)

#    ID index_agility index_boldness index_curiousity index_dexterity index_empathy    food_type food10_agility food10_boldness food10_curiousity food10_dexterity food10_empathy
# 1   1          0.06          -0.34            -0.25            0.28          0.22    diet_food   regular_food    regular_food      regular_food        diet_food      diet_food
# 2   2          0.37          -0.01            -0.13            0.22          0.35    diet_food      diet_food    regular_food      regular_food        diet_food      diet_food
# 3   3          0.33          -0.07            -0.03            0.20          0.22 regular_food      diet_food       diet_food         diet_food        diet_food      diet_food
# 4   4          0.07          -0.23            -0.14           -0.29          0.05    diet_food   regular_food    regular_food      regular_food     regular_food   regular_food
# 5   5          0.23           0.06             0.09            0.24         -0.17 regular_food      diet_food       diet_food         diet_food        diet_food   regular_food
# 6   6         -0.27          -0.19            -0.23            0.37         -0.35 regular_food   regular_food    regular_food      regular_food        diet_food   regular_food
# 7   7          0.17           0.30            -0.14           -0.14         -0.11 regular_food      diet_food       diet_food      regular_food     regular_food   regular_food
# 8   8         -0.22           0.13             0.21           -0.06          0.08    diet_food   regular_food       diet_food         diet_food     regular_food   regular_food
# 9   9         -0.25           0.21            -0.02            0.09         -0.29    diet_food   regular_food       diet_food      regular_food     regular_food   regular_food
# 10 10         -0.35           0.39            -0.34            0.20          0.13 regular_food   regular_food       diet_food      regular_food        diet_food      diet_food

谢谢!我不知道paste0的想法或命令,现在正在学习grep和sub。我非常感谢您解释得清晰明了。 - plover
我之前不知道 '.*_' 这部分,但是在这里找到了另一个有用的问题,再次感谢! https://dev59.com/ymgt5IYBdhLWcg3w2A77 - plover

2

如前所述,考虑使用长数据格式进行分析操作,包括合并、绘图、建模等,而不是宽数据格式通常更适合展示和报告。虽然数据结果会有更多的行,但你可以避免循环向量化操作。另外,考虑使用within来避免在新列计算中重复数据框名称。

index_cols <- names(shelterpets)[grep("index", names(shelterpets))]

shelterpets_long <- stats::reshape(
    shelterpets, varying=index_cols, times=index_cols,
    v.names="value", timevar="indicator", ids=NULL, 
    idvar=c("ID", "food_type"), direction="long",
    new.row.names = 1:1E5
)

shelterpets_long <- base::within(shelterpets_long, {
  # pet previously on diet_food and above 0.10 then confirm diet_food, else predict regular_food
  food10_trait <- ifelse(food_type == "diet_food", 
                         ifelse(value >= 0.10, "diet_food", "regular_food"),
                         NA)
  
  # pet previously on regular_food and below -0.10 then confirm regular_food, else predict diet_food
  food10_trait  <- ifelse(food_type == "regular_food",
                          ifelse(value <=  -0.10, "regular_food",  "diet_food" ),
                          food10_trait)
  # typecast
  food10_trait  <- as.factor(food10_trait)
})

head(shelterpets_long)
#   ID    food_type     indicator value food10_trait
# 1  1    diet_food index_agility -0.29 regular_food
# 2  2    diet_food index_agility -0.39 regular_food
# 3  3 regular_food index_agility  0.23    diet_food
# 4  4    diet_food index_agility -0.36 regular_food
# 5  5 regular_food index_agility -0.34 regular_food
# 6  6 regular_food index_agility -0.01    diet_food

如果需要更快的转换为长格式,请使用以下代码:

shelterpets_long2 <- base::data.frame(
    base::expand.grid(ID=unique(shelterpets$ID), indicator=index_cols, 
                      stringsAsFactors = FALSE),
    food_type = shelterpets$food_type,
    index = base::matrix(data.matrix(shelterpets[index_cols]), 
                         ncol=1, byrow=TRUE)
)

all.equal(shelterpets_long[c("ID", "food_type", "indicator", "value")],
          shelterpets_long2[c("ID", "food_type", "indicator", "value")])
# [1] TRUE

如果您需要将页面宽度调整为宽屏,可以使用以下更快的方法。这是对@Moody_Mudskipper的答案进行微调得到的。根据需要,通过ID变量将更改后的内容merge到原始内容shelterpet中:

### FASTER RESHAPE WIDE
### (https://dev59.com/d-k5XIcBkEYKwwoY593M#55973705)
matrix_spread <- function(df1, id, key, value){
  unique_ids <- unique(df1[[key]])
  mat <- matrix( df1[[value]], ncol=length(unique_ids), byrow = FALSE)
  df2 <- data.frame(unique(df1[[id]]), mat)
  names(df2) <- c(id, paste0(value,"_",unique_ids))
  df2
}

shelterpets_wide <- matrix_spread(
    shelterpets_long, 
    id = "ID",
    key = "indicator",
    value = "food10_trait"
)

shelterpets_wide
#    ID food10_trait_index_agility food10_trait_index_boldness food10_trait_index_curiousity food10_trait_index_dexterity food10_trait_index_empathy
# 1   1               regular_food                regular_food                     diet_food                 regular_food               regular_food
# 2   2               regular_food                regular_food                  regular_food                 regular_food               regular_food
# 3   3                  diet_food                   diet_food                     diet_food                    diet_food                  diet_food
# 4   4               regular_food                regular_food                  regular_food                 regular_food               regular_food
# 5   5               regular_food                   diet_food                     diet_food                 regular_food                  diet_food
# 6   6                  diet_food                regular_food                     diet_food                 regular_food               regular_food
# 7   7               regular_food                   diet_food                     diet_food                    diet_food               regular_food
# 8   8               regular_food                   diet_food                     diet_food                 regular_food                  diet_food
# 9   9                  diet_food                regular_food                  regular_food                 regular_food               regular_food
# 10 10               regular_food                regular_food                  regular_food                 regular_food               regular_food

这非常有帮助,谢谢。快速问题:在命令“new.row.names = 1:1E5”中,1E5是什么意思?此外,当我尝试从长格式转换为宽格式时,foodtrait值不匹配,但我不确定原因。 - plover
1
1:1E51:100000 的简写,其中的数字 5 表示有多少个零。这个范围应该根据实际最终数据进行调整。在使用 matrix_spread 函数时,如果出现不匹配,请尝试 byrow = FALSE。请查看已调整结果的编辑。确保在开头使用 set.seed(###) 来再现随机数据。 - Parfait

2
您可以简单地编写以下类似的函数:
my_function<- function(x, y){
  ifelse(y == "diet_food",
       ifelse(x >= 0.10,  "diet_food", "regular_food"),
       ifelse(x <= -0.10, "regular_food",  "diet_food"))
}

data.frame(lapply(df[2:6], my_function, y=df[,7]))
  index_agility index_boldness index_curiousity index_dexterity index_empathy
1     diet_food   regular_food        diet_food       diet_food     diet_food
2     diet_food   regular_food     regular_food    regular_food  regular_food
3     diet_food      diet_food     regular_food       diet_food     diet_food
4  regular_food      diet_food     regular_food    regular_food     diet_food
5     diet_food      diet_food        diet_food       diet_food  regular_food
6     diet_food   regular_food        diet_food       diet_food     diet_food

然后您可以使用cbind将结果绑定到原始的df。您也可以使用sapply代替lapply


谢谢你提供这个想法! - plover

网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接