Using dplyr's mutate() in R to create a new variable based on the contents of another variable

Question

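I want to search the contents of a variable placement and create a new variable term based on the pattern that is found. Here is a minimal example...

First, I create a function that searches for the pattern: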

calcterm <- function(x){    # calcterm takes a column argument to read
    print(x)
    if (x %in% '_fa_') {
            return ('fall')
    } else if (x %in% '_wi_') {
            return('winter')
    } else if (x %in% '_sp_') {
            return('spring')
    } else {return('summer')
    }
}

I'll create a small data frame and then pass it to dplyr's tbl_df:

placement <- c('pn_ds_ms_fa_th_hrs','pn_ds_ms_wi_th_hrs' ,'pn_ds_ms_wi_th_hrs')
hours <- c(1230, NA, 34)

d <- data.frame(placement, hours)

library(dplyr)

d <- tbl_df(d)

The table d now looks like this:

>d
    Source: local data frame [3 x 2]

       placement hours
          (fctr) (dbl)
1 pn_ds_ms_fa_th_hrs  1230
2 pn_ds_ms_wi_th_hrs    NA
3 pn_ds_ms_wi_th_hrs    34

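Next, I use mutate to apply my function. The goal is to read the contents of placement and create a new variable that takes the value fall, winter, spring, or summer, depending on the pattern found in the placement column.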

d %>% mutate(term=calcterm(placement))

The output leaves me with...

[1] pn_ds_ms_fa_th_hrs pn_ds_ms_wi_th_hrs pn_ds_ms_wi_th_hrs
Levels: pn_ds_ms_fa_th_hrs pn_ds_ms_wi_th_hrs
Source: local data frame [3 x 3]

       placement hours   term
          (fctr) (dbl)  (chr)
1 pn_ds_ms_fa_th_hrs  1230 summer
2 pn_ds_ms_wi_th_hrs    NA summer
3 pn_ds_ms_wi_th_hrs    34 summer

Warning messages:
    1: In if (x %in% "_fa_") { :
      the condition has length > 1 and only the first element will be used
    2: In if (x %in% "_wi_") { :
      the condition has length > 1 and only the first element will be used
    3: In if (x %in% "_sp_") { :
      the condition has length > 1 and only the first element will be used

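So clearly I've written something wrong from the start... perhaps %in% could be replaced with a grep pattern? I'm not sure how to approach this.

Thanks.

Update

Following the answer below, I've updated my full pipeline to show how I'm implementing it. The data I'm working with is in "wide" form; I first flip it on its axis and extract the useful information from its column names. This example works, but on my real data the mutate() step gives the following message: Error: invalid subscript type 'list'

Note that after summarise(), I get a warning: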

Warning message:
attributes are not identical across measure variables; they will be dropped

Perhaps this is related to the next step failing, since this warning doesn't appear in my example?

set.seed(1) 

dfmaker <- function() {
        setNames(
                data.frame(
                        replicate(5, sample(c(NA, 300:500), 4, TRUE), FALSE)), 
                c('pn_ds_ms_fa_th_hrs','rn_ds_ms_wi_th_stu' ,'adn_ds_ms_wi_th_hrs','pn_ds_ms_wi_th_hrs' ,'rn_bsn_ds_ms_wi_th_hrs'))
}


d <- dfmaker()

library(dplyr)
library(tidyr)   # gather() is exported by tidyr, not dplyr

d <- tbl_df(d)

grepl_vec_pattern = Vectorize(grepl, 'pattern')

calcterm = function(s) {
        require(pryr)
        s = as.character(s)
        grepped_patterns = grepl_vec_pattern(s, pattern = c('_sp', '_su', '_fa', '_wi'))
        stopifnot(any(rowSums(grepped_patterns) == 1))   # Ensure that there is exactly one match
        reduce_to_colname_with_true = apply(grepped_patterns, 1, compose(names, which))
        lut_table = c('_sp' = 'spring', '_su' = 'summer', '_fa' = 'fall', '_wi' = 'winter')
        lut_table[reduce_to_colname_with_true]
}

select(d, matches("^pn_|^adn_|^bsn_"), -starts_with("rn_bsn")) %>%  # all the pn, adn, bsn programs, for all information 
        select(contains("_hrs") ) %>%   # takes out just the hours
        gather(placement, hours) %>%  # flip it!
        group_by(placement) %>%  # gather all the schools into a single observation (replicated placement values at this point)
        summarise(sumHours = sum(hours, na.rm=T)) %>%
        mutate(term = calcterm(placement))

- M. Elliott

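%in% is for exact matching, not regular expression matching. And mutate has no special capabilities that can't be reproduced in base R, so dplyr isn't needed at all for this operation. - David Arenburg

You can also do all of this in Excel, but that doesn't mean you shouldn't use R. The OP asked how to do something in dplyr; you can answer the question or not. This is a perfectly valid use of dplyr. - Paul Hiemstra

@PaulHiemstra The title of this question is "Using dplyr's mutate() to... etc.", not "How to find matches... etc." My point is that to solve this problem you shouldn't focus on how to use dplyr::mutate (a specific tool), since there is nothing special about it, but rather concentrate on solving the problem itself. - David Arenburg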

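2 Answers

The problem is that you can't put a logical vector of length greater than one into an if statement. R responds by using only the first element of the logical vector and issues the warning you received.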
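
The difference between the two kinds of matching can be sketched in base R. This is a minimal illustration, not part of the original answer; the vector x simply reuses two of the placement values from the question. Note also that from R 4.2.0 onward, a condition of length greater than one inside if() is an error rather than a warning.

```r
x <- c("pn_ds_ms_fa_th_hrs", "pn_ds_ms_wi_th_hrs")

# %in% tests whether each element is exactly equal to something on the
# right-hand side, so a substring such as "_fa_" never matches.
exact <- x %in% "_fa_"
exact
# [1] FALSE FALSE

# grepl() tests for a substring (or regex) match and returns one
# logical value per element of x.
partial <- grepl("_fa_", x, fixed = TRUE)
partial
# [1]  TRUE FALSE
```

Both results have one element per row, which is why neither can be used directly as the condition of a single if().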

To solve this, I would use grepl. First, let's create some example data:

s = c('bla_wi', 'spam_sp', 'egg_sp', 'ham_fa')

Next, note that grepl cannot take several search patterns at once. Fortunately, we can work around this by vectorizing over the pattern argument:

grepl_vec_pattern = Vectorize(grepl, 'pattern')
grepped_patterns = grepl_vec_pattern(s, pattern = c('_sp', '_su', '_fa', '_wi'))
grepped_patterns
#        _sp   _su   _fa   _wi
# [1,] FALSE FALSE FALSE  TRUE
# [2,]  TRUE FALSE FALSE FALSE
# [3,]  TRUE FALSE FALSE FALSE
# [4,] FALSE FALSE  TRUE FALSE

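Each column of grepped_patterns indicates whether the corresponding pattern matched. Next, we want to reduce this to a vector that lists which pattern matched each element (assuming exactly one pattern matches).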

library(pryr)
reduce_to_colname_with_true = apply(grepped_patterns, 1, compose(names, which))
reduce_to_colname_with_true
# [1] "_wi" "_sp" "_sp" "_fa"

Note that compose(A, B) is equivalent to A(B()), i.e. nested function calls. I chose compose to avoid an anonymous function such as function(x) names(which(x)).
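
For readers without pryr installed, the anonymous-function form mentioned above can be sketched as follows. The small matrix m is a hypothetical stand-in for grepped_patterns, built by hand so the block runs on its own.

```r
# Hypothetical 2 x 2 stand-in for grepped_patterns (filled column-wise):
# row 1 matches "_wi", row 2 matches "_sp".
m <- matrix(c(FALSE, TRUE, TRUE, FALSE), nrow = 2,
            dimnames = list(NULL, c("_sp", "_wi")))

# For each row, keep the name of the column that is TRUE.
matched <- apply(m, 1, function(x) names(which(x)))
matched
# [1] "_wi" "_sp"
```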

Now that we have this information, we need to translate _sp to spring, and so on.

lut_table = c('_sp' = 'spring', '_su' = 'summer', '_fa' = 'fall', '_wi' = 'winter')
lut_table[reduce_to_colname_with_true]
#      _wi      _sp      _sp      _fa 
# "winter" "spring" "spring"   "fall"

We now have the desired result. To use it in mutate, we can wrap it all in a function:

calcterm = function(s) {
    require(pryr)
    s = as.character(s)
    grepped_patterns = grepl_vec_pattern(s, pattern = c('_sp', '_su', '_fa', '_wi'))
    stopifnot(all(rowSums(grepped_patterns) == 1))   # Ensure that every element has exactly one match
    reduce_to_colname_with_true = apply(grepped_patterns, 1, compose(names, which))
    lut_table = c('_sp' = 'spring', '_su' = 'summer', '_fa' = 'fall', '_wi' = 'winter')
    lut_table[reduce_to_colname_with_true]
}
library(dplyr)
df = data.frame(s = s) %>% mutate(term = calcterm(s))
df
        s   term
1  bla_wi winter
2 spam_sp spring
3  egg_sp spring
4  ham_fa   fall
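
One caveat worth checking, sketched here under the assumption that some input matches zero or several patterns (the strings in s_bad are hypothetical): apply() then returns a list instead of a character vector, and indexing the lookup vector with a list fails with "invalid subscript type 'list'". Counting the matches per row first makes the offending values easy to find.

```r
grepl_vec_pattern <- Vectorize(grepl, "pattern")
patterns <- c("_sp", "_su", "_fa", "_wi")

# Hypothetical inputs: the second matches no pattern, the third matches two.
s_bad <- c("bla_wi", "no_match", "egg_sp_fa")
hits <- grepl_vec_pattern(s_bad, pattern = patterns)

# Number of matching patterns per input string.
n_hits <- rowSums(hits)

# Rows yield results of length 1, 0 and 2, so apply() cannot simplify
# them to a vector and returns a list.
per_row <- apply(hits, 1, function(x) names(which(x)))
is.list(per_row)
# [1] TRUE

# Inputs that do not have exactly one match.
offenders <- s_bad[n_hits != 1]
offenders
# [1] "no_match"  "egg_sp_fa"
```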

- Paul Hiemstra

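Ah - I forgot about lut! Yes!! Thanks, this is very helpful! Although in some cases (perhaps this one) I might implement the approach @DavidArenburg proposed, you're right - I really wanted to see how to put it together with the tool I specified. Learning different approaches and how/why they work helps me make more effective decisions in the future. - M. Elliott

Do I need to put grepl_vec_pattern = Vectorize(grepl, 'pattern') inside the calcterm function? - M. Elliott

@melliot No. If you define it outside the function, in the global environment, it will be found. - Paul Hiemstra

I'm getting the following error: Error: invalid subscript type 'list'. There is also a warning message: attributes are not identical across measure variables; they will be dropped. - M. Elliott

You need to provide a reproducible example, otherwise it's hard to give any feedback... - Paul Hiemstra

Yes, of course, sorry about that. When I created my example the function ran perfectly, and I'll add the full pipeline I'm running to the post... However, on my real data it doesn't work, and I can't see why... This example is the closest I can get to replicating my actual work, so I'll just have to keep looking for the difference! - M. Elliott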

- David Arenburg · Accepted Answer

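A simple and very efficient approach is to create a simple lookup/pattern vector and then combine the (very efficient) stringi::stri_detect_fixed with data.table. This solution should scale well even for large data sets.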

library(stringi)
library(data.table)
Lookup <- c("fall", "winter", "spring")
Patterns <- c("fa", "wi", "sp")
setDT(d)[, term := Lookup[stri_detect_fixed(placement, Patterns)], by = placement]
d[is.na(term), term := "summer"]
d
#             placement hours   term
# 1: pn_ds_ms_fa_th_hrs  1230   fall
# 2: pn_ds_ms_wi_th_hrs    NA winter
# 3: pn_ds_ms_wi_th_hrs    34 winter

If we insist on using dplyr, we need to create a helper function to handle the case where no match is found (which data.table handles automatically)

f <- function(x, Lookup, Patterns) {
  temp <- Lookup[stri_detect_fixed(x[1L], Patterns)]
  if(!length(temp)) return("summer")
  temp
}
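
Why the length check is needed can be seen in isolation. The sketch below uses base R's grepl in place of stri_detect_fixed so that it runs without stringi, and the "su" placement value is a hypothetical input that matches none of the patterns.

```r
Lookup <- c("fall", "winter", "spring")
Patterns <- c("fa", "wi", "sp")

# One logical value per pattern for a single placement string.
detect <- function(x) vapply(Patterns, grepl, logical(1), x = x, fixed = TRUE)

matched <- Lookup[detect("pn_ds_ms_fa_th_hrs")]
matched
# [1] "fall"

# Hypothetical value with no matching pattern: the subset is empty.
unmatched <- Lookup[detect("pn_ds_ms_su_th_hrs")]
length(unmatched)
# [1] 0
```

A zero-length result would most likely be rejected by mutate(), which expects one value per row or a single value per group, hence the fallback to "summer" in f.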

d %>%
  group_by(placement) %>%
  mutate(term = f(placement, Lookup, Patterns))

# Source: local data frame [3 x 3]
# Groups: placement [2]
# 
#           placement hours   term
#               (fctr) (dbl)  (chr)
# 1 pn_ds_ms_fa_th_hrs  1230   fall
# 2 pn_ds_ms_wi_th_hrs    NA winter
# 3 pn_ds_ms_wi_th_hrs    34 winter
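
As a further alternative, and only a sketch rather than either answerer's method, the original calcterm can be made vectorized with plain grepl and logical indexing, so that it returns one value per row and needs no grouping. The name calcterm_vec is hypothetical; the patterns and the "summer" default follow the question.

```r
calcterm_vec <- function(x) {
  x <- as.character(x)                # match on labels if x is a factor
  term <- rep("summer", length(x))    # default when no pattern is found
  term[grepl("_fa_", x, fixed = TRUE)] <- "fall"
  term[grepl("_wi_", x, fixed = TRUE)] <- "winter"
  term[grepl("_sp_", x, fixed = TRUE)] <- "spring"
  term
}

placement <- c("pn_ds_ms_fa_th_hrs", "pn_ds_ms_wi_th_hrs", "pn_ds_ms_wi_th_hrs")
hours <- c(1230, NA, 34)
d <- data.frame(placement, hours)

d$term <- calcterm_vec(d$placement)
d$term
# [1] "fall"   "winter" "winter"
```

Because the function returns a vector of the same length as its input, it should work equally inside mutate, as in d %>% mutate(term = calcterm_vec(placement)).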