How to create a new column based on multiple conditions from multiple columns?

13

I'm trying to add a new column to a data frame based on multiple conditions from other columns. I have the following data:

> commute <- c("walk", "bike", "subway", "drive", "ferry", "walk", "bike", "subway", "drive", "ferry", "walk", "bike", "subway", "drive", "ferry")
> kids <- c("Yes", "Yes", "No", "No", "Yes", "Yes", "No", "No", "Yes", "Yes", "No", "No", "Yes", "No", "Yes")
> distance <- c(1, 12, 5, 25, 7, 2, "", 8, 19, 7, "", 4, 16, 12, 7)
> 
> df = data.frame(commute, kids, distance)
> df
   commute kids distance
1     walk  Yes        1
2     bike  Yes       12
3   subway   No        5
4    drive   No       25
5    ferry  Yes        7
6     walk  Yes        2
7     bike   No         
8   subway   No        8
9    drive  Yes       19
10   ferry  Yes        7
11    walk   No         
12    bike   No        4
13  subway  Yes       16
14   drive   No       12
15   ferry  Yes        7
If the following three conditions are met:
commute = walk OR bike OR subway OR ferry
AND
kids = Yes
AND
distance is less than 10

then I want a new column named get.flyer whose value equals "Yes". The final data frame should look like this:

   commute kids distance get.flyer
1     walk  Yes        1       Yes
2     bike  Yes       12          
3   subway   No        5          
4    drive   No       25          
5    ferry  Yes        7       Yes
6     walk  Yes        2       Yes
7     bike   No                   
8   subway   No        8          
9    drive  Yes       19          
10   ferry  Yes        7       Yes
11    walk   No                   
12    bike   No        4          
13  subway  Yes       16          
14   drive   No       12          
15   ferry  Yes        7       Yes

Please try following what is described in this link: https://dev59.com/eG025IYBdhLWcg3whGSx#38523589 - user2100721
3 Answers

17

We can use %in% to compare a column against multiple elements, and & to check that both conditions are TRUE.

library(dplyr)
df %>%
     mutate(get.flyer = c("", "Yes")[(commute %in% c("walk", "bike", "subway", "ferry") & 
           as.character(kids) == "Yes" & 
           as.numeric(as.character(distance)) < 10)+1] )
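The indexing trick in the snippet above relies on logical values being coerced to 0 and 1 when 1 is added, so FALSE selects the first element of c("", "Yes") and TRUE selects the second. A minimal sketch with a made-up logical vector (cond and labels are illustrative names, not part of the original data):

```r
# Hypothetical condition vector; NA stands for a comparison that could not be evaluated.
cond <- c(TRUE, FALSE, TRUE, NA)

# TRUE + 1 is 2, FALSE + 1 is 1, NA + 1 stays NA.
idx <- cond + 1

# Index 1 selects "", index 2 selects "Yes", an NA index yields NA.
labels <- c("", "Yes")[idx]
print(labels)
```

Note that an NA in the condition carries through as NA in the result rather than becoming "".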
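It is better to create the data.frame with stringsAsFactors=FALSE, because the default is TRUE in R versions before 4.0.0. If we check str(df), we find that all the columns are of class factor. Also, if there are missing values, use NA rather than "" so that the class of a numeric column is not converted to something else.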
Suppose we recreate 'df' like this:
distance <- c(1, 12, 5, 25, 7, 2, NA, 8, 19, 7, NA, 4, 16, 12, 7)
df1 <- data.frame(commute, kids, distance, stringsAsFactors=FALSE)

The code above can then be simplified.

df1 %>%
    mutate(get.flyer = c("", "Yes")[(commute %in% c("walk", "bike", "subway", "ferry") &
        kids == "Yes" &
        distance < 10)+1] )

For readability, some people prefer ifelse

df1 %>% 
   mutate(get.flyer = ifelse(commute %in% c("walk", "bike", "subway", "ferry") & 
                kids == "Yes" &
                distance < 10, 
                          "Yes", ""))

This can also be done easily with base R methods

df1$get.flyer <- with(df1, ifelse(commute %in% c("walk", "bike", "subway", "ferry") & 
              kids == "Yes" & 
              distance < 10, 
                       "Yes", ""))

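One thing to keep in mind with the ifelse versions: in the example data the rows with a missing distance also have kids equal to "No", so the combined condition is FALSE for them. If a row matched the other two conditions but had a missing distance, the condition would be NA and ifelse would return NA. A minimal sketch of one way to turn that NA into "", using a small hypothetical data frame df2 (not the original data):

```r
# Hypothetical data: row 2 matches commute and kids but has no distance.
df2 <- data.frame(
  commute  = c("walk", "bike", "drive"),
  kids     = c("Yes", "Yes", "Yes"),
  distance = c(3, NA, 4),
  stringsAsFactors = FALSE
)

cond <- with(df2, commute %in% c("walk", "bike", "subway", "ferry") &
                  kids == "Yes" &
                  distance < 10)

# cond is TRUE, NA, FALSE; treat NA as "condition not met".
df2$get.flyer <- ifelse(!is.na(cond) & cond, "Yes", "")
print(df2)
```

Whether a missing distance should count as "not met" is a design choice; leaving the NA in place may be preferable if you want to flag incomplete rows.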
10

The solution has already been pointed out by @akrun. I would like to present it in a simpler way.

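You can use an ifelse statement to create a column based on one or more conditions. But first you need to change the "encoding" of missing values in the distance column. You use "" to denote a missing value, which converts the whole column to strings and prevents numeric comparison (distance < 10 is not possible). In R, missing values are represented by NA, so your distance column should be defined as: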

distance <- c(1, 12, 5, 25, 7, 2, NA, 8, 19, 7, NA, 4, 16, 12, 7)
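To see why the encoding matters, here is a small sketch comparing the two ways of writing a missing value. The vector names with_blank and with_na are illustrative only:

```r
# Mixing numbers with "" forces the whole vector to character.
with_blank <- c(1, 12, "")
# Mixing numbers with NA keeps the vector numeric.
with_na <- c(1, 12, NA)

print(class(with_blank))
print(class(with_na))

# Numeric comparison works element by element and propagates NA.
below_ten <- with_na < 10
print(below_ten)

# An existing character column can be converted back; "" becomes NA.
recovered <- as.numeric(with_blank)
print(recovered)
```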

The ifelse statement would look like this:

df$get.flyer <- ifelse(
    ( 
        (df$commute %in% c("walk", "bike", "subway", "ferry")) &
        (df$kids == "Yes")                                     &
        (df$distance < 10)
    ),
    1,  # if condition is met, put 1
    0   # else put 0
)
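For the statement above to work, the data frame has to be rebuilt with the NA-based distance column. A self-contained sketch that puts the pieces together on the question's data (df_clean is an illustrative name so the original df is left untouched):

```r
commute <- rep(c("walk", "bike", "subway", "drive", "ferry"), times = 3)
kids <- c("Yes", "Yes", "No", "No", "Yes", "Yes", "No", "No",
          "Yes", "Yes", "No", "No", "Yes", "No", "Yes")
distance <- c(1, 12, 5, 25, 7, 2, NA, 8, 19, 7, NA, 4, 16, 12, 7)

# stringsAsFactors = FALSE keeps commute and kids as character vectors.
df_clean <- data.frame(commute, kids, distance, stringsAsFactors = FALSE)

df_clean$get.flyer <- ifelse(
  df_clean$commute %in% c("walk", "bike", "subway", "ferry") &
    df_clean$kids == "Yes" &
    df_clean$distance < 10,
  1,  # condition met
  0   # condition not met
)

# Row numbers where all three conditions hold.
flagged <- which(df_clean$get.flyer == 1)
print(flagged)
```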

Optional: also consider encoding the other columns differently:

  • For the kids variable, you could use TRUE / FALSE instead of "Yes" / "No"
  • You could use a factor for commute
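A brief sketch of what those two optional recodings could look like, using the first five rows of the example data. The name df3 and the choice to store get.flyer as a logical are assumptions made for illustration:

```r
commute <- c("walk", "bike", "subway", "drive", "ferry")
kids <- c("Yes", "Yes", "No", "No", "Yes")
distance <- c(1, 12, 5, 25, 7)

df3 <- data.frame(
  commute  = factor(commute),   # categorical variable stored as a factor
  kids     = kids == "Yes",     # logical TRUE / FALSE instead of "Yes" / "No"
  distance = distance
)

# With a logical kids column, the comparison to "Yes" is no longer needed.
df3$get.flyer <- with(df3, commute %in% c("walk", "bike", "subway", "ferry") &
                           kids &
                           distance < 10)
print(df3)
```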

3

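For example, check whether the value in the first column is contained in the value in the second column, and write the result to a new column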

df$new_column <- apply(df, 1, function(x) grepl(x['first_column_name'], x['second_column_name'], fixed = TRUE))

Details:

df$new_column <- # create a new column named new_column on df
apply(df, 1, # `apply(df` applies the following function to df; `1` means row by row
function(x) # definition of the function applied to each row; `x` is the current row
grepl(x['first_column_name'], x['second_column_name'], fixed = TRUE)) # body of the function: run grepl to check whether the first_column_name value occurs in the second_column_name value; fixed = TRUE means match the plain text rather than treating it as a regular expression
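A runnable sketch of this approach on a small made-up data frame. The data in df4 is hypothetical, and the column names are the same placeholders used above. The last row shows the effect of fixed = TRUE: "a.c" is matched literally, so it is not found in "abc".

```r
df4 <- data.frame(
  first_column_name  = c("cat", "dog", "a.c"),
  second_column_name = c("concatenate", "bird", "abc"),
  stringsAsFactors = FALSE
)

# Row-wise: is the first column's text contained in the second column's text?
df4$new_column <- apply(df4[, c("first_column_name", "second_column_name")], 1,
  function(x) grepl(x["first_column_name"], x["second_column_name"], fixed = TRUE))

# An alternative that avoids converting each row to a character vector.
alt <- mapply(grepl, df4$first_column_name, df4$second_column_name,
              MoreArgs = list(fixed = TRUE), USE.NAMES = FALSE)

print(df4)
```

Because apply converts the data frame to a character matrix first, the mapply variant may be preferable when the data frame also holds numeric columns.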
