Matching text against a data frame column in R

4
I have a vector of words in R.
words = c("Awesome","Loss","Good","Bad")

And I have the following data frame in R:

ID           Response
1            Today is an awesome day
2            Yesterday was a bad day,but today it is good
3            I have losses today
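
For anyone who wants to reproduce this, the data frame above could be built roughly as follows. This is only a sketch: the name df is taken from the code further down, and stringsAsFactors = FALSE is an assumption that keeps Response as a character column on older R versions.

```r
# Build the example data frame shown in the question.
# stringsAsFactors = FALSE is an assumption: before R 4.0 the default was TRUE,
# which would turn Response into a factor.
df <- data.frame(
  ID = 1:3,
  Response = c("Today is an awesome day",
               "Yesterday was a bad day,but today it is good",
               "I have losses today"),
  stringsAsFactors = FALSE
)
```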

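What I want is for the words that match in the Response column to be extracted and inserted into a new column of the data frame. The final output should look like this:
ID           Response                                        Match          Count
1            Today is an awesome day                         Awesome        1
2            Yesterday was a bad day,but today it is good    Bad,Good       2
3            I have losses today                             Loss           1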

I have tried the following in R:
sapply(words,grepl,df$Response)

It matches the words, but how do I get my data frame into the desired format? Please help.
4 Answers

5

Using base R (thanks to PereG for the help with the concise df$Count solution):

# extract the list of matching words
x <- sapply(words, function(x) grepl(tolower(x), tolower(df$Response)))

# paste the matching words together
df$Words <- apply(x, 1, function(i) paste0(names(i)[i], collapse = ","))

# count the number of matching words
df$Count <- apply(x, 1, function(i) sum(i))

# df
#  ID                                     Response    Words Count
#1  1                      Today is an awesome day  Awesome     1
#2  2 Yesterday was a bad day,but today it is good Good,Bad     2
#3  3                          I have losses today     Loss     1

2
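This question seems to involve partial matching (losses and loss). df$Count <- apply(sapply(tolower(words), grepl, df$Response), 1, sum) would work. - PereG
@PereG Thanks for pointing that out. I missed that rather important point. You should post it as an answer! You deserve the credit. - joel.wilson
Thanks, but I couldn't find a solution for df$Words, so I preferred to comment on your answer, which is more complete. - PereG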
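
To make the partial-matching point from the comments concrete, here is a small sketch comparing a substring match with a whole-word match. The names responses, partial and whole are introduced here for illustration only, and the \\b word-boundary variant is not something the original posts use.

```r
words <- c("Awesome", "Loss", "Good", "Bad")
responses <- c("Today is an awesome day",
               "Yesterday was a bad day,but today it is good",
               "I have losses today")

# Substring match, ignoring case: "Loss" is found inside "losses".
partial <- sapply(words, grepl, responses, ignore.case = TRUE)

# Whole-word match: \\b anchors the pattern at word boundaries,
# so "Loss" no longer matches "losses".
whole <- sapply(paste0("\\b", words, "\\b"), grepl, responses,
                ignore.case = TRUE)
colnames(whole) <- words

partial[3, "Loss"]  # TRUE
whole[3, "Loss"]    # FALSE
```

Since the desired output in the question lists Loss for "I have losses today", the substring behaviour is the one the asker appears to want.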

0
Here is another option, which stores the matches in a list:
vgrepl <- Vectorize(grepl, "pattern")
df$Match <- lapply(df$Response, function(x) 
  words[vgrepl(words, x, ignore.case = TRUE)]
)
df$Count <- lengths(df$Match)
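
Because Match is a list column here, it prints differently from the comma-separated string the question asks for. A minimal sketch of one way to collapse it, assuming the example data from the question (MatchText is a hypothetical column name):

```r
words <- c("Awesome", "Loss", "Good", "Bad")
df <- data.frame(
  ID = 1:3,
  Response = c("Today is an awesome day",
               "Yesterday was a bad day,but today it is good",
               "I have losses today"),
  stringsAsFactors = FALSE
)

# Same approach as above: one character vector of matched words per row.
vgrepl <- Vectorize(grepl, "pattern")
df$Match <- lapply(df$Response, function(x)
  words[vgrepl(words, x, ignore.case = TRUE)]
)

# Collapse each list element into a single comma-separated string.
# A row without matches becomes the empty string "".
df$MatchText <- vapply(df$Match, paste, character(1), collapse = ",")
df$MatchText
```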

0

With the stringr package, assuming the data frame is called df, the following code does the same thing:

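library(stringr)

# one column per word: the lower-cased match, or '' where the word is absent
matches <- sapply(seq_along(words), function(i) str_extract_all(tolower(df$Response),
                                                     tolower(words[i]), simplify = TRUE))
# join only the non-empty matches, so adjacent words are never glued together
df$Match <- apply(matches, 1, function(x) paste(x[x != ''], collapse = ','))
df$Count <- apply(matches, 1, function(x) sum(x != ''))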
head(df)

#  ID                                     Response    Match Count
#1  1                      Today is an awesome day  awesome     1
#2  2 Yesterday was a bad day,but today it is good good,bad     2
#3  3                          I have losses today     loss     1

0

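A tidyverse solution/suggestion. It reports the actual matched text rather than the case-insensitive pattern that matched, but it should be good enough for illustration.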

library(stringr)
library(dplyr)
library(purrr)

words <- c("Awesome", "Loss", "Good", "Bad")
"ID;Response
1;Today is an awesome day
2;Yesterday was a bad day,but today it is good
3;I have losses today" %>%
  textConnection %>%
  read.table(header = TRUE, 
             sep = ";",
             stringsAsFactors = FALSE) ->
  d

d %>%
  mutate(matches = str_extract_all(
                     Response,
                     str_c(words, collapse = "|") %>% regex(ignore_case = TRUE)),
         Match = map_chr(matches, str_c, collapse = ","),
         Count = map_int(matches, length))
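
If the spelling from words (for example "Awesome" rather than "awesome") is wanted in the output, the matched text can be mapped back to the original vector. The sketch below uses base R (regmatches and gregexpr) instead of the tidyverse functions above, purely to stay dependency-free; the names responses, found and canonical are illustrative assumptions.

```r
words <- c("Awesome", "Loss", "Good", "Bad")
responses <- c("Today is an awesome day",
               "Yesterday was a bad day,but today it is good",
               "I have losses today")

# Find every case-insensitive occurrence of any of the words.
pattern <- paste(words, collapse = "|")
found <- regmatches(responses,
                    gregexpr(pattern, responses, ignore.case = TRUE))

# Map each matched string back to the spelling used in `words`.
canonical <- lapply(found, function(m) words[match(tolower(m), tolower(words))])

Match <- vapply(canonical, paste, character(1), collapse = ",")
Count <- lengths(canonical)
```

With this approach the words appear in the order they occur in each response, and a word that occurs twice in one response is counted twice.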
