Extracting words from a string in R

Question


I am trying to extract pieces of a string from a matched pattern and create new variables. I have tried many functions from the "strings" package but can't seem to get a result. The example below is made-up data. I would like to pull each piece out of the character string and store it in a new column of a new data frame.

Example

ex <- c("The Accountant (2016)Crime (vodmovies112.blogspot.com.es)","Miss Peregrine's Home for Peculiar Children (2016)FantasySci-Fi (vodmovies112.blogspot.com.es)","Fantastic Beasts And Where To Find Them (2016) TSAdventure (openload.co)","Ben-Hur (2016) HDActionAdventure (vodmovies112.blogspot.com.es)","The Remains (2016) 1080p BlurayHorror (openload.co)" ,"Suicide Squad (2016) HDAction (openload.co)")

> ex
[1] "The Accountant (2016)Crime (vodmovies112.blogspot.com.es)"
[2] "Miss Peregrine's Home for Peculiar Children (2016)FantasySci-Fi (vodmovies112.blogspot.com.es)"
[3] "Fantastic Beasts And Where To Find Them (2016) TSAdventure (openload.co)"
[4] "Ben-Hur (2016) HDActionAdventure (vodmovies112.blogspot.com.es)"
[5] "The Remains (2016) 1080p BlurayHorror (openload.co)"
[6] "Suicide Squad (2016) HDAction (openload.co)"

genres <- c("Action","Adventure","Animation","Biography",
        "Comedy","Crime","Documentary","Drama","Family",
        "Fantasy","Film-Noir","History","Horror","Music",
        "Musical","Mystery","Romance","Sci-Fi","Sport","Thriller",
        "War","Western")

genres <- paste0("^",genres,"|")
genres[22] <- "^Western"
> genres
[1] "^Action|"      "^Adventure|"   "^Animation|"   "^Biography|"
[5] "^Comedy|"      "^Crime|"       "^Documentary|" "^Drama|"
[9] "^Family|"      "^Fantasy|"     "^Film-Noir|"   "^History|"
[13] "^Horror|"      "^Music|"       "^Musical|"     "^Mystery|"
[17] "^Romance|"     "^Sci-Fi|"      "^Sport|"       "^Thriller|"
[21] "^War|"         "^Western"

Trying to achieve

> df
           title year                       domain genre
1 The Accountant 2016 vodmovies112.blogspot.com.es Crime

- mikeymike

2 Answers


- etienne · Answer 1

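Here is one possibility:

# Assumes ex and genres are the plain vectors listed at the end of this answer
# Split each string on the parentheses: title, year, genre text, domain
temp <- strsplit(ex, "\\(|\\)")
df <- setNames(as.data.frame(lapply(1:4,function(i) sapply(temp,"[",i)), stringsAsFactors = FALSE), c("title", "year", "genre", "domain"))
df <- df[ , c("title", "year", "domain", "genre")]
# For each row, find the indices of every genre name that occurs in the genre text
correct <- sapply(seq_along(df$genre), function(y) which(lengths(sapply(seq_along(genres), function(x) grep(genres[x], df$genre[y])))>0))
# Join the matched genre names with a space and overwrite the raw genre text
correct <- lapply(correct, function(x) paste0(genres[x], collapse = " "))
df$genre <- unlist(correct)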

df
#                                         title year                       domain            genre
# 1                              The Accountant  2016 vodmovies112.blogspot.com.es            Crime
# 2 Miss Peregrine's Home for Peculiar Children  2016 vodmovies112.blogspot.com.es   Fantasy Sci-Fi
# 3     Fantastic Beasts And Where To Find Them  2016                  openload.co        Adventure
# 4                                     Ben-Hur  2016 vodmovies112.blogspot.com.es Action Adventure
# 5                                 The Remains  2016                  openload.co           Horror
# 6                               Suicide Squad  2016                  openload.co           Action

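Basically, we split the vector ex into 4 parts, delimited by the parentheses. We then build the data frame df from these 4 columns. The hardest part is extracting the genres correctly (since a movie can have several). I did it with a combination of sapply, lapply and grep. Once that is done, we "correct" the genre column.

Here is your data: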
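As a rough alternative sketch of the genre-matching step only (not the answerer's code), the nested sapply/grep can be replaced by a single alternation pattern with gregexpr and regmatches from base R. The names pattern, middle, found and genre are illustrative, and the example assumes genre names are case-sensitive and appear exactly as spelled in genres.

```r
ex <- c("The Accountant (2016)Crime (vodmovies112.blogspot.com.es)",
        "Miss Peregrine's Home for Peculiar Children (2016)FantasySci-Fi (vodmovies112.blogspot.com.es)",
        "Fantastic Beasts And Where To Find Them (2016) TSAdventure (openload.co)",
        "Ben-Hur (2016) HDActionAdventure (vodmovies112.blogspot.com.es)",
        "The Remains (2016) 1080p BlurayHorror (openload.co)",
        "Suicide Squad (2016) HDAction (openload.co)")

genres <- c("Action", "Adventure", "Animation", "Biography", "Comedy",
            "Crime", "Documentary", "Drama", "Family", "Fantasy", "Film-Noir",
            "History", "Horror", "Music", "Musical", "Mystery", "Romance",
            "Sci-Fi", "Sport", "Thriller", "War", "Western")

# One alternation of all genre names; longest names go first so that
# "Musical" is tried before "Music"
pattern <- paste(genres[order(-nchar(genres))], collapse = "|")

# Third piece of each string: the text between the year and the domain
middle <- sapply(strsplit(ex, "\\(|\\)"), `[`, 3)

# All genre names found in that piece, joined with a space
found <- regmatches(middle, gregexpr(pattern, middle))
genre <- sapply(found, paste, collapse = " ")

genre
# [1] "Crime"            "Fantasy Sci-Fi"   "Adventure"        "Action Adventure"
# [5] "Horror"           "Action"
```

Because only the listed genre names are matched, leading tags such as "TS", "HD" or "1080p Bluray" are dropped without any special handling.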

ex <- c("The Accountant (2016)Crime (vodmovies112.blogspot.com.es)", 
"Miss Peregrine's Home for Peculiar Children (2016)FantasySci-Fi (vodmovies112.blogspot.com.es)", 
"Fantastic Beasts And Where To Find Them (2016) TSAdventure (openload.co)", 
"Ben-Hur (2016) HDActionAdventure (vodmovies112.blogspot.com.es)", 
"The Remains (2016) 1080p BlurayHorror (openload.co)", "Suicide Squad (2016) HDAction (openload.co)"
)

genres <- c("Action", "Adventure", "Animation", "Biography", "Comedy", 
"Crime", "Documentary", "Drama", "Family", "Fantasy", "Film-Noir", 
"History", "Horror", "Music", "Musical", "Mystery", "Romance", 
"Sci-Fi", "Sport", "Thriller", "War", "Western")

- Tyler Rinker · Answer 2

Another possibility, using the tidyverse:

library(tidyverse)

# tibble() replaces the deprecated data_frame()
tibble(x = ex) %>%
    extract(
        x,
        c("title", "year", "genre", "domain"), 
        "(^[^(]+)\\s+\\((\\d{4})\\)\\s*([^(]+)\\s+\\(([^)]+)"
    )

##                                         title  year              genre                       domain
## *                                       <chr> <chr>              <chr>                        <chr>
## 1                              The Accountant  2016              Crime vodmovies112.blogspot.com.es
## 2 Miss Peregrine's Home for Peculiar Children  2016      FantasySci-Fi vodmovies112.blogspot.com.es
## 3     Fantastic Beasts And Where To Find Them  2016        TSAdventure                  openload.co
## 4                                     Ben-Hur  2016  HDActionAdventure vodmovies112.blogspot.com.es
## 5                                 The Remains  2016 1080p BlurayHorror                  openload.co
## 6                               Suicide Squad  2016           HDAction                  openload.co
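For readers without the tidyverse installed, roughly the same capture-group extraction can be sketched in base R with regexec and regmatches. This is an illustrative equivalent, not part of the original answer: the names pattern, m, parts and df are made up for the example, a closing \\) was added to the pattern, and perl = TRUE is assumed to be available (R 3.3.0 or later). The four groups are captured in the order title, year, genre text, domain.

```r
ex <- c("The Accountant (2016)Crime (vodmovies112.blogspot.com.es)",
        "Miss Peregrine's Home for Peculiar Children (2016)FantasySci-Fi (vodmovies112.blogspot.com.es)",
        "Fantastic Beasts And Where To Find Them (2016) TSAdventure (openload.co)",
        "Ben-Hur (2016) HDActionAdventure (vodmovies112.blogspot.com.es)",
        "The Remains (2016) 1080p BlurayHorror (openload.co)",
        "Suicide Squad (2016) HDAction (openload.co)")

# Four capture groups: title, year, genre text, domain
pattern <- "^([^(]+)\\s+\\((\\d{4})\\)\\s*([^(]+)\\s+\\(([^)]+)\\)"

# regexec() returns the full match plus one entry per capture group
m <- regmatches(ex, regexec(pattern, ex, perl = TRUE))

# Drop the full match and stack the groups into a character matrix
parts <- do.call(rbind, lapply(m, `[`, -1))

df <- setNames(as.data.frame(parts, stringsAsFactors = FALSE),
               c("title", "year", "genre", "domain"))
df
```

As with the extract() call, the genre column still carries prefixes such as "TS" or "HD", so a further matching step against the list of known genres is needed to clean it.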