Create an R data frame column from the difference between two character columns

4

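I have a data frame, df, with two columns: one is the song title and the other is the title and artist combined. I want to create a separate artist field. The first three rows are shown here: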

title                               titleArtist
I'll Never Smile Again  I'll Never Smile Again TOMMY DORSEY & HIS ORCHESTRA / FRANK SINATRA & PIED PIPERS
Imagination         Imagination GLENN MILLER & HIS ORCHESTRA / RAY EBERLE
The Breeze And I    The Breeze And I JIMMY DORSEY & HIS ORCHESTRA / BOB EBERLY

This code works without problems on these rows:
library(stringr)
library(dplyr)

 df %>% 
 head(3) %>% 
 mutate(artist=str_to_title(str_trim(str_replace(titleArtist,title,"")))) %>% 
 select(artist,title)

                                                      artist                  title
1 Tommy Dorsey & His Orchestra / Frank Sinatra & Pied Pipers I'll Never Smile Again
2                  Jimmy Dorsey & His Orchestra / Bob Eberly       The Breeze And I
3                  Glenn Miller & His Orchestra / Ray Eberle            Imagination

But when I apply it to the full set of several thousand rows, I get an error.

Error: Incorrectly nested parentheses in regexp pattern. (U_REGEX_MISMATCHED_PAREN)

#or for part of the mutation

df$artist <-str_replace(df$titleArtist,df$title,"")

Error in stri_replace_first_regex(string, pattern, replacement, opts_regex =    attr(pattern,  : 
 Incorrectly nested parentheses in regexp pattern. (U_REGEX_MISMATCHED_PAREN)

I removed all parentheses from the columns, and the code appears to run for a while before failing with a different error.

Error: Syntax error in regexp pattern. (U_REGEX_RULE_SYNTAX)

Are there other special characters that could be causing the problem, or could it be something else?

Thanks!


1
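Does traceback() give any information about what triggers the error? - dww
Does using gsub or sub throw the same error as str_replace? I see you have / in titleArtist; could it also appear in title? It is really hard to analyse this without access to the data. - dww
Check for empty titles and/or artists. You may need ifelse() - Parfait
Thanks for the suggestions. traceback() did not give anything meaningful, such as the row number of the first error. Some titles also contain "/" (it appears when a record has two A-sides). I managed to replace it with "&" but still got the same error, though I can't tell whether that is related to the "&" or to something else. Apart from "(" and "/", is there a list of forbidden characters that could cause this? - pssguy
@dww. I have uploaded the data to a Google Sheet https://docs.google.com/spreadsheets/d/1xHbRE77HrHYIlj4dChuOZz45ZPjuMKngqQUnobL0cwY/edit#gid=1828378253 - pssguy
2 Answers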

3
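Your general problem is that str_replace treats your title values (the pattern argument) as regular expressions, so special characters other than parentheses can trigger many more errors. The stringi library, which stringr wraps and simplifies, allows finer control, including treating arguments as fixed strings rather than regular expressions. I don't have your original data, but this works when I add some characters that would otherwise cause errors: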
library(dplyr)
library(stringi)


# tibble() replaces the deprecated data_frame(); dplyr re-exports it
df <- tibble(title = c("I'll Never Smile Again (", "Imagination.*", "The Breeze And I(?>="),
             titleArtist = c("I'll Never Smile Again ( TOMMY DORSEY & HIS ORCHESTRA / FRANK SINATRA & PIED PIPERS",
                             "Imagination.* GLENN MILLER & HIS ORCHESTRA / RAY EBERLE",
                             "The Breeze And I(?>= JIMMY DORSEY & HIS ORCHESTRA / BOB EBERLY"))

df %>%
  mutate(artist=stri_trans_totitle(stri_trim(stri_replace_first_fixed(titleArtist,title,"")))) %>% 
  select(artist,title)

Result:

Source: local data frame [3 x 2]

artist                     title
(chr)                     (chr)
1 Tommy Dorsey & His Orchestra / Frank Sinatra & Pied Pipers I'll Never Smile Again (
2                  Glenn Miller & His Orchestra / Ray Eberle             Imagination.*
3                  Jimmy Dorsey & His Orchestra / Bob Eberly      The Breeze And I(?>=
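To see which rows are likely to trip the regex engine, one option is to flag titles that contain any of the usual regex metacharacters (. \ | ( ) [ ] { } ^ $ * + ?). The following is only a sketch in base R with made-up titles; meta_pattern is a name introduced here for illustration. Note that some of these characters, such as ".", do not raise an error but can silently match something other than the literal title.

```r
# Hypothetical sample titles, not taken from the asker's data
titles <- c("Imagination", "Song (Part 1)", "Who?", "The Breeze And I")

# Character class listing the common regex metacharacters (PCRE syntax)
meta_pattern <- "[\\[\\]\\\\^$.|?*+(){}]"

# TRUE where a title would be risky to use directly as a regex pattern
has_meta <- grepl(meta_pattern, titles, perl = TRUE)

# Inspect the offending titles
titles[has_meta]
```

Running the same check on df$title should show how many rows are affected before switching to fixed-string matching.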

1
Note that stringr::str_replace(titleArtist, fixed(title), "") is equivalent to stringi::stri_replace_first_fixed(titleArtist, title, "") - Noam Ross
That seems to work well. Thanks for the solution and the explanation. - pssguy
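The fixed() variant from the comment above can be sketched as follows, keeping the asker's original stringr pipeline and only wrapping the pattern. The two rows are invented for illustration and deliberately contain characters that would otherwise be read as regex syntax.

```r
library(stringr)

# Hypothetical rows; the titles contain regex metacharacters on purpose
title       <- c("Song (Part 1)", "Who?")
titleArtist <- c("Song (Part 1) SOME BAND", "Who? ANOTHER ORCHESTRA")

# fixed() makes str_replace match each title literally instead of as a regex
artist <- str_to_title(str_trim(str_replace(titleArtist, fixed(title), "")))
artist
```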

0
 df <- data.frame(ID=11:13, T_A=c('a/b','b/c','x/y'))  # T_A Title/Artist 
   ID T_A
 1 11 a/b
 2 12 b/c
 3 13 x/y

 # Title Artist are separated by /
 > within(df, T_A<-data.frame(do.call('rbind', strsplit(as.character(T_A), '/', fixed=TRUE))))
  ID T_A.X1 T_A.X2
 1 11      a      b
 2 12      b      c
 3 13      x      y

Thanks, but I'm not trying to split the column on "/". For the first row, I'm trying to split the titleArtist column into "I'll Never Smile Again" and "TOMMY DORSEY & HIS ORCHESTRA / FRANK SINATRA". - pssguy
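Since the title always appears at the start of titleArtist in the rows shown, the split described in this comment can also be done without any pattern matching by cutting off the first nchar(title) characters. This is a different technique from the accepted answer, sketched in base R under that prefix assumption; strip_title is a hypothetical helper name, and rows where the title is not a prefix are returned unchanged (apart from trimming).

```r
# Hypothetical helper: drop a leading title from the combined string
strip_title <- function(title_artist, title) {
  # Literal prefix test, so regex metacharacters in the title are harmless
  has_prefix <- startsWith(title_artist, title)
  rest <- ifelse(has_prefix,
                 substring(title_artist, nchar(title) + 1L),
                 title_artist)
  trimws(rest)
}

first_row <- strip_title(
  "I'll Never Smile Again TOMMY DORSEY & HIS ORCHESTRA / FRANK SINATRA & PIED PIPERS",
  "I'll Never Smile Again"
)
first_row
```

Wrapping the result in tools::toTitleCase() or stringr::str_to_title() would then give the mixed-case artist names if needed.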
