Creating an R data frame column from the difference between two character columns

Question

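I have a data frame, df, with two columns: one holds the song title, the other holds the title and artist combined. I would like to create a separate artist field. The first three rows are shown here: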

title                               titleArtist
I'll Never Smile Again  I'll Never Smile Again TOMMY DORSEY & HIS ORCHESTRA / FRANK SINATRA & PIED PIPERS
Imagination         Imagination GLENN MILLER & HIS ORCHESTRA / RAY EBERLE
The Breeze And I    The Breeze And I JIMMY DORSEY & HIS ORCHESTRA / BOB EBERLY

With this data, the following code works without problems.

library(stringr)
library(dplyr)

 df %>% 
 head(3) %>% 
 mutate(artist=str_to_title(str_trim(str_replace(titleArtist,title,"")))) %>% 
 select(artist,title)

 artist                                                         title
1 Tommy Dorsey & His Orchestra / Frank Sinatra & Pied Pipers I'll Never Smile Again
2                  Jimmy Dorsey & His Orchestra / Bob Eberly       The Breeze And I
 3                  Glenn Miller & His Orchestra / Ray Eberle            Imagination

But when I apply it to all of the several thousand rows, I get an error.

Error: Incorrectly nested parentheses in regexp pattern. (U_REGEX_MISMATCHED_PAREN)

# or, for just the replacement step of the mutate

df$artist <-str_replace(df$titleArtist,df$title,"")

Error in stri_replace_first_regex(string, pattern, replacement, opts_regex =    attr(pattern,  : 
 Incorrectly nested parentheses in regexp pattern. (U_REGEX_MISMATCHED_PAREN)

I removed all parentheses from the columns, and the code seemed to work for a while before failing with another error.

Error: Syntax error in regexp pattern. (U_REGEX_RULE_SYNTAX)

Are there other special characters that could be causing the problem, or could it be something else?

Thanks!

- pssguy

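Does traceback() give any information about what triggers the error? - dww

Do gsub or sub throw the same error as str_replace? I see you have / in titleArtist; could it also appear in title? It is really hard to analyze this without access to the data. - dww

Check for empty titles and/or artists. You may need to use ifelse(). - Parfait

Thanks for the suggestions. traceback() gave nothing meaningful, such as the row number of the first error. There are also "/" characters in the title (they appear when a record has two A-sides). I managed to replace them with "&" but still get the same error, though I can't tell whether it is related to the "&" or to something else. Is there a list of forbidden characters, other than "(" and "/", that could cause this? - pssguy

@dww. I have uploaded the data to a Google Sheet: https://docs.google.com/spreadsheets/d/1xHbRE77HrHYIlj4dChuOZz45ZPjuMKngqQUnobL0cwY/edit#gid=1828378253 - pssguy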

2 Answers

 df <- data.frame(ID=11:13, T_A=c('a/b','b/c','x/y'))  # T_A Title/Artist 
   ID T_A
 1 11 a/b
 2 12 b/c
 3 13 x/y

 # Title Artist are separated by /
 > within(df, T_A<-data.frame(do.call('rbind', strsplit(as.character(T_A), '/', fixed=TRUE))))
  ID T_A.X1 T_A.X2
 1 11      a      b
 2 12      b      c
 3 13      x      y

- Sowmya S. Manian

Thanks, but I'm not trying to split the column on "/". For the first row, I am trying to split the titleArtist column into "I'll Never Smile Again" and "TOMMY DORSEY & HIS ORCHESTRA / FRANK SINATRA". - pssguy

- Noam Ross · Accepted Answer

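Your general issue is that str_replace treats the title values you pass as the pattern as regular expressions, so any regex special character, not just parentheses, can cause an error. The stringi library, which stringr wraps and simplifies, allows finer control, including treating arguments as fixed strings rather than regular expressions. I don't have your original data, but this works when I add some characters that would otherwise cause errors: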

library(dplyr)
library(stringi)


df = data_frame(title = c("I'll Never Smile Again (",  "Imagination.*", "The Breeze And I(?>="),
           titleArtist = c("I'll Never Smile Again ( TOMMY DORSEY & HIS ORCHESTRA / FRANK SINATRA & PIED PIPERS",
                            "Imagination.* GLENN MILLER & HIS ORCHESTRA / RAY EBERLE",
                            "The Breeze And I(?>= JIMMY DORSEY & HIS ORCHESTRA / BOB EBERLY"))

df %>%
  mutate(artist=stri_trans_totitle(stri_trim(stri_replace_first_fixed(titleArtist,title,"")))) %>% 
  select(artist,title)

Result:

Source: local data frame [3 x 2]

artist                     title
(chr)                     (chr)
1 Tommy Dorsey & His Orchestra / Frank Sinatra & Pied Pipers I'll Never Smile Again (
2                  Glenn Miller & His Orchestra / Ray Eberle             Imagination.*
3                  Jimmy Dorsey & His Orchestra / Bob Eberly      The Breeze And I(?>=
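
For anyone who would rather stay within stringr, the same literal matching should be available by wrapping the pattern in fixed(). The sketch below is not from the original answer: the two-row data frame is made-up sample data modelled on the rows above, and the artist_base column is a hypothetical base R alternative that assumes every titleArtist value starts with its title.

```r
library(stringr)

# Hypothetical sample data: titles deliberately contain regex metacharacters
df <- data.frame(
  title = c("I'll Never Smile Again (", "Imagination.*"),
  titleArtist = c("I'll Never Smile Again ( TOMMY DORSEY & HIS ORCHESTRA",
                  "Imagination.* GLENN MILLER & HIS ORCHESTRA / RAY EBERLE"),
  stringsAsFactors = FALSE
)

# fixed() asks stringr to match the pattern literally rather than
# compiling each title as a regular expression
df$artist <- str_to_title(
  str_trim(str_replace(df$titleArtist, fixed(df$title), ""))
)

# Base R alternative (assumption: titleArtist always begins with title):
# drop the first nchar(title) characters, then trim the leftover space
df$artist_base <- trimws(substring(df$titleArtist, nchar(df$title) + 1))

print(df[, c("artist", "artist_base")])
```

The fixed() version removes the first literal occurrence of the title wherever it appears, while the substring() version only makes sense when the title is a strict prefix, so it is worth checking that assumption (for example with startsWith(df$titleArtist, df$title)) before relying on it.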