将列中以逗号分隔的字符串拆分为单独的行

153

我有一个数据框,像这样:

data.frame(director = c("Aaron Blaise,Bob Walker", "Akira Kurosawa", 
                        "Alan J. Pakula", "Alan Parker", "Alejandro Amenabar", "Alejandro Gonzalez Inarritu", 
                        "Alejandro Gonzalez Inarritu,Benicio Del Toro", "Alejandro González Iñárritu", 
                        "Alex Proyas", "Alexander Hall", "Alfonso Cuaron", "Alfred Hitchcock", 
                        "Anatole Litvak", "Andrew Adamson,Marilyn Fox", "Andrew Dominik", 
                        "Andrew Stanton", "Andrew Stanton,Lee Unkrich", "Angelina Jolie,John Stevenson", 
                        "Anne Fontaine", "Anthony Harvey"), AB = c('A', 'B', 'A', 'A', 'B', 'B', 'B', 'A', 'B', 'A', 'B', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'A'))

正如您所看到的,director列中的一些条目是由逗号分隔的多个名称。我想将这些条目拆分成单独的行,同时保留其他列的值。例如,上面数据帧中的第一行应该被拆分成两行,在director列中分别有一个名称,AB列中都为'A'。


3
问得很明显:这些数据您应该在互联网上发布吗? - Ricardo Saporta
2
它们“并非全部都是B级电影”。看起来很无害。 - Matthew Lundberg
29
所有这些人都是奥斯卡提名者,我不认为这是什么秘密 =) - RoyalTS
6个回答

147

有几种选择:

1)两种使用 的方法:

library(data.table)
# method 1 (preferred)
setDT(v)[, lapply(.SD, function(x) unlist(tstrsplit(x, ",", fixed=TRUE))), by = AB
         ][!is.na(director)]
# method 2
setDT(v)[, strsplit(as.character(director), ",", fixed=TRUE), by = .(AB, director)
         ][,.(director = V1, AB)]

2) 一个 / 组合:

library(dplyr)
library(tidyr)
v %>% 
  mutate(director = strsplit(as.character(director), ",")) %>%
  unnest(director)

3) 仅使用: 使用 tidyr 0.5.0 (及更高版本),您也可以直接使用separate_rows

separate_rows(v, director, sep = ",")

您可以使用convert = TRUE参数自动将数字转换为数字列。

tidyr_1.3.0(及更高版本)开始,您可以使用separate_longer_delim(而separate_rows现已过时):

separate_longer_delim(v, director, delim = ",")

4) 使用基本的 R 语言:

# if 'director' is a character-column:
stack(setNames(strsplit(df$director,','), df$AB))

# if 'director' is a factor-column:
stack(setNames(strsplit(as.character(df$director),','), df$AB))

有没有一种方法可以同时对多列进行操作?例如,每个包含由“;”分隔的字符串的3列,每列具有相同数量的字符串。即 data.table(id= "X21", a = "chr1;chr1;chr1", b="123;133;134",c="234;254;268") 变成 data.table(id = c("X21","X21",X21"), a=c("chr1","chr1","chr1"), b=c("123","133","134"), c=c("234","254","268")) - Reilstein
1
哇,我刚意识到它已经可以同时处理多列了 - 这太棒了! - Reilstein
1
@Moon_Watcher 上面的答案中的方法1已经适用于多列,这让我感到惊讶。setDT(dt)[,lapply(.SD,function(x)unlist(tstrsplit(x,“;”,fixed = TRUE))),by = ID]是对我有用的。 - Reilstein
@MarkE。在添加/更新/删除列时,您可以使用 := 赋值运算符。请参见此处以获取更详细的说明 - Jaap
2
FYI,“separate_rows”已被“separate_longer_delim”取代。 - Maël
显示剩余4条评论

100

这个老问题经常被用作重复目标(标记为r-faq)。截至今天,它已经回答了三次,并提供了6种不同的方法,但缺乏基准作为指导哪种方法最快1

基准解决方案包括:

总共对6个不同大小的数据框使用microbenchmark包测试了8种不同的方法(请参见下面的代码)。

OP提供的示例数据仅包含20行。为了创建更大的数据框,这20行仅重复1、10、100、1000、10000和100000次,从而得到高达200万行的问题大小。

基准结果

enter image description here

基准结果表明,对于足够大的数据框,所有data.table方法都比任何其他方法更快。对于超过约5000行的数据框,Jaap的data.table方法2和变体DT3是最快的,比最慢的方法快好几个数量级。

值得注意的是,两种tidyverse方法和splistackshape解决方案的时间非常相似,很难在图表中区分曲线。它们是所有数据框大小的基准方法中最慢的。

对于较小的数据框,Matt的基本R解决方案和data.table方法4似乎具有较少的开销。

代码

director <- 
  c("Aaron Blaise,Bob Walker", "Akira Kurosawa", "Alan J. Pakula", 
    "Alan Parker", "Alejandro Amenabar", "Alejandro Gonzalez Inarritu", 
    "Alejandro Gonzalez Inarritu,Benicio Del Toro", "Alejandro González Iñárritu", 
    "Alex Proyas", "Alexander Hall", "Alfonso Cuaron", "Alfred Hitchcock", 
    "Anatole Litvak", "Andrew Adamson,Marilyn Fox", "Andrew Dominik", 
    "Andrew Stanton", "Andrew Stanton,Lee Unkrich", "Angelina Jolie,John Stevenson", 
    "Anne Fontaine", "Anthony Harvey")
AB <- c("A", "B", "A", "A", "B", "B", "B", "A", "B", "A", "B", "A", 
        "A", "B", "B", "B", "B", "B", "B", "A")

library(data.table)
library(magrittr)

定义函数以对问题大小n进行基准测试运行

run_mb <- function(n) {
  # compute number of benchmark runs depending on problem size `n`
  mb_times <- scales::squish(10000L / n , c(3L, 100L)) 
  cat(n, " ", mb_times, "\n")
  # create data
  DF <- data.frame(director = rep(director, n), AB = rep(AB, n))
  DT <- as.data.table(DF)
  # start benchmarks
  microbenchmark::microbenchmark(
    matt_mod = {
      s <- strsplit(as.character(DF$director), ',')
      data.frame(director=unlist(s), AB=rep(DF$AB, lengths(s)))},
    jaap_DT1 = {
      DT[, lapply(.SD, function(x) unlist(tstrsplit(x, ",", fixed=TRUE))), by = AB
         ][!is.na(director)]},
    jaap_DT2 = {
      DT[, strsplit(as.character(director), ",", fixed=TRUE), 
         by = .(AB, director)][,.(director = V1, AB)]},
    jaap_dplyr = {
      DF %>% 
        dplyr::mutate(director = strsplit(as.character(director), ",")) %>%
        tidyr::unnest(director)},
    jaap_tidyr = {
      tidyr::separate_rows(DF, director, sep = ",")},
    cSplit = {
      splitstackshape::cSplit(DF, "director", ",", direction = "long")},
    DT3 = {
      DT[, strsplit(as.character(director), ",", fixed=TRUE),
         by = .(AB, director)][, director := NULL][
           , setnames(.SD, "V1", "director")]},
    DT4 = {
      DT[, .(director = unlist(strsplit(as.character(director), ",", fixed = TRUE))), 
         by = .(AB)]},
    times = mb_times
  )
}

运行不同问题规模的基准测试

# define vector of problem sizes
n_rep <- 10L^(0:5)
# run benchmark for different problem sizes
mb <- lapply(n_rep, run_mb)

为绘图准备数据

mbl <- rbindlist(mb, idcol = "N")
mbl[, n_row := NROW(director) * n_rep[N]]
mba <- mbl[, .(median_time = median(time), N = .N), by = .(n_row, expr)]
mba[, expr := forcats::fct_reorder(expr, -median_time)]

创建图表

library(ggplot2)
ggplot(mba, aes(n_row, median_time*1e-6, group = expr, colour = expr)) + 
  geom_point() + geom_smooth(se = FALSE) + 
  scale_x_log10(breaks = NROW(director) * n_rep) + scale_y_log10() + 
  xlab("number of rows") + ylab("median of execution time [ms]") +
  ggtitle("microbenchmark results") + theme_bw()

会话信息和程序版本(摘要)

devtools::session_info()
#Session info
# version  R version 3.3.2 (2016-10-31)
# system   x86_64, mingw32
#Packages
# data.table      * 1.10.4  2017-02-01 CRAN (R 3.3.2)
# dplyr             0.5.0   2016-06-24 CRAN (R 3.3.1)
# forcats           0.2.0   2017-01-23 CRAN (R 3.3.2)
# ggplot2         * 2.2.1   2016-12-30 CRAN (R 3.3.2)
# magrittr        * 1.5     2014-11-22 CRAN (R 3.3.0)
# microbenchmark    1.4-2.1 2015-11-25 CRAN (R 3.3.3)
# scales            0.4.1   2016-11-09 CRAN (R 3.3.2)
# splitstackshape   1.4.2   2014-10-23 CRAN (R 3.3.3)
# tidyr             0.6.1   2017-01-10 CRAN (R 3.3.2)

1这篇热情洋溢的评论引起了我的好奇心。太棒了!快了几个数量级!它是对一个问题tidyverse答案,而这个问题被认为是重复的。


不错!看起来cSplit和separate_rows还有提升的空间(它们是专门设计用于此目的的)。顺便说一下,cSplit还带有一个fixed=参数,并且是基于data.table的包,所以最好给它DT而不是DF。另外,我认为将因子转换为字符不应该出现在基准测试中(因为它本来就应该是字符)。我检查了一下,这些更改对结果没有任何定性影响。 - Frank
1
@Frank 感谢您的建议,以改进基准测试,并检查结果的影响。等到 data.tabledplyr 等版本发布后进行更新时,我们会采纳这些建议。 - Uwe
我认为这些方法并不可比较,至少在某些情况下是如此,因为datatable方法仅生成具有“选定”列的表格,而dplyr生成包含所有列的结果(包括未参与分析的列,而无需在函数中编写它们的名称)。 - Ferroao
6
@Ferroao 这是错误的,data.tables方法会就地修改“表格”,所有列都被保留,如果您不在原地修改,则只会得到所要求的筛选副本。简而言之,data.table的方法不是生成一个结果数据集,而是更新数据集,这才是data.table和dplyr之间真正的区别。 - Tensibai
1
非常好的比较!也许你可以在使用 strsplit 的时候加入 matt_modjaap_dplyr,并且设置 fixed=TRUE。因为其他人都这么做了,这会对时间产生影响。自从 R 4.0.0 开始,默认情况下创建 data.frame 时,stringsAsFactors = FALSE,所以可以省略 as.character - GKi

52

将你的原始数据框命名为v,我们有以下内容:

> s <- strsplit(as.character(v$director), ',')
> data.frame(director=unlist(s), AB=rep(v$AB, sapply(s, FUN=length)))
                      director AB
1                 Aaron Blaise  A
2                   Bob Walker  A
3               Akira Kurosawa  B
4               Alan J. Pakula  A
5                  Alan Parker  A
6           Alejandro Amenabar  B
7  Alejandro Gonzalez Inarritu  B
8  Alejandro Gonzalez Inarritu  B
9             Benicio Del Toro  B
10 Alejandro González Iñárritu  A
11                 Alex Proyas  B
12              Alexander Hall  A
13              Alfonso Cuaron  B
14            Alfred Hitchcock  A
15              Anatole Litvak  A
16              Andrew Adamson  B
17                 Marilyn Fox  B
18              Andrew Dominik  B
19              Andrew Stanton  B
20              Andrew Stanton  B
21                 Lee Unkrich  B
22              Angelina Jolie  B
23              John Stevenson  B
24               Anne Fontaine  B
25              Anthony Harvey  A

请注意使用rep来构建新的AB列。在这里,sapply返回每个原始行中名称的数量。


1
我在想,相对于更加晦涩难懂的vapply函数,AB=rep(v$AB, unlist(sapply(s, FUN=length )))是否更容易理解?在这里,有没有什么地方需要用到vapply函数才更加适合? - IRTFM
8
现今可以用lengths(s)替代sapply(s, length) - Rich Scriven

32

虽然来晚了,但是一个常见的替代方法是使用我的"splitstackshape"包中的cSplit函数,它有一个direction参数。将其设置为"long"即可获得您指定的结果:

library(splitstackshape)
head(cSplit(mydf, "director", ",", direction = "long"))
#              director AB
# 1:       Aaron Blaise  A
# 2:         Bob Walker  A
# 3:     Akira Kurosawa  B
# 4:     Alan J. Pakula  A
# 5:        Alan Parker  A
# 6: Alejandro Amenabar  B

2
devtools::install_github("yikeshu0611/onetree")

library(onetree)

dd=spread_byonecolumn(data=mydata,bycolumn="director",joint=",")

head(dd)
            director AB
1       Aaron Blaise  A
2         Bob Walker  A
3     Akira Kurosawa  B
4     Alan J. Pakula  A
5        Alan Parker  A
6 Alejandro Amenabar  B

2
另一个基准测试使用来自basestrsplit得出结果,目前可以推荐用它来将列中的逗号分隔字符串拆分成单独的行,因为在各种规模下它是最快的:
s <- strsplit(v$director, ",", fixed=TRUE)
s <- data.frame(director=unlist(s), AB=rep(v$AB, lengths(s)))

请注意,使用fixed=TRUE会对时间产生重大影响。

Curves showing computation time over number of rows

对比方法:
met <- alist(base = {s <- strsplit(v$director, ",") #Matthew Lundberg
   s <- data.frame(director=unlist(s), AB=rep(v$AB, sapply(s, FUN=length)))}
 , baseLength = {s <- strsplit(v$director, ",") #Rich Scriven
   s <- data.frame(director=unlist(s), AB=rep(v$AB, lengths(s)))}
 , baseLeFix = {s <- strsplit(v$director, ",", fixed=TRUE)
   s <- data.frame(director=unlist(s), AB=rep(v$AB, lengths(s)))}
 , cSplit = s <- cSplit(v, "director", ",", direction = "long") #A5C1D2H2I1M1N2O1R2T1
 , dt = s <- setDT(v)[, lapply(.SD, function(x) unlist(tstrsplit(x, "," #Jaap
   , fixed=TRUE))), by = AB][!is.na(director)]
#, dt2 = s <- setDT(v)[, strsplit(director, "," #Jaap #Only Unique
#  , fixed=TRUE), by = .(AB, director)][,.(director = V1, AB)]
 , dplyr = {s <- v %>%  #Jaap
    mutate(director = strsplit(director, ",", fixed=TRUE)) %>%
    unnest(director)}
 , tidyr = s <- separate_rows(v, director, sep = ",") #Jaap
 , stack = s <- stack(setNames(strsplit(v$director, ",", fixed=TRUE), v$AB)) #Jaap
#, dt3 = {s <- setDT(v)[, strsplit(director, ",", fixed=TRUE), #Uwe #Only Unique
#  by = .(AB, director)][, director := NULL][, setnames(.SD, "V1", "director")]}
 , dt4 = {s <- setDT(v)[, .(director = unlist(strsplit(director, "," #Uwe
   , fixed = TRUE))), by = .(AB)]}
 , dt5 = {s <- vT[, .(director = unlist(strsplit(director, "," #Uwe
   , fixed = TRUE))), by = .(AB)]}
   )

图书馆:
library(microbenchmark)
library(splitstackshape) #cSplit
library(data.table) #dt, dt2, dt3, dt4
#setDTthreads(1) #Looks like it has here minor effect
library(dplyr) #dplyr
library(tidyr) #dplyr, tidyr

数据:
v0 <- data.frame(director = c("Aaron Blaise,Bob Walker", "Akira Kurosawa", 
                        "Alan J. Pakula", "Alan Parker", "Alejandro Amenabar", "Alejandro Gonzalez Inarritu", 
                        "Alejandro Gonzalez Inarritu,Benicio Del Toro", "Alejandro González Iñárritu", 
                        "Alex Proyas", "Alexander Hall", "Alfonso Cuaron", "Alfred Hitchcock", 
                        "Anatole Litvak", "Andrew Adamson,Marilyn Fox", "Andrew Dominik", 
                        "Andrew Stanton", "Andrew Stanton,Lee Unkrich", "Angelina Jolie,John Stevenson", 
                        "Anne Fontaine", "Anthony Harvey"), AB = c('A', 'B', 'A', 'A', 'B', 'B', 'B', 'A', 'B', 'A', 'B', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'A'))

计算和定时结果:
n <- 10^(0:5)
x <- lapply(n, function(n) {v <- v0[rep(seq_len(nrow(v0)), n),]
  vT <- setDT(v)
  ti <- min(100, max(3, 1e4/n))
  microbenchmark(list = met, times = ti, control=list(order="block"))})

y <- do.call(cbind, lapply(x, function(y) aggregate(time ~ expr, y, median)))
y <- cbind(y[1], y[-1][c(TRUE, FALSE)])
y[-1] <- y[-1] / 1e6 #ms
names(y)[-1] <- paste("n:", n * nrow(v0))
y #Time in ms
#         expr     n: 20    n: 200    n: 2000   n: 20000   n: 2e+05   n: 2e+06
#1        base 0.2989945 0.6002820  4.8751170  46.270246  455.89578  4508.1646
#2  baseLength 0.2754675 0.5278900  3.8066300  37.131410  442.96475  3066.8275
#3   baseLeFix 0.2160340 0.2424550  0.6674545   4.745179   52.11997   555.8610
#4      cSplit 1.7350820 2.5329525 11.6978975  99.060448 1053.53698 11338.9942
#5          dt 0.7777790 0.8420540  1.6112620   8.724586  114.22840  1037.9405
#6       dplyr 6.2425970 7.9942780 35.1920280 334.924354 4589.99796 38187.5967
#7       tidyr 4.0323765 4.5933730 14.7568235 119.790239 1294.26959 11764.1592
#8       stack 0.2931135 0.4672095  2.2264155  22.426373  289.44488  2145.8174
#9         dt4 0.5822910 0.6414900  1.2214470   6.816942   70.20041   787.9639
#10        dt5 0.5015235 0.5621240  1.1329110   6.625901   82.80803   636.1899

注意,像这样的方法
(v <- rbind(v0[1:2,], v0[1,]))
#                 director AB
#1 Aaron Blaise,Bob Walker  A
#2          Akira Kurosawa  B
#3 Aaron Blaise,Bob Walker  A

setDT(v)[, strsplit(director, "," #Jaap #Only Unique
  , fixed=TRUE), by = .(AB, director)][,.(director = V1, AB)]
#         director AB
#1:   Aaron Blaise  A
#2:     Bob Walker  A
#3: Akira Kurosawa  B

返回一个针对唯一的“导演”进行 strsplit 的结果,并可能可比较。
tmp <- unique(v)
s <- strsplit(tmp$director, ",", fixed=TRUE)
s <- data.frame(director=unlist(s), AB=rep(tmp$AB, lengths(s)))

但据我理解,这并没有被要求。

网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接