将一个字符串列拆分为多个虚拟变量

Question

将一个字符串列拆分为多个虚拟变量

11

作为R中data.table包相对不太熟练的用户，我一直在尝试将一个文本列处理成大量的指示器列（虚拟变量），其中每个列中的1表示在字符串列中找到了特定的子字符串。例如，我想处理这个：

ID     String  
1       a$b  
2       b$c  
3       c

转化为以下内容：

ID     String     a     b     c  
1       a$b       1     1     0  
2       b$c       0     1     1  
3        c        0     0     1

我已经想出如何处理数据，但它运行的时间比我想象中要长，而且我怀疑我的代码效率不高。下面是一个包含虚拟数据的可复现代码版本。请注意，在真实数据中，有2000多个子字符串需要搜索，每个子字符串大约有30个字符长，并且可能有数百万行数据。如果需要，我可以并行化处理并投入很多资源，但我想尽可能地优化代码。我已经尝试过运行Rprof，但没有发现明显的（对我来说）改进。

set.seed(10)  
elements_list <- c(outer(letters, letters, FUN = paste, sep = ""))  
random_string <- function(min_length, max_length, separator) {  
    selection <- paste(sample(elements_list, ceiling(runif(1, min_length, max_length))), collapse = separator)  
    return(selection)  
}  
dt <- data.table(id = c(1:1000), messy_string = "")  
dt[ , messy_string := random_string(2, 5, "$"), by = id]  
create_indicators <- function(search_list, searched_string) {  
    y <- rep(0, length(search_list))  
    for(j in 1:length(search_list)) {  
        x <- regexpr(search_list[j], searched_string)  
        x <- x[1]  
        y[j] <- ifelse(x > 0, 1, 0)  
    }  
    return(y)  
}  
timer <- proc.time()  
indicators <- matrix(0, nrow = nrow(dt), ncol = length(elements_list))  
for(n in 1:nrow(dt)) {  
    indicators[n, ] <- dt[n, create_indicators(elements_list, messy_string)]  
}  
indicators <- data.table(indicators)  
setnames(indicators, elements_list)  
dt <- cbind(dt, indicators)  
proc.time() - timer  

user  system elapsed 
13.17    0.08   13.29

编辑

感谢大家提供的出色回答--都比我使用的方法好得多。下面是一些速度测试的结果，每个函数都稍作修改，在我的代码中使用0L和1L，按方法将结果存储在单独的表中，并对顺序进行了标准化。这些是从单个速度测试中获得的经过时间（而不是从许多测试中获得的中位数），但较大的运行时间很长。

Number of rows in dt     2K      10K      50K     250K      1M   
OP                       28.6    149.2    717.0   
eddi                     5.1     24.6     144.8   1950.3  
RS                       1.8     6.7      29.7    171.9     702.5  
Original GT              1.4     7.4      57.5    809.4   
Modified GT              0.7     3.9      18.1    115.2     473.9  
GT4                      0.1     0.4      2.26    16.9      86.9

很明显，改进版的GeekTrader方法是最好的。我对每个步骤的具体作用还有点模糊，但我可以慢慢理解。虽然有些超出了原问题的范围，如果有人能更有效地解释GeekTrader和Ricardo Saporta的方法，那么我和未来访问此页面的其他人都会感激。我特别想了解为什么有些方法比其他方法更具可扩展性。

*****编辑＃2*****

我试图通过这条评论编辑GeekTrader的答案，但似乎不起作用。我对GT3函数进行了两个非常小的修改，a）对列进行排序，这会增加一些时间，b）将0和1替换为0L和1L，这会加快速度。将结果称为GT4函数。上面的表已编辑以添加GT4在不同表大小下的时间。显然是最佳选择，而且具有直观性的优势。

- user2262318

1

更新至版本3，速度更快，内存使用更高效。 - CHP

这是一个非常好的问题，有很棒的答案。在你的基准测试中，“Modified GT”是指GT3吗？如果是的话，当我通过将0和1更改为0L和1L来实现GT4时，我无法获得10倍的加速。 - mchangun

6个回答

4

  # split the `messy_string` and create a long table, keeping track of the id
  DT2 <- setkey(DT[, list(val=unlist(strsplit(messy_string, "\\$"))), by=list(ID, messy_string)], "val")

  # add the columns, initialize to 0
  DT2[, c(elements_list) := 0L]
  # warning expected, re:adding large ammount of columns


  # iterate over each value in element_list, assigning 1's ass appropriate
  for (el in elements_list)
     DT2[el, c(el) := 1L]

  # sum by ID
  DT2[, lapply(.SD, sum), by=list(ID, messy_string), .SDcols=elements_list]

请注意，我们将携带messy_string列，因为将其留下然后通过ID join 获取它比丢弃它便宜。如果您不需要在最终输出中使用它，请在上面删除它。

基准测试：

创建示例数据：

# sample data, using OP's exmple
set.seed(10)
N <- 1e6  # number of rows
elements_list <- c(outer(letters, letters, FUN = paste, sep = ""))  
messy_string_vec <- random_string_fast(N, 2, 5, "$")   # Create the messy strings in a single shot. 
masterDT <- data.table(ID = c(1:N), messy_string = messy_string_vec, key="ID")   # create the data.table

小贴士 一次性创建随机字符串并将结果作为单列分配要比调用函数N次并逐个分配每个字符串的速度快得多。

  # Faster way to create the `messy_string` 's
  random_string_fast <- function(N, min_length, max_length, separator) {  
    ints <- seq(from=min_length, to=max_length)
    replicate(N, paste(sample(elements_list, sample(ints)), collapse=separator))
  }

比较四种方法：

这个答案 -- "DT.RS"
@eddi的答案 -- "Plyr.eddi"
@GeekTrader的答案 -- DT.GT
GeekTrader的答案带有一些修改 -- DT.GT_Mod

以下是设置：

library(data.table); library(plyr); library(microbenchmark)

# data.table method - RS
usingDT.RS <- quote({DT <- copy(masterDT);
                    DT2 <- setkey(DT[, list(val=unlist(strsplit(messy_string, "\\$"))), by=list(ID, messy_string)], "val"); DT2[, c(elements_list) := 0L]
                    for (el in elements_list) DT2[el, c(el) := 1L]; DT2[, lapply(.SD, sum), by=list(ID, messy_string), .SDcols=elements_list]})

# data.table method - GeekTrader
usingDT.GT <- quote({dt <- copy(masterDT); myFunc()})

# data.table method - GeekTrader, modified by RS
usingDT.GT_Mod <- quote({dt <- copy(masterDT); myFunc.modified()})

# ply method from below
usingPlyr.eddi <- quote({dt <- copy(masterDT); indicators = do.call(rbind.fill, sapply(1:dim(dt)[1], function(i) dt[i, data.frame(t(as.matrix(table(strsplit(messy_string, split = "\\$"))))) ])); 
                    dt = cbind(dt, indicators); dt[is.na(dt)] = 0; dt })

这里是基准测试结果：

microbenchmark( usingDT.RS=eval(usingDT.RS), usingDT.GT=eval(usingDT.GT), usingDT.GT_Mod=eval(usingDT.GT_Mod), usingPlyr.eddi=eval(usingPlyr.eddi), times=5L)


  On smaller data: 

  N = 600
  Unit: milliseconds
              expr       min        lq    median        uq       max
  1     usingDT.GT 1189.7549 1198.1481 1200.6731 1202.0972 1203.3683
  2 usingDT.GT_Mod  581.7003  591.5219  625.7251  630.8144  650.6701
  3     usingDT.RS 2586.0074 2602.7917 2637.5281 2819.9589 3517.4654
  4 usingPlyr.eddi 2072.4093 2127.4891 2225.5588 2242.8481 2349.6086


  N = 1,000 
  Unit: seconds
       expr      min       lq   median       uq      max
  1 usingDT.GT 1.941012 2.053190 2.196100 2.472543 3.096096
  2 usingDT.RS 3.107938 3.344764 3.903529 4.010292 4.724700
  3  usingPlyr 3.297803 3.435105 3.625319 3.812862 4.118307

  N = 2,500
  Unit: seconds
              expr      min       lq   median       uq       max
  1     usingDT.GT 4.711010 5.210061 5.291999 5.307689  7.118794
  2 usingDT.GT_Mod 2.037558 2.092953 2.608662 2.638984  3.616596
  3     usingDT.RS 5.253509 5.334890 6.474915 6.740323  7.275444
  4 usingPlyr.eddi 7.842623 8.612201 9.142636 9.420615 11.102888

  N = 5,000
              expr       min        lq    median        uq       max
  1     usingDT.GT  8.900226  9.058337  9.233387  9.622531 10.839409
  2 usingDT.GT_Mod  4.112934  4.293426  4.460745  4.584133  6.128176
  3     usingDT.RS  8.076821  8.097081  8.404799  8.800878  9.580892
  4 usingPlyr.eddi 13.260828 14.297614 14.523016 14.657193 16.698229

  # dropping the slower two from the tests:
  microbenchmark( usingDT.RS=eval(usingDT.RS), usingDT.GT=eval(usingDT.GT), usingDT.GT_Mod=eval(usingDT.GT_Mod), times=6L)

  N = 10,000
  Unit: seconds
              expr       min        lq    median        uq       max
  1 usingDT.GT_Mod  8.426744  8.739659  8.750604  9.118382  9.848153
  2     usingDT.RS 15.260702 15.564495 15.742855 16.024293 16.249556

  N = 25,000
  ... (still running)

-----------------

用于基准测试的函数：

  # original random string function
  random_string <- function(min_length, max_length, separator) {  
      selection <- paste(sample(elements_list, ceiling(runif(1, min_length, max_length))), collapse = separator)  
      return(selection)  
  }  

  # GeekTrader's function
  myFunc <- function() {
    ll <- strsplit(dt[,messy_string], split="\\$")


    COLS <- do.call(rbind, 
                    lapply(1:length(ll), 
                           function(i) {
                             data.frame(
                               ID= rep(i, length(ll[[i]])),
                               COL = ll[[i]], 
                               VAL= rep(1, length(ll[[i]]))
                               )
                             }
                           )
                    )

    res <- as.data.table(tapply(COLS$VAL, list(COLS$ID, COLS$COL), FUN = length ))
    dt <- cbind(dt, res)
    for (j in names(dt))
      set(dt,which(is.na(dt[[j]])),j,0)
    return(dt)
  }


  # Improvements to @GeekTrader's `myFunc` -RS  '
  myFunc.modified <- function() {
    ll <- strsplit(dt[,messy_string], split="\\$")

    ## MODIFICATIONS: 
    # using `rbindlist` instead of `do.call(rbind.. )`
    COLS <- rbindlist( lapply(1:length(ll), 
                           function(i) {
                             data.frame(
                               ID= rep(i, length(ll[[i]])),
                               COL = ll[[i]], 
                               VAL= rep(1, length(ll[[i]])), 
  # MODICIATION:  Not coercing to factors                             
                               stringsAsFactors = FALSE
                               )
                             }
                           )
                    )

  # MODIFICATION: Preserve as matrix, the output of tapply
    res2 <- tapply(COLS$VAL, list(COLS$ID, COLS$COL), FUN = length )

  # FLATTEN into a data.table
    resdt <- data.table(r=c(res2))

  # FIND & REPLACE NA's of single column
    resdt[is.na(r), r:=0L]

  # cbind with dt, a matrix, with the same attributes as `res2`  
    cbind(dt, 
          matrix(resdt[[1]], ncol=ncol(res2), byrow=FALSE, dimnames=dimnames(res2)))
  }


  ### Benchmarks comparing the two versions of GeekTrader's function: 
  orig = quote({dt <- copy(masterDT); myFunc()})
  modified = quote({dt <- copy(masterDT); myFunc.modified()})
  microbenchmark(Modified = eval(modified), Orig = eval(orig), times=20L)

  #  Unit: milliseconds
  #        expr      min        lq   median       uq      max
  #  1 Modified  895.025  971.0117 1011.216 1189.599 2476.972
  #  2     Orig 1953.638 2009.1838 2106.412 2230.326 2356.802

- Ricardo Saporta

@geektrader，好的，稍等一下。 - Ricardo Saporta

data.table 的解决方案很棒，但为什么新的示例中出现了新变量 findMe？在您的基准测试中应该是 elements_list，在我的电脑上它比带有 NA 替换的 plyr 解决方案快约 3 倍，并且比没有 NA 替换的 plyr 快约 20%。 - eddi

此外，如果您在公式中指定0和1为整数（即 0L和 1L），则 data.table解决方案会稍微更快。 - eddi

我将研究GeekTrader的解决方案，但我的测试表明这个解决方案可以很好地扩展（与行数成线性关系），这一点非常重要。 - user2262318

下面发布了一种更近期的方法，或许可以加入到基准测试中？ - mtoto

显示剩余6条评论

3

这里是一个比较新的方法，使用splitstackshape包中的cSplit_e()函数。

library(splitstackshape)
cSplit_e(dt, split.col = "String", sep = "$", type = "character", 
         mode = "binary", fixed = TRUE, fill = 0)
#  ID String String_a String_b String_c
#1  1    a$b        1        1        0
#2  2    b$c        0        1        1
#3  3      c        0        0        1

- mtoto

2

这是一个使用rbind.fill 10倍更快的版本。

library(plyr)
indicators = do.call(rbind.fill, sapply(1:dim(dt)[1], function(i)
                        dt[i,
                           data.frame(t(as.matrix(table(strsplit(messy_string,
                                                                 split = "\\$")))))
                          ]))
dt = cbind(dt, indicators)

# dt[is.na(dt)] = 0
# faster NA replace (thanks geektrader)
for (j in names(dt))
  set(dt, which(is.na(dt[[j]])), j, 0L)

- eddi

你好，不错的解决方案。不过，看起来可能有些不准确。请查看输出结果。 - Ricardo Saporta

如果你说的是列的顺序不同，那就是这个问题了。 - eddi

这看起来很有前途，在我的机器上，最后一步的时间印记可以忽略不计；我得到了10倍的加速改进，即使包括它。我确实需要以特定顺序排列列，但我可以在最后重新排序（我认为这应该很快，但还没有尝试过）。然而，对我来说，这个解决方案似乎也可能随着行数的增加而扩展性差。我目前正在运行一个大型测试，以查看许多更多行时收益的消散程度。 - user2262318

在我的测试中，它随着行数呈线性扩展。 - eddi

@eddi，非常好的plyr解决方案！ :) - Arun

2

以下是使用rapply和table的一种方法。我相信还有比在这里使用table更快的方法，但它仍然比@ricardo的myfunc.Modified稍微快一点。

# a copy with enough column pointers available
dtr <- alloc.col(copy(dt)  ,1000L)

rapplyFun <- function(){
ll <- strsplit(dtr[, messy_string], '\\$')
Vals <- rapply(ll, classes = 'character', f= table, how = 'replace')
Names <- unique(rapply(Vals, names))

dtr[, (Names) := 0L]
for(ii in seq_along(Vals)){
  for(jj in names(Vals[[ii]])){
    set(dtr, i = ii, j = jj, value =Vals[[ii]][jj])
  }
}
}


microbenchmark(myFunc.modified(), rapplyFun(),times=5)
Unit: milliseconds
#             expr      min       lq   median       uq      max neval
# myFunc.modified() 395.1719 396.8706 399.3218 400.6353 401.1700     5
# rapplyFun()       308.9103 309.5763 309.9368 310.2971 310.3463     5

- mnel

1

这里有另一种解决方案，它构建了一个稀疏矩阵对象，而不是你所拥有的东西。这样可以节省很多时间和内存。它产生有序的结果，即使转换为data.table，也比使用0L和1L的GT3更快，而且不需要重新排序（这可能是因为我使用了不同的方法来到达所需的坐标 - 我没有经过GT3算法），但如果你不转换并将其保留为稀疏矩阵，则比GT3快10-20倍（并且具有更小的内存占用）。

library(Matrix)

strings = strsplit(dt$messy_string, split = "$", fixed = TRUE)
element.map = data.table(el = elements_list, n = seq_along(elements_list), key = "el")

tmp = data.table(n = seq_along(strings), each = unlist(lapply(strings, length)))

rows = tmp[, rep(n, each = each), by = n][, V1]
cols = element.map[J(unlist(strings))][,n]

dt.sparse = sparseMatrix(rows, cols, x = 1,
                         dims = c(max(rows), length(elements_list)))

# optional, should be avoided until absolutely necessary
dt = cbind(dt, as.data.table(as.matrix(dt.sparse)))
setnames(dt, c('id', 'messy_string', elements_list))

这个想法是将字符串拆分，然后使用 data.table 作为映射对象，将每个子字符串映射到其正确的列位置。从那里开始，只需要找出正确的行并填充矩阵即可。

- eddi

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- CHP · Accepted Answer

更新：版本3

发现了更快的方法。这个函数也非常高效地利用了内存。之前的函数速度较慢，主要原因是在lapply循环中发生了复制/赋值，以及结果的rbinding。在下一个版本中，我们使用适当大小的预分配矩阵，然后更改适当坐标处的值，这使得它与其他循环版本相比非常快速。

funcGT3 <- function() {
    #Get list of column names in result
    resCol <- unique(dt[, unlist(strsplit(messy_string, split="\\$"))])

    #Get dimension of result
    nresCol <- length(resCol)
    nresRow <- nrow(dt)

    #Create empty matrix with dimensions same as desired result
    mat <- matrix(rep(0, nresRow * nresCol), nrow = nresRow, dimnames = list(as.character(1:nresRow), resCol))

    #split each messy_string by $
    ll <- strsplit(dt[,messy_string], split="\\$")

    #Get coordinates of mat which we need to set to 1
    coords <- do.call(rbind, lapply(1:length(ll), function(i) cbind(rep(i, length(ll[[i]])), ll[[i]] )))

    #Set mat to 1 at appropriate coordinates
    mat[coords] <- 1    

    #Bind the mat to original data.table
    return(cbind(dt, mat))

}


result <- funcGT3()  #result for 1000 rows in dt
result
        ID   messy_string zn tc sv db yx st ze qs wq oe cv ut is kh kk im le qg rq po wd kc un ft ye if zl zt wy et rg iu
   1:    1 zn$tc$sv$db$yx  1  1  1  1  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   2:    2    st$ze$qs$wq  0  0  0  0  0  1  1  1  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   3:    3    oe$cv$ut$is  0  0  0  0  0  0  0  0  0  1  1  1  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   4:    4 kh$kk$im$le$qg  0  0  0  0  0  0  0  0  0  0  0  0  0  1  1  1  1  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   5:    5    rq$po$wd$kc  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  1  1  1  0  0  0  0  0  0  0  0  0  0
  ---                                                                                                                    
 996:  996    rp$cr$tb$sa  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
 997:  997    cz$wy$rj$he  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  0  0  0
 998:  998       cl$rr$bm  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
 999:  999    sx$hq$zy$zd  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
1000: 1000    bw$cw$pw$rq  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0

对照Ricardo建议的版本2进行基准测试（针对包含250K行数据的情况）：

Unit: seconds
 expr       min        lq    median        uq       max neval
  GT2 104.68672 104.68672 104.68672 104.68672 104.68672     1
  GT3  15.15321  15.15321  15.15321  15.15321  15.15321     1

版本 1 以下是建议答案的版本 1。

set.seed(10)  
elements_list <- c(outer(letters, letters, FUN = paste, sep = ""))  
random_string <- function(min_length, max_length, separator) {  
  selection <- paste(sample(elements_list, ceiling(runif(1, min_length, max_length))), collapse = separator)  
  return(selection)  
}  
dt <- data.table(ID = c(1:1000), messy_string = "")  
dt[ , messy_string := random_string(2, 5, "$"), by = ID]  


myFunc <- function() {
  ll <- strsplit(dt[,messy_string], split="\\$")


  COLS <- do.call(rbind, 
                  lapply(1:length(ll), 
                         function(i) {
                           data.frame(
                             ID= rep(i, length(ll[[i]])),
                             COL = ll[[i]], 
                             VAL= rep(1, length(ll[[i]]))
                             )
                           }
                         )
                  )

  res <- as.data.table(tapply(COLS$VAL, list(COLS$ID, COLS$COL), FUN = length ))
  dt <- cbind(dt, res)
  for (j in names(dt))
    set(dt,which(is.na(dt[[j]])),j,0)
  return(dt)
}


create_indicators <- function(search_list, searched_string) {  
  y <- rep(0, length(search_list))  
  for(j in 1:length(search_list)) {  
    x <- regexpr(search_list[j], searched_string)  
    x <- x[1]  
    y[j] <- ifelse(x > 0, 1, 0)  
  }  
  return(y)  
}  
OPFunc <- function() {
indicators <- matrix(0, nrow = nrow(dt), ncol = length(elements_list))  
for(n in 1:nrow(dt)) {  
  indicators[n, ] <- dt[n, create_indicators(elements_list, messy_string)]  
}  
indicators <- data.table(indicators)  
setnames(indicators, elements_list)  
dt <- cbind(dt, indicators)
return(dt)
}



library(plyr)
plyrFunc <- function() {
  indicators = do.call(rbind.fill, sapply(1:dim(dt)[1], function(i)
    dt[i,
       data.frame(t(as.matrix(table(strsplit(messy_string,
                                             split = "\\$")))))
       ]))
  dt = cbind(dt, indicators)
  #dt[is.na(dt)] = 0 #THIS DOESN'T WORK. USING FOLLOWING INSTEAD

  for (j in names(dt))
    set(dt,which(is.na(dt[[j]])),j,0)

  return(dt)  
}

基准测试

system.time(res <- myFunc())
## user  system elapsed 
## 1.01    0.00    1.01

system.time(res2 <- OPFunc())
## user  system elapsed 
## 21.58    0.00   21.61

system.time(res3 <- plyrFunc())
## user  system elapsed 
## 1.81    0.00    1.81

版本2：由Ricardo提出建议

我将此发布在此处而不是我的回答中，因为框架实际上是@GeekTrader的 -Rick_

  myFunc.modified <- function() {
    ll <- strsplit(dt[,messy_string], split="\\$")

    ## MODIFICATIONS: 
    # using `rbindlist` instead of `do.call(rbind.. )`
    COLS <- rbindlist( lapply(1:length(ll), 
                           function(i) {
                             data.frame(
                               ID= rep(i, length(ll[[i]])),
                               COL = ll[[i]], 
                               VAL= rep(1, length(ll[[i]])), 
  # MODICIATION:  Not coercing to factors                             
                               stringsAsFactors = FALSE
                               )
                             }
                           )
                    )

  # MODIFICATION: Preserve as matrix, the output of tapply
    res2 <- tapply(COLS$VAL, list(COLS$ID, COLS$COL), FUN = length )

  # FLATTEN into a data.table
    resdt <- data.table(r=c(res2))

  # FIND & REPLACE NA's of single column
    resdt[is.na(r), r:=0L]

  # cbind with dt, a matrix, with the same attributes as `res2`  
    cbind(dt, 
          matrix(resdt[[1]], ncol=ncol(res2), byrow=FALSE, dimnames=dimnames(res2)))
  }


 ### Benchmarks: 

orig = quote({dt <- copy(masterDT); myFunc()})
modified = quote({dt <- copy(masterDT); myFunc.modified()})
microbenchmark(Modified = eval(modified), Orig = eval(orig), times=20L)

#  Unit: milliseconds
#        expr      min        lq   median       uq      max
#  1 Modified  895.025  971.0117 1011.216 1189.599 2476.972
#  2     Orig 1953.638 2009.1838 2106.412 2230.326 2356.802