How to remove every occurrence of a sub-list from a vector (R)

Question


4

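I looked at the Python question on how to remove every occurrence of a sub-list from a list. Now I would like to know how many creative ways there are to do this in R.
For example, remove every occurrence of sub_list from main_list.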

main_list = c(2, 1, 2, 3, 1, 2, 4, 2, 2 ,1)
sub_list = c(1,2)

Desired result: 2 3 4 2 2 1

My suggestions:

a<-c()
for(i in 1:(length(main_list)-1)){
if (all(main_list[c(i,i+1)]==sub_list))
{a<-c(a,c(i,i+1))}
}
main_list[-a]
[1] 2 3 4 2 2 1

Suggestion 2:

as.numeric(unlist(strsplit(gsub("(12)","",paste0(main_list,collapse = "")),split = "")))

Oops, this one is really dangerous. Let's try:

main_list = c(2, 1, 2, 3, 12, 1, 2, 4, 2, 2, 1)
as.numeric(unlist(strsplit(gsub("(12)","",paste0(main_list,collapse = "")),split = "")))
[1] 2 3 4 2 2 1
####However 
a<-c()
for(i in 1:(length(main_list)-1)){
if (all(main_list[c(i,i+1)]==sub_list))
{a<-c(a,c(i,i+1))}
}
main_list[-a]
[1]  2  3 12  4  2  2  1
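
The string-based attempt goes wrong because collapsing with an empty separator erases the boundaries between elements. A minimal illustration (the variable names `pair` and `single` are made up for this sketch):

```r
# With collapse = "", the pair c(1, 2) and the single value 12 become the same string.
pair   <- paste0(c(1, 2), collapse = "")
single <- paste0(12, collapse = "")
identical(pair, single)

# A non-empty separator keeps the element boundaries visible.
paste(c(1, 2), collapse = ",")
paste(12, collapse = ",")
```

Once the two strings are identical, no pattern can tell the element 12 apart from a 1 followed by a 2.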

Update, September 8, 2018

Benchmark setup:

I evaluated the solutions in terms of memory and time on a large numeric vector, using the profmem and microbenchmark packages.

set.seed(1587)
main_list<-sample(c(8:13,102:105),size = 10000000,replace = T)
main_list<-c(c(8,9,12,103),main_list,c(8,9,12,103))   
sub_list<-c(8,9,12,103)

d.b's solution did not work for this main_list, so I modified it as follows:

ML = paste(main_list, collapse = ",")  # collapse should not be empty
SL = paste(sub_list, collapse = ",")
out <- gsub(SL, "", ML)
out <- gsub(",{2,}", ",", out)  # collapse comma runs left where matches were removed
out <- gsub("^,|,$", "", out)   # strip a leading or trailing comma
out <- as.numeric(unlist(strsplit(out, split = ",")))
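
A comma separator still leaves one gap: a pattern such as "8,9" can match inside "18,9". One possible way to close it is to wrap the collapsed string in delimiters and require a delimiter on both sides of the match. This is only a sketch; the function name `remove_sub_str` is hypothetical, and it assumes a non-empty sub_list of small non-negative integers, so that `paste` produces plain digit strings without regex metacharacters.

```r
remove_sub_str <- function(main_list, sub_list) {
  # Surround the whole string with commas so every element is delimited on both sides.
  ML <- paste0(",", paste(main_list, collapse = ","), ",")
  # Match the pattern only when preceded by a comma and followed by one (lookahead).
  SL <- paste0(",", paste(sub_list, collapse = ","), "(?=,)")
  out <- gsub(SL, "", ML, perl = TRUE)
  out <- gsub("^,|,$", "", out)              # drop the wrapping commas
  if (!nzchar(out)) return(main_list[0])     # everything was removed
  as.numeric(strsplit(out, ",", fixed = TRUE)[[1]])
}

remove_sub_str(c(2, 1, 2, 3, 1, 2, 4, 2, 2, 1), c(1, 2))
remove_sub_str(c(18, 9, 3), c(8, 9))   # 18 must not be split into 1 and "8,9"
```

The lookahead leaves the trailing comma in place, so back-to-back matches are each still preceded by a comma and are removed as well.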

  solution       seconds memory_byte memory_base seconds_base
  <chr>            <dbl>       <dbl>       <dbl>        <dbl>
1 d.b              26.0    399904560        1           16.8 
2 Grothendieck_2    1.55  1440070304        3.60         1   
3 Grothendieck_1  109.    4968036376       12.4         70.3 
4 李哲源            2.17  1400120824        3.50         1.40

Any comments on the benchmark?

- Iman

Can open-ended questions encourage creativity? - Iman

2 Answers

4

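Here are two solutions. The first is clearly simpler and is the one to use if you favor clarity and maintainability, while the second has no package dependencies and is faster.

1) zoo. Use a moving window to compare each subsequence of c(main_list, sub_list) of the required length to sub_list. (We append sub_list to ensure that there is always something to remove.) The rollapplyr statement returns TRUE or FALSE for each position according to whether that position is the end of a matching subsequence. We then compute the index numbers of the TRUE values, derive from them the indexes of all elements to remove, and remove them.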

library(zoo)

w <- length(sub_list)
r <- rollapplyr(c(main_list, sub_list), w, identical, sub_list, fill = FALSE)
main_list[-c(outer(which(r), seq_len(w) - 1, "-"))]
## [1] 2 3 4 2 2 1
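
The parenthetical about appending sub_list appears to rely on how R treats negative indices: the appended copy always matches, its positions lie past the end of main_list, and negative indices that are out of range are silently ignored. A small base R illustration (the vector `x` is made up for this sketch):

```r
x <- c(10, 20, 30)

# Negative indices beyond length(x) are silently ignored, so nothing is dropped.
x[-c(4, 5)]

# In-range and out-of-range negatives can be mixed; only position 2 is dropped.
x[-c(2, 5)]
```

Because the index vector is therefore never empty, the subsetting does not hit the zero-length case in which x[-integer(0)] would return an empty vector.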

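2) Base R. The middle line, which sets r, serves the same purpose as the corresponding line in (1), and the last line is the same as the last line in (1) except that, because embed effectively uses left alignment, we use + instead of -.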

w <- length(sub_list)
r <- colSums(t(embed(c(main_list, sub_list), w)) == rev(sub_list)) == w
main_list[-c(outer(which(r), seq_len(w) - 1, "+"))]
## [1] 2 3 4 2 2 1
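
To see why rev(sub_list) and "+" are needed, it may help to look at what embed returns for a tiny input. The names `x`, `pattern`, and `hits` below are illustrative only:

```r
x <- 1:5
w <- 2

# Each row of embed(x, w) is one window of x, stored in reverse order:
# row i holds x[i + w - 1], ..., x[i].
e <- embed(x, w)
e

# Transposing puts one window per column, so comparing against rev(pattern)
# recycles the reversed pattern down each column.
pattern <- c(3, 4)
hits <- colSums(t(e) == rev(pattern)) == w
which(hits)  # start position of the window equal to pattern
```

Since which(hits) reports the start of each matching window, the remaining positions are reached by adding the offsets 0, ..., w - 1.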

- G. Grothendieck

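Yes, embed is a nice function: https://dev59.com/9Z7ha4cB1Zd3GeqPkYLH#41839433. The only inconvenience is that we need a rev. Are you sure you don't need to guard x[-ind] by testing length(ind) > 0L? - Zheyuan Li

As long as sub_list has strictly positive length, the index in the last line of either solution above also has strictly positive length, so there is no need to test it. If you want to handle that edge case, you can test whether sub_list has zero length. - G. Grothendieck

Yes, when I added the second solution I modified the first solution to use the same trick. - G. Grothendieck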


- Zheyuan Li · Accepted Answer

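Here is a function that does this in general.

xm is the main list, containing integer, character or logical values;
xs is the sub-list, containing integer, character or logical values.

At the moment no checks are performed; the function requires length(xm) > length(xs).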

foo <- function (xm, xs) {
  nm <- length(xm)
  ns <- length(xs)
  # column j holds the indices j, j + 1, ..., j + ns - 1: one candidate window per column
  shift_ind <- outer(0:(ns - 1), 1:(nm - ns + 1), "+")
  d <- xm[shift_ind] == xs
  # a window matches when all ns comparisons in its column are TRUE
  first_drop_ind <- which(.colSums(d, ns, length(d) / ns) == ns)
  if (length(first_drop_ind) > 0L) {
    drop_ind <- outer(0:(ns - 1), first_drop_ind, "+")
    return(xm[-drop_ind])
  } else {
    return(xm)
  }
}

main_list = c(2, 1, 2, 3, 1, 2, 4, 2, 2 ,1)
sub_list = c(1,2)
foo(main_list, sub_list)
#[1] 2 3 4 2 2 1

Explanation


xm <- main_list
xs <- sub_list

nm <- length(xm)
ns <- length(xs)
shift_ind <- outer(0:(ns - 1), 1:(nm - ns + 1), "+")
MAT <- matrix(xm[shift_ind], ns)
#     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
#[1,]    2    1    2    3    1    2    4    2    2
#[2,]    1    2    3    1    2    4    2    2    1

So the first step is the shifting and the matrix representation shown above.

LOGIC <- MAT == xs
#      [,1] [,2]  [,3]  [,4] [,5]  [,6]  [,7]  [,8]  [,9]
#[1,] FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
#[2,] FALSE TRUE FALSE FALSE TRUE FALSE  TRUE  TRUE FALSE

If an occurrence is found, its column contains only TRUE, i.e. its colSums value equals ns. This gives us the position of the first value of each match.

first_drop_ind <- which(colSums(LOGIC) == ns)
#[1] 2 5

Now we need to expand these to cover the values that follow each initial match.

drop_ind <- outer(0:(ns - 1), first_drop_ind, "+")
#     [,1] [,2]
#[1,]    2    5
#[2,]    3    6

Finally, we remove the values at these positions from xm:

xm[-drop_ind]
#[1] 2 3 4 2 2 1

Note that in the function the matrix is not formed explicitly, and .colSums is used instead of colSums.
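
As a rough check of that remark, the two routes can be compared on the example data. The names `via_vector` and `via_matrix` are introduced only for this sketch:

```r
xm <- c(2, 1, 2, 3, 1, 2, 4, 2, 2, 1)
xs <- c(1, 2)
ns <- length(xs)

shift_ind <- outer(0:(ns - 1), 1:(length(xm) - ns + 1), "+")

# Indexing the plain vector xm returns a plain vector (no dim attribute),
# laid out column by column; xs is recycled along it.
d <- xm[shift_ind] == xs

# .colSums is told the shape (ns rows, length(d) / ns columns) explicitly.
via_vector <- .colSums(d, ns, length(d) / ns)

# The same sums from an explicitly formed matrix.
via_matrix <- colSums(matrix(d, nrow = ns))

all.equal(via_vector, via_matrix)
```

Both give the same column sums; the .colSums route simply skips building the intermediate matrix.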

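A pitfall to watch for.

The if ... else ... in the function is necessary. If no match is found, drop_ind is integer(0), and xm[-drop_ind] is then xm[integer(0)], which is a zero-length vector rather than xm unchanged.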
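
The pitfall is easy to reproduce in isolation. In this sketch `no_match` stands in for an empty set of matches and `res` is an illustrative name:

```r
xm <- c(2, 3, 4)
no_match <- integer(0)

# -integer(0) is still integer(0), so this selects nothing instead of dropping nothing.
length(xm[-no_match])

# Guarding on the number of matches keeps xm intact when there is nothing to drop.
res <- if (length(no_match) > 0L) xm[-no_match] else xm
res
```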

Comparison with zoo::rollapplyr

## requires package `zoo`
bar <- function (xm, xs) {
  w <- length(xs)
  r <- rollapplyr(xm, w, identical, xs, fill = FALSE)
  # with fill = FALSE, r is padded to length(xm), so test for a match, not for length
  if (any(r)) {
    return(xm[-c(outer(which(r), seq_len(w) - 1, "-"))])
  } else {
    return(xm)
  }
}

set.seed(0)
xm <- sample.int(10, 10000, TRUE)
xs <- 1:2

library(zoo)

system.time(a <- foo(xm, xs))
#   user  system elapsed 
#  0.004   0.000   0.001 

system.time(b <- bar(xm, xs))
#   user  system elapsed 
#  0.276   0.000   0.273 

all.equal(a, b)
#[1] TRUE

My guess is that rollapplyr is slower because:

it first has to coerce xm to a "zoo" object;
it uses lapply internally, which means frequent jumps between R and C.