I'm looking for a gsub call that returns all matches of an expression, not just the last one. For example:
data <- list("a sentence with citation (Ref. 12) and another (Ref. 13)", "single (Ref. 14)")
gsub(".*(Ref. (\\d+)).*", "\\1", data)
returns
[1] "Ref. 13" "Ref. 14"
so reference 12 is lost.
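To make the cause concrete, here is a minimal sketch (the variable names x, greedy and lazy are only illustrative). The pattern consumes the whole string in one match, and the leading greedy .* pushes the capture group onto the last reference. A lazy .*? moves it to the first one, but either way a single replacement can only keep one reference per string.

```r
x <- "a sentence with citation (Ref. 12) and another (Ref. 13)"

# Greedy: the leading .* takes as much as it can, so the group captures the LAST reference.
greedy <- gsub(".*(Ref. (\\d+)).*", "\\1", x)

# Lazy: .*? takes as little as it can, so the group captures the FIRST reference.
# perl = TRUE is used here so the lazy quantifier follows PCRE semantics.
lazy <- gsub(".*?(Ref. (\\d+)).*", "\\1", x, perl = TRUE)

print(greedy)  # "Ref. 13"
print(lazy)    # "Ref. 12"
```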
The strapply function in the gsubfn package can do this:
library(gsubfn)
data <- list("a sentence with citation (Ref. 12) and another (Ref. 13)", "single (Ref. 14)")
unlist(strapply(data,"(Ref. (\\d+))"))
Or how about sapply(data, stringr::str_extract_all, pattern = "Ref. (\\d+)")?
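For comparison, the same extraction can be sketched in base R with no extra packages, by pairing gregexpr() with regmatches(). The names x, refs and all_refs are illustrative, and the dot after "Ref" is escaped here so that it only matches a literal period, which is slightly stricter than the patterns above.

```r
data <- list("a sentence with citation (Ref. 12) and another (Ref. 13)", "single (Ref. 14)")

# gregexpr() and regmatches() work on character vectors, so flatten the list first.
x <- unlist(data)

# gregexpr() finds every match position; regmatches() pulls out the matched text.
# The result is a list with one character vector per input string.
refs <- regmatches(x, gregexpr("Ref\\. \\d+", x))

# Collapse to a single vector of all references.
all_refs <- unlist(refs)
print(all_refs)  # "Ref. 12" "Ref. 13" "Ref. 14"
```

Keeping refs as a list preserves which string each reference came from; unlist() is only needed when a flat vector is wanted.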
Here's a function, essentially a wrapper around gregexpr(), that captures multiple references from a single string.
extractMatches <- function(data, pattern) {
    start <- gregexpr(pattern, data)[[1]]
    stop  <- start + attr(start, "match.length") - 1
    if(-1 %in% start) {
        ""    ## **Note** you could return NULL if there are no matches
    } else {
        mapply(substr, start, stop, MoreArgs = list(x = data))
    }
}
data <- list("a sentence with citation (Ref. 12), (Ref. 13), and then (Ref. 14)",
"another sentence without reference")
pat <- "Ref. (\\d+)"
res <- lapply(data, extractMatches, pattern = pat)
res
# [[1]]
# [1] "Ref. 12" "Ref. 13" "Ref. 14"
#
# [[2]]
# [1] ""
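The note in the code comment above (returning NULL when nothing matches) can be sketched as follows. The function name extractMatchesOrNull and the variable flat are hypothetical, introduced only for this illustration; the rest follows the function as posted.

```r
# Variant of the wrapper above that returns NULL instead of "" when nothing matches.
extractMatchesOrNull <- function(data, pattern) {
    start <- gregexpr(pattern, data)[[1]]
    # gregexpr() reports "no match" as a single -1.
    if (start[1] == -1) return(NULL)
    stop <- start + attr(start, "match.length") - 1
    mapply(substr, start, stop, MoreArgs = list(x = data))
}

data <- list("a sentence with citation (Ref. 12), (Ref. 13), and then (Ref. 14)",
             "another sentence without reference")
res <- lapply(data, extractMatchesOrNull, pattern = "Ref. (\\d+)")

# c() silently drops NULL elements, so only real matches remain.
flat <- do.call("c", res)
print(flat)  # "Ref. 12" "Ref. 13" "Ref. 14"
```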
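(If you return NULL instead of "", you can post-process the result with do.call("c", res) to get a single vector containing only the matched references.)

I ran into a very similar problem before (http://thebiobucket.blogspot.com/2012/03/how-to-extract-citation-from-body-of.html) and came up with this solution (which is actually very similar to Ben's):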
require(stringr)
unlist(str_extract_all(unlist(data), pattern = "\\(.*?\\)"))
which gives:
[1] "(Ref. 12)" "(Ref. 13)" "(Ref. 14)"
(str_extract_all in turn calls str_locate_all, which calls re_mapply("gregexpr", string, pattern); that is the best pseudocode summary of the function I can come up with.) - Josh O'Brien