R中的最长公共子串：查找两个字符串之间的非连续匹配

Question

R中的最长公共子串：查找两个字符串之间的非连续匹配

25

我有一个关于在R中查找最长公共子串的问题。在StackOverflow上查看了几篇帖子后，我了解到了qualV包。然而，我发现该包中的LCS函数实际上会找出string1中所有存在于string2中的字符，即使它们不是连续的。

举个例子，如果字符串是： string1: "hello" string2: "hel12345lo" 我期望的输出应该是hel，但实际得到的输出却是hello。我一定做错了什么。请看下面的代码：

library(qualV)
a= "hello"
b="hel123l5678o" 
sapply(seq_along(a), function(i)
    paste(LCS(substring(a[i], seq(1, nchar(a[i])), seq(1, nchar(a[i]))),
              substring(b[i], seq(1, nchar(b[i])), seq(1, nchar(b[i]))))$LCS,
          collapse = ""))

我也尝试过Rlibstree方法，但我仍然得到了不连续的子字符串。此外，子字符串的长度也与我的期望不符。请见下方。

> a = "hello"
> b = "h1e2l3l4o5"

> ll <- list(a,b)
> lapply(data.frame(do.call(rbind, ll), stringsAsFactors=FALSE), function(x) getLongestCommonSubstring(x))
$do.call.rbind..ll.
[1] "h" "e" "l" "o"

> nchar(lapply(data.frame(do.call(rbind, ll), stringsAsFactors=FALSE), function(x) getLongestCommonSubstring(x)))
do.call.rbind..ll.
                21

- IAMTubby

2

相关问题：https://dev59.com/7GQo5IYBdhLWcg3wQdet - Andrie

1

@Andrie，我尝试了链接中的Rlibstree方法。但是，我仍然得到了不连续的子字符串。而且匹配子字符串的长度有误。我已经在上面编辑了我的原始帖子，并添加了信息，请看一下。 - IAMTubby

2

澄清一下：qualV的LCS函数并不是找到最长公共子串，而是找到最长公共子序列 - 因此你得到的结果就是这样。这就是子序列的定义。这些问题相关但解决方案相当不同，最长公共子序列问题是计算机科学中更经典的问题，因此更常被实现。 - Konrad Rudolph

7个回答

9

利用 @RichardScriven 的见解， adist 可以用于计算“近似字符串距离”，我制作了一个更全面的函数。请注意，"trafos" 代表用于确定两个字符串之间“距离”的“转换”（示例在底部）。编辑此答案可能会产生错误/意外结果；如 @wdkrnls 所指出的：

引用：

我对 "apple" 和 "big apple bagels" 运行了您的函数，它返回了 "appl"。我本应该期望得到 "apple"。

请参见下面的解释错误的结果。我们从一个函数开始，以获取列表中的longest_string：

longest_string <- function(s){return(s[which.max(nchar(s))])}

然后，我们可以使用@RichardSriven的工作和stringi库：

library(stringi)
lcsbstr <- function(a,b) { 
  sbstr_locations<- stri_locate_all_regex(drop(attr(adist(a, b, counts=TRUE), "trafos")), "M+")[[1]]
  cmn_sbstr<-stri_sub(longest_string(c(a,b)), sbstr_locations)
  longest_cmn_sbstr <- longest_string(cmn_sbstr)
   return(longest_cmn_sbstr) 
}

或者我们可以重写代码，避免使用任何外部库（仍然使用R的本地adist函数）：

lcsbstr_no_lib <- function(a,b) { 
    matches <- gregexpr("M+", drop(attr(adist(a, b, counts=TRUE), "trafos")))[[1]];
    lengths<- attr(matches, 'match.length')
    which_longest <- which.max(lengths)
    index_longest <- matches[which_longest]
    length_longest <- lengths[which_longest]
    longest_cmn_sbstr  <- substring(longest_string(c(a,b)), index_longest , index_longest + length_longest - 1)
    return(longest_cmn_sbstr ) 
}

以上两个函数仅将'hello '识别为最长公共子串，而不是'hello r'（无论哪个参数更长）：

identical('hello',
    lcsbstr_no_lib('hello', 'hello there'), 
    lcsbstr(       'hello', 'hello there'),
    lcsbstr_no_lib('hello there', 'hello'), 
    lcsbstr(       'hello there', 'hello'))

最近编辑 请注意这个结果有一些奇怪的行为：

lcsbstr('hello world', 'hello')
#[1] 'hell'

我原本期待的是'hello'，但由于转换实际上通过删除将world中的“o”移动到hello中成为“o” - 只有hell部分被认为是符合M的匹配。

drop(attr(adist('hello world', 'hello', counts=TRUE), "trafos"))
#[1] "MMMMDDDMDDD"
#[1]  vvvv   v
#[1] "hello world"

使用 Levenstein工具可以观察到这种行为，它提供了两个可能的解决方案，相当于这两个转换。

#[1] "MMMMDDDMDDD"
#[1] "MMMMMDDDDDD"

我不知道我们是否可以配置 adist 以偏爱一种解决方案？（这些转换具有相同的“权重”--相同数量的“M”和“D” - 不知道如何优先考虑具有更多连续 M 的转换）

最后，别忘了 adist 允许你传递 ignore.case = TRUE （默认为 FALSE）

adist 的 "trafos" 属性是关键；它是从一个字符串到另一个字符串所需的 "transformations"：

变换序列作为返回值的 "trafos" 属性返回，表示为字符串，其中包含元素 M、I、D 和 S，分别表示匹配、插入、删除和替换。

- Nate Anderson

1

为了完善你的解决方案，如果你知道想要选择LCS的字符串是a还是b，你可以在函数内部添加grep，并将'longest_cmn_sbstr'作为参数来返回完整的字符串。 - Vaibhav

1

我对你的函数分别输入了“apple”和“big apple bagels”，但它返回了“appl”。我本来期望的是“apple”。 - wdkrnls

1

是的@wdkrnls，我同意我的解决方案对于“longest”不正确--它依赖于Levenstein，可能会识别涉及“DELETIONS”的不同解决方案（请参见我的答案编辑）。这就是你得到“appl”的原因；这也是我得到这个结果的原因： lcsbstr('hello world', 'hello') #[1] 'hell' 也许我可以修改我的正则表达式，这样我就不仅仅寻找连续的“M”，而且还要检查跨越“D”（删除）的“M”（匹配）。 - Nate Anderson

2

LCSn函数（PTXQC包）可以在输入向量中找到所有字符串的最长公共字符串。需要注意的是，它使用最短的字符串作为基础，因此在比较多个字符串时可能无法给出您想要的结果。但是，它是比较两个序列的好选择。

library(PTXQC)
LCSn(c("hello","hello world"))
#[1] "hello"

LCSn(c("hello", "hel123l5678o"))
#[1] "hel"

- l0110

1

我不确定你是如何得到输出“hello”的。根据下面的试错实验，似乎LCS函数会（a）不将一个字符作为LCS，如果其后跟随着其他字符；（b）找到多个同样长度的LCS（与只找到第一个的sub()不同）；（c）字符串中元素的顺序不重要——这在下面没有展示；（d）调用LCS时字符串的顺序也不重要——同样未被展示。

因此，在b中，“hel”后跟着一个字符，所以a中的“hello”在b中没有LCS。这是我的当前假设。

上述A点：

a= c("hello", "hel", "abcd")
b= c("hello123l5678o", "abcd") 
print(LCS(a, b)[4]) # "abcd" - perhaps because it has nothing afterwards, unlike hello123...

a= c("hello", "hel", "abcd1") # added 1 to abcd
b= c("hello123l5678o", "abcd") 
print(LCS(a, b)[4]) # no LCS!, as if anything beyond an otherwise LCS invalidates it

a= c("hello", "hel", "abcd") 
b= c("hello1", "abcd") # added 1 to hello
print(LCS(a, b)[4]) # abcd only, since the b hello1 has a character

上面的B点：

a= c("hello", "hel", "abcd") 
b= c("hello", "abcd") 
print(LCS(a, b)[4]) # found both, so not like sub vs gsub of finding first or all

- lawyeR

1

律师，非常抱歉，我还没有完全理解。我正在寻找一个函数，它以两个字符串作为参数，并返回这两个字符串之间最长的公共子串。我有点困惑阅读上面的帖子。 - IAMTubby

1

我正在解释LCS可以做什么，以及不能做什么。 - lawyeR

律师，噢好的！但是，为了澄清，是否有更好的方法来找到两者之间的最长公共子串？ - IAMTubby

1

df <- data.frame(A. = c("Australia", "Network"),
                 B. = c("Austria", "Netconnect"), stringsAsFactors = FALSE)

 auxFun <- function(x) {

   a <- strsplit(x[[1]], "")[[1]]
   b  <- strsplit(x[[2]], "")[[1]]
   lastchar <- suppressWarnings(which(!(a == b)))[1] - 1

   if(lastchar > 0){
     out <- paste0(a[1:lastchar], collapse = "")
   } else {
     out <- ""
   }

   return(out)
 }

 df$C. <- apply(df, 1, auxFun)

 df
 A.         B.    C.
 1 Australia    Austria Austr
 2   Network Netconnect   Net

- Juan Antonio Roldán Díaz

1

这将适用于子字符串从两个字符串的开头开始的情况，但是如果子字符串出现在某些字符串之间，这将失败。 - Vaibhav

是的，你说得对。但是如果你考虑到子字符串出现在某个字符串之间，你可以为每个配对获得多个输出。而且，可以通过调整代码来获取在某个字符串之间匹配的第一个字符串。 - Juan Antonio Roldán Díaz

1

使用biostrings：

library(Biostrings)
a= "hello"
b="hel123l5678o"
astr= BString(a)
bstr=BString(b)

pmatchPattern(astr, bstr)

返回：

  Views on a 12-letter BString subject
Subject: hel123l5678o
views:
      start end width
  [1]     1   3     3 [hel]
  Views on a 5-letter BString pattern
Pattern: hello
views:
      start end width
  [1]     1   3     3 [hel]

所以我进行了基准测试，尽管我的答案确实做到了这一点，并且提供了更多的信息，但是它比@Rich Scriven慢了大约500倍。

system.time({
a= "hello"
b="123hell5678o"
rounds=100
for (i in 1:rounds) {
astr= BString(a)
bstr=BString(b)
pmatchPattern(astr, bstr)
}
})

system.time({
  c= "hello"
  d="123hell5678o"
  rounds=100
  for (i in 1:rounds) {
ta <- drop(attr(adist(c, d, counts=TRUE), "trafos"))
stri_sub(d, stri_locate_all_regex(ta, "M+")[[1]])
}
})

   user  system elapsed 
  2.476   0.027   2.510 

   user  system elapsed 
  0.006   0.000   0.005

- kana

0

我根据@Rich Scriven的答案修改了它以适应我的目的。目标是在向量中查找最长公共字符串，而不是在两个字符串之间查找。最后可以通过分组在data.table中使用它。

示例：

library(data.table)
library(stringi)

# create the function ------------------------------------

get.lcs.vector <- function(your.vector) {
  
  # get longest common string
  get.lcs <- function(x, y) {
    # get longest common string
    sb <- stri_sub(y, 1, 1:nchar(y))
    sstr <- na.omit(stri_extract_all_coll(x, sb, simplify=TRUE))
    result <- sstr[which.max(nchar(sstr))]
    return(result)
  }
  combi <- data.table(expand.grid(your.vector, your.vector, stringsAsFactors = F))[Var1 != Var2]
  combi.result <- unique(mapply(get.lcs, combi[[1]], combi[[2]]))
  lcs <- combi.result[which.min(nchar(combi.result))]
  return(lcs)
}

# example of data ------------------------------------

dt <- data.table(AN = c("APILCASERNB", "APILCASELNB", "APILCASEYHANB", 
                        "A15DPGY", "A15DPRD", "A15DPWH", "A15DPDB", "A15DPYW", "A15DPTL", 
                        "A15DP4PGY", "A15DP4PRD", "A15DP4PWH", "A15DP4PDB", "A15DP4PYW", 
                        "A15DP4PTL"),
                 Name = c("Example1", "Example1", "Example1", "Example2", 
                          "Example2", "Example2", "Example2", "Example2", "Example2", "Example3", 
                          "Example3", "Example3", "Example3", "Example3", "Example3"))

dt

##                AN     Name
##  1:   APILCASERNB Example1
##  2:   APILCASELNB Example1
##  3: APILCASEYHANB Example1
##  4:       A15DPGY Example2
##  5:       A15DPRD Example2
##  6:       A15DPWH Example2
##  7:       A15DPDB Example2
##  8:       A15DPYW Example2
##  9:       A15DPTL Example2
## 10:     A15DP4PGY Example3
## 11:     A15DP4PRD Example3
## 12:     A15DP4PWH Example3
## 13:     A15DP4PDB Example3
## 14:     A15DP4PYW Example3
## 15:     A15DP4PTL Example3


# smaller exmaple ------------------------------------

dt.ex <- dt[Name == unique(Name)[1]]
dt.ex

##               AN     Name
## 1:   APILCASERNB Example1
## 2:   APILCASELNB Example1
## 3: APILCASEYHANB Example1

get.lcs.vector(dt.ex$AN)

## [1] "APILCASE"

# you can also start from end like this
stri_reverse(get.lcs.vector(stri_reverse(dt.ex$AN)))



# Example on all data.table ------------------------------------

dt[, AN2 := get.lcs.vector(AN), Name]
dt

##                AN     Name      AN2
##  1:   APILCASERNB Example1 APILCASE
##  2:   APILCASELNB Example1 APILCASE
##  3: APILCASEYHANB Example1 APILCASE
##  4:       A15DPGY Example2    A15DP
##  5:       A15DPRD Example2    A15DP
##  6:       A15DPWH Example2    A15DP
##  7:       A15DPDB Example2    A15DP
##  8:       A15DPYW Example2    A15DP
##  9:       A15DPTL Example2    A15DP
## 10:     A15DP4PGY Example3  A15DP4P
## 11:     A15DP4PRD Example3  A15DP4P
## 12:     A15DP4PWH Example3  A15DP4P
## 13:     A15DP4PDB Example3  A15DP4P
## 14:     A15DP4PYW Example3  A15DP4P
## 15:     A15DP4PTL Example3  A15DP4P

- Dorian Grv

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- Rich Scriven · Accepted Answer

以下是三种可能的解决方案。

library(stringi)
library(stringdist)

a <- "hello"
b <- "hel123l5678o"

## get all forward substrings of 'b'
sb <- stri_sub(b, 1, 1:nchar(b))
## extract them from 'a' if they exist
sstr <- na.omit(stri_extract_all_coll(a, sb, simplify=TRUE))
## match the longest one
sstr[which.max(nchar(sstr))]
# [1] "hel"

在基本的 R 中也有 adist() 和 agrep()，而 stringdist 包中有一些运行 LCS 方法的函数。下面看一下 stringsidt。它返回未配对字符的数量。

stringdist(a, b, method="lcs")
# [1] 7

Filter("!", mapply(
    stringdist, 
    stri_sub(b, 1, 1:nchar(b)),
    stri_sub(a, 1, 1:nchar(b)),
    MoreArgs = list(method = "lcs")
))
#  h  he hel 
#  0   0   0

现在我进一步探索后，认为adist()可能是最好的选择。如果我们设置counts=TRUE，我们将得到一系列的Matches、Insertions等。那么如果你把它传递给stri_locate()，我们就可以使用该矩阵从a到b获取匹配项。

ta <- drop(attr(adist(a, b, counts=TRUE), "trafos")))
# [1] "MMMIIIMIIIIM"

因此，M 值表示直接匹配。我们可以使用 stri_sub() 获取子字符串。

stri_sub(b, stri_locate_all_regex(ta, "M+")[[1]])
# [1] "hel" "l"   "o"

对不起，我没有很好地解释清楚，因为我对字符串距离算法不是很熟悉。