3个或更多字符串的最长公共子序列

Question

3个或更多字符串的最长公共子序列

24

我正在尝试寻找3个或更多字符串的最长公共子序列。维基百科文章有一个很好的描述，介绍了如何为两个字符串解决此问题，但我不太确定如何将其扩展到3个或更多字符串。

有许多库可以用于查找2个字符串的LCS，因此如果可能的话，我想使用其中之一。如果我有3个字符串A、B和C，那么将A和B的LCS命名为X，然后找到X和C的LCS，这样做是否有效？还是说这种方法是错误的？

我已经在Python中实现了它，代码如下:

import difflib

def lcs(str1, str2):
    sm = difflib.SequenceMatcher()
    sm.set_seqs(str1, str2)
    matching_blocks = [str1[m.a:m.a+m.size] for m in sm.get_matching_blocks()]
    return "".join(matching_blocks)

print reduce(lcs, ['abacbdab', 'bdcaba', 'cbacaa'])

这会输出 "ba"，但实际上应该是 "baa"。

- del

这是一个非常著名的问题。只需谷歌一下即可。请参阅[http://www.springerlink.com/content/fu4t4442l7577712/]，其中介绍了两种解决此问题的算法。还可以查看[http://www.perlmonks.org/?node_id=548827]（谷歌上还有许多其他链接）。 - Luixv

2

这是“baa”。他想要最长的公共子序列，而不是最长的公共子串。 - ypercubeᵀᴹ

3

你提到的迭代方法行不通。例如，它会给出abc12、12abc、12xyz 错误的结果。 - sth

2

请注意，任意两个字符串可能有多个最长公共子序列。例如，对于您的前两个字符串 ('abacbdab', 'bdcaba')，至少存在五个最长公共子序列： "baa", "bab", "bca", "bcb", "dab"。您的代码是否考虑到了这一点？ - ypercubeᵀᴹ

2

哎呀，手动查找最长公共子序列并不那么简单！在上面的例子中有三个字符串：“bcab”，“bdab”和“bcba”，这就解释了为什么你的代码无法找到“baa”。但是重点仍然有效，两个字符串之间可能会有多个最长公共子序列。 - ypercubeᵀᴹ

@ypercube - 好观点，我的代码没有考虑到这一点，因为difflib只给出一个LCS。 - del

5个回答

9

我只是为了一份作业而写了这个Python的动态规划解决方案，它非常高效。它的时间复杂度为O(nml)，其中n、m和l是三个序列的长度。

这个解决方案的工作原理是创建一个三维数组，然后枚举所有三个序列来计算最长子序列的路径。然后你可以通过回溯数组来从其路径重构实际子序列。

因此，你需要将数组初始化为所有零，然后枚举三个序列。在枚举的每一步中，如果有匹配项，则将最长子序列的长度加1，否则就从枚举的前一步中继续最长子序列。

完成枚举后，现在可以通过遍历数组来从你所采取的步骤中重构子序列。从数组中的最后一个条目向后移动时，每次遇到匹配项都会在任何序列中查找它（使用数组的坐标），并将其添加到子序列中。

def lcs3(a, b, c):
    m = len(a)
    l = len(b)
    n = len(c)
    subs = [[[0 for k in range(n+1)] for j in range(l+1)] for i in range(m+1)]

    for i, x in enumerate(a):
        for j, y in enumerate(b):
            for k, z in enumerate(c):
                if x == y and y == z:
                    subs[i+1][j+1][k+1] = subs[i][j][k] + 1
                else:
                    subs[i+1][j+1][k+1] = max(subs[i+1][j+1][k], 
                                              subs[i][j+1][k+1], 
                                              subs[i+1][j][k+1])
    # return subs[-1][-1][-1] #if you only need the length of the lcs
    lcs = ""
    while m > 0 and l > 0 and n > 0:
        step = subs[m][l][n]
        if step == subs[m-1][l][n]:
            m -= 1
        elif step == subs[m][l-1][n]:
            l -= 1
        elif step == subs[m][l][n-1]:
            n -= 1
        else:
            lcs += str(a[m-1])
            m -= 1
            l -= 1
            n -= 1

    return lcs[::-1]

- BarthesSimpson

4

以下代码可以在N个字符串中找到最长的公共子序列。使用itertools生成所需的索引组合，然后使用这些索引查找公共子字符串。

执行示例：
输入：
输入序列数：3
输入序列1：83217
输入序列2：8213897
输入序列3：683147

输出：
837

from itertools import product
import numpy as np
import pdb

def neighbors(index):
    N = len(index)
    for relative_index in product((0, -1), repeat=N):
        if not all(i == 0 for i in relative_index):
            yield tuple(i + i_rel for i, i_rel in zip(index, relative_index))

def longestCommonSubSequenceOfN(sqs):
    numberOfSequences = len(sqs);
    lengths = np.array([len(sequence) for sequence in sqs]);
    incrLengths = lengths + 1;
    lengths = tuple(lengths);
    inverseDistances = np.zeros(incrLengths);
    ranges = [tuple(range(1, length+1)) for length in lengths[::-1]];
    for tupleIndex in product(*ranges):
        tupleIndex = tupleIndex[::-1];
        neighborIndexes = list(neighbors(tupleIndex));
        operationsWithMisMatch = np.array([]);
        for neighborIndex in neighborIndexes:
            operationsWithMisMatch = np.append(operationsWithMisMatch, inverseDistances[neighborIndex]);
        operationsWithMatch = np.copy(operationsWithMisMatch);
        operationsWithMatch[-1] = operationsWithMatch[-1] + 1;
        chars = [sqs[i][neighborIndexes[-1][i]] for i in range(numberOfSequences)];
        if(all(elem == chars[0] for elem in chars)):
            inverseDistances[tupleIndex] = max(operationsWithMatch);
        else:
            inverseDistances[tupleIndex] = max(operationsWithMisMatch);
        # pdb.set_trace();

    subString = "";
    mainTupleIndex = lengths;
    while(all(ind > 0 for ind in mainTupleIndex)):
        neighborsIndexes = list(neighbors(mainTupleIndex));
        anyOperation = False;
        for tupleIndex in neighborsIndexes:
            current = inverseDistances[mainTupleIndex];
            if(current == inverseDistances[tupleIndex]):
                mainTupleIndex = tupleIndex;
                anyOperation = True;
                break;
        if(not anyOperation):
            subString += str(sqs[0][mainTupleIndex[0] - 1]);
            mainTupleIndex = neighborsIndexes[-1];
    return subString[::-1];

numberOfSequences = int(input("Enter the number of sequences: "));
sequences = [input("Enter sequence {} : ".format(i)) for i in range(1, numberOfSequences + 1)];
print(longestCommonSubSequenceOfN(sequences));

- Naveen Chelimilla

4

要找到两个字符串A和B的最长公共子序列（LCS），可以像你发布的链接中所示，沿着对角线遍历一个二维数组。数组中的每个元素都对应于查找子串A'和B'（A由其行号切割，B由其列号切割）的LCS的问题。可以通过计算数组中所有元素的值来解决此问题。在计算数组元素的值时，必须确保计算该值所需的所有子问题已经得到解决。这就是为什么要沿着二维数组对角线遍历的原因。

可以将此解决方案扩展到查找N个字符串之间的最长公共子序列，但这需要一种通用的方法来迭代N维数组，以便只有在解决元素所需的所有子问题时才能到达任何元素。

与特定顺序迭代N维数组不同，您还可以递归解决此问题。使用递归时，保存中间解非常重要，因为许多分支将需要相同的中间解。我已经用C#编写了一个小例子来实现这一点：

string lcs(string[] strings)
{
    if (strings.Length == 0)
        return "";
    if (strings.Length == 1)
        return strings[0];
    int max = -1;
    int cacheSize = 1;
    for (int i = 0; i < strings.Length; i++)
    {
        cacheSize *= strings[i].Length;
        if (strings[i].Length > max)
            max = strings[i].Length;
    }
    string[] cache = new string[cacheSize];
    int[] indexes = new int[strings.Length];
    for (int i = 0; i < indexes.Length; i++)
        indexes[i] = strings[i].Length - 1;
    return lcsBack(strings, indexes, cache);
}
string lcsBack(string[] strings, int[] indexes, string[] cache)
{
    for (int i = 0; i < indexes.Length; i++ )
        if (indexes[i] == -1)
            return "";
    bool match = true;
    for (int i = 1; i < indexes.Length; i++)
    {
        if (strings[0][indexes[0]] != strings[i][indexes[i]])
        {
            match = false;
            break;
        }
    }
    if (match)
    {
        int[] newIndexes = new int[indexes.Length];
        for (int i = 0; i < indexes.Length; i++)
            newIndexes[i] = indexes[i] - 1;
        string result = lcsBack(strings, newIndexes, cache) + strings[0][indexes[0]];
        cache[calcCachePos(indexes, strings)] = result;
        return result;
    }
    else
    {
        string[] subStrings = new string[strings.Length];
        for (int i = 0; i < strings.Length; i++)
        {
            if (indexes[i] <= 0)
                subStrings[i] = "";
            else
            {
                int[] newIndexes = new int[indexes.Length];
                for (int j = 0; j < indexes.Length; j++)
                    newIndexes[j] = indexes[j];
                newIndexes[i]--;
                int cachePos = calcCachePos(newIndexes, strings);
                if (cache[cachePos] == null)
                    subStrings[i] = lcsBack(strings, newIndexes, cache);
                else
                    subStrings[i] = cache[cachePos];
            }
        }
        string longestString = "";
        int longestLength = 0;
        for (int i = 0; i < subStrings.Length; i++)
        {
            if (subStrings[i].Length > longestLength)
            {
                longestString = subStrings[i];
                longestLength = longestString.Length;
            }
        }
        cache[calcCachePos(indexes, strings)] = longestString;
        return longestString;
    }
}
int calcCachePos(int[] indexes, string[] strings)
{
    int factor = 1;
    int pos = 0;
    for (int i = 0; i < indexes.Length; i++)
    {
        pos += indexes[i] * factor;
        factor *= strings[i].Length;
    }
    return pos;
}

我的示例代码可以进一步优化。许多被缓存的字符串是重复的，有些只是添加了一个额外的字符而已。当输入字符串变大时，这会占用更多的空间。

对于输入："666222054263314443712"、"5432127413542377777" 和 "6664664565464057425"

LCS 返回的是 "54442"。

- Kent Munthe Caspersen

2

这里是解决方案的链接查看此处的解释输出是LCS的长度为2

 # Python program to find 
 # LCS of three strings 

 # Returns length of LCS 
 # for X[0..m-1], Y[0..n-1] 
 # and Z[0..o-1] 
def lcsOf3(X, Y, Z, m, n, o): 

    L = [[[0 for i in range(o+1)] for j in range(n+1)] 
        for k in range(m+1)] 

    ''' Following steps build L[m+1][n+1][o+1] in 
    bottom up fashion. Note that L[i][j][k] 
    contains length of LCS of X[0..i-1] and 
    Y[0..j-1] and Z[0.....k-1] '''
   for i in range(m+1): 
    for j in range(n+1): 
        for k in range(o+1): 
            if (i == 0 or j == 0 or k == 0): 
                L[i][j][k] = 0

            elif (X[i-1] == Y[j-1] and
                  X[i-1] == Z[k-1]): 
                L[i][j][k] = L[i-1][j-1][k-1] + 1

            else: 
                L[i][j][k] = max(max(L[i-1][j][k], 
                L[i][j-1][k]), 
                                L[i][j][k-1]) 

      # L[m][n][o] contains length of LCS for 
      # X[0..n-1] and Y[0..m-1] and Z[0..o-1] 
    return L[m][n][o] 

  # Driver program to test above function 

X = 'AGGT12'
Y = '12TXAYB'
Z = '12XBA'

m = len(X) 
n = len(Y) 
o = len(Z) 

print('Length of LCS is', lcsOf3(X, Y, Z, m, n, o)) 

# This code is contributed by Soumen Ghosh.

- Nsikan Sylvester

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- IVlad · Accepted Answer

33

只需将递归关系推广至通用形式。

对于三个字符串：

dp[i, j, k] = 1 + dp[i - 1, j - 1, k - 1] if A[i] = B[j] = C[k]
              max(dp[i - 1, j, k], dp[i, j - 1, k], dp[i, j, k - 1]) otherwise

这应该很容易推广到更多的字符串。

- IVlad

1

我按照你的建议从这里调整了LCS代码：http://code.activestate.com/recipes/576869-longest-common-subsequence-problem-solver/。它可以工作，但速度非常慢，并且在较长序列上会遇到最大递归深度错误。我希望有一个已经实现了这个功能的库。 - del

2

@del - 它不会变得更快，因为这个问题没有更好的精确解决方案。 - IVlad

我已经发布了一个解决方案，它使用动态规划来保存中间结果，以实现更快的解决方案和更少的内存使用。 - Kent Munthe Caspersen

我的解决方案使用递归，并适用于n个字符串。它使用一个n维数组，其中条目[i，j，k，...]存储在字符串1（0，i），字符串2（0，j），字符串3（0，k）之间找到LCS问题的解决方案。 - Kent Munthe Caspersen

我写得比较快，所以代码不是很干净，但是因为我在其他地方找不到实现，所以我决定还是发布出来。它的复杂度是O(n^m)，其中n是最大字符串长度，m是字符串数量。你的解决方案简单而强大，完美地展示了这种技术。 - Kent Munthe Caspersen

显示剩余5条评论