查找匹配字符串算法

Question

查找匹配字符串算法

3

我有五个非常长的字符串（字符串数量可能会改变）。这些字符串没有固定格式。我会提供一个数字，表示子字符串的长度。我想要找到与给定长度匹配的子字符串。例如，字符串如下：

      1.     abcabcabc
      2.     abcasdfklop

字符串长度：3

给定这些值，输出将类似于以下内容：

匹配 #1：

 Matched string :               "abc"

 Matches in first string:        3

 Matching positions:             0,3,6

 Matches in second string:       1

 Match positions:                0

比赛 #2:

 Matched string :               "bca"

 Matches in first string:        2

 Matching positions:             1,4

 Matches in second string:       1

 Match    positions:             1

我用了4个foreach语句完成了这项任务。但我觉得这种方法效率太低了，特别是当输入的数据量很大时。在C#中，是否有更有效的建议或简短的方法来优化这个过程？

- Ozgur Dogus

尝试使用正则表达式。 - Hossein Narimani Rad

你能否请再详细解释一下？ - Ozgur Dogus

3

4个循环看起来是合理的。由于它可能只适用于非常狭窄的领域（例如DNA测序），不要对一般性论坛抱有太高期望... - Alexei Levenkov

1

是的，实际上这正是DNA测序。它将成为一项科学研究的一部分。所以我正在努力提供最好的结果 :) 感谢您的评论。 - Ozgur Dogus

你可以展示给我们你现有的代码，这样我们就可以帮助评论，看看是否有优化的空间。 - Raptor

4个回答

2

如果您能够使用4个不嵌套的foreach语句完成此操作，则应该可以了，您可能不需要进行优化。

以下是我会尝试的一些内容。创建一个类似于以下结构的结构。

class SubString
{
    string str;
    int position;
}

将两个字符串分别划分为所有可能的子字符串，并将其存储到一个数组中。这个过程的时间复杂度是O(n2)。

现在按照字符串长度对这些数组进行排序（时间复杂度为O(n*log(n))），并遍历它们以找出匹配项。

你需要额外的结构来保存结果，这可能还需要一些调整，但你可以看到这个方法的思路。

- Monte Chavis

1

你可以使用后缀树的变种来解决这个问题。 http://en.wikipedia.org/wiki/Longest_common_substring_problem 此外，还可以查看这个：Algorithm: Find all common substrings between two strings where order is preserved

- jack_carver

0

如果使用非常大的字符串，内存可能会成为一个问题。下面的代码找到最长的公共子串并覆盖包含较小公共子串的变量，但可以轻松地改为将索引和长度推送到列表中，然后作为字符串数组返回。

这是来自Ashutosh Singh的重构C++代码，链接在https://iq.opengenus.org/longest-common-substring-using-rolling-hash/ - 这将在O(N * log(N)^2)时间和O(N)空间中找到子字符串。

using System;
using System.Collections.Generic;
public class RollingHash
{
    private class RollingHashPowers
    {
        // _mod = prime modulus of polynomial hashing
        // any prime number over a billion should suffice
        internal const int _mod = (int)1e9 + 123;
        // _hashBase = base (point of hashing)
        // this should be a prime number larger than the number of characters used
        // in my use case I am only interested in ASCII (256) characters
        // for strings in languages using non-latin characters, this should be much larger
        internal const long _hashBase = 257;
        // _pow1 = powers of base modulo mod
        internal readonly List<int> _pow1 = new List<int> { 1 };
        // _pow2 = powers of base modulo 2^64
        internal readonly List<long> _pow2 = new List<long> { 1L };

        internal void EnsureLength(int length)
        {
            if (_pow1.Capacity < length)
            {
                _pow1.Capacity = _pow2.Capacity = length;
            }
            for (int currentIndx = _pow1.Count - 1; currentIndx < length; ++currentIndx)
            {
                _pow1.Add((int)(_pow1[currentIndx] * _hashBase % _mod));
                _pow2.Add(_pow2[currentIndx] * _hashBase);
            }
        }
    }

    private class RollingHashedString
    {
        readonly RollingHashPowers _pows;
        readonly int[] _pref1; // Hash on prefix modulo mod
        readonly long[] _pref2; // Hash on prefix modulo 2^64

        // Constructor from string:
        internal RollingHashedString(RollingHashPowers pows, string s, bool caseInsensitive = false)
        {
            _pows = pows;
            _pref1 = new int[s.Length + 1];
            _pref2 = new long[s.Length + 1];

            const long capAVal = 'A';
            const long capZVal = 'Z';
            const long aADif = 'a' - 'A';

            unsafe
            {
                fixed (char* c = s)
                {
                    // Fill arrays with polynomial hashes on prefix
                    for (int i = 0; i < s.Length; ++i)
                    {
                        long v = c[i];
                        if (caseInsensitive && capAVal <= v && v <= capZVal)
                        {
                            v += aADif;
                        }
                        _pref1[i + 1] = (int)((_pref1[i] + v * _pows._pow1[i]) % RollingHashPowers._mod);
                        _pref2[i + 1] = _pref2[i] + v * _pows._pow2[i];
                    }
                }
            }
        }

        // Rollingnomial hash of subsequence [pos, pos+len)
        // If mxPow != 0, value automatically multiply on base in needed power.
        // Finally base ^ mxPow
        internal Tuple<int, long> Apply(int pos, int len, int mxPow = 0)
        {
            int hash1 = _pref1[pos + len] - _pref1[pos];
            long hash2 = _pref2[pos + len] - _pref2[pos];
            if (hash1 < 0)
            {
                hash1 += RollingHashPowers._mod;
            }
            if (mxPow != 0)
            {
                hash1 = (int)((long)hash1 * _pows._pow1[mxPow - (pos + len - 1)] % RollingHashPowers._mod);
                hash2 *= _pows._pow2[mxPow - (pos + len - 1)];
            }
            return Tuple.Create(hash1, hash2);
        }
    }

    private readonly RollingHashPowers _rhp;
    public RollingHash(int longestLength = 0)
    {
        _rhp = new RollingHashPowers();
        if (longestLength > 0)
        {
            _rhp.EnsureLength(longestLength);
        }
    }

    public string FindCommonSubstring(string a, string b, bool caseInsensitive = false)
    {
        // Calculate max neede power of base:
        int mxPow = Math.Max(a.Length, b.Length);
        _rhp.EnsureLength(mxPow);
        // Create hashing objects from strings:
        RollingHashedString hash_a = new RollingHashedString(_rhp, a, caseInsensitive);
        RollingHashedString hash_b = new RollingHashedString(_rhp, b, caseInsensitive);

        // Binary search by length of same subsequence:
        int pos = -1;
        int low = 0;
        int minLen = Math.Min(a.Length, b.Length);
        int high = minLen + 1;
        var tupleCompare = Comparer<Tuple<int, long>>.Default;
        while (high - low > 1)
        {
            int mid = (low + high) / 2;
            List<Tuple<int, long>> hashes = new List<Tuple<int, long>>(a.Length - mid + 1);
            for (int i = 0; i + mid <= a.Length; ++i)
            {
                hashes.Add(hash_a.Apply(i, mid, mxPow));
            }
            hashes.Sort(tupleCompare);
            int p = -1;
            for (int i = 0; i + mid <= b.Length; ++i)
            {
                if (hashes.BinarySearch(hash_b.Apply(i, mid, mxPow), tupleCompare) >= 0)
                {
                    p = i;
                    break;
                }
            }
            if (p >= 0)
            {
                low = mid;
                pos = p;
            }
            else
            {
                high = mid;
            }
        }
        // Output answer:
        return pos >= 0
            ? b.Substring(pos, low)
            : string.Empty;
    }
}

- Brent

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- tmyklebu · Accepted Answer

你可以使用后缀数组来实现。 (后缀树也可以正常工作，但它们需要更多的空间、时间和实现上的关注。)

将两个字符串连接起来，使用一个两个字符串都不含有的字符作为分隔符。然后构建一个后缀数组，就可以读出答案。

标准的后缀数组会给你一个按词典序排序的指向字符串后缀的指针数组，以及一个“最长公共前缀长度”数组，告诉你相邻的两个字典序后缀的最长公共前缀长度是多少。

使用最长公共前缀长度数组获取所需信息相当简单；找到所有最长公共前缀长度至少为查询长度的最大子数组，然后对于每个在第一个字符串和第二个字符串中都有匹配的子串，报告相应的前缀并报告它出现了 K+1 次，其中 K 是最大子数组的长度。

另一种更易编码的方法是对所有适当长度的子串进行哈希处理。你可以用任何滚动哈希函数轻松地完成这个过程。为每个哈希存储一个指向字符串的动态数组；一旦你已经哈希处理完所有的字符串，就遍历所有出现的哈希值，并查找匹配项。你需要想办法处理假阳性；一种（概率性）的方法是使用多个哈希函数，直到假阳性的概率可以接受。另一种方法，在你只有少量匹配项的情况下可能只能接受，那就是直接比较字符串。