Find the number of substrings of a string that contain some anagram of another string as a subsequence

4

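We need to find the number of substrings of a string that contain some anagram of another string as a subsequence.

Two substrings are considered different only if their start or end positions differ.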

String="aba"
anotherString="a"

Occurrences of "a" in "aba" are as follows:

a     at index 0..0
ab    at index 0..1
aba   at index 0..2
ba    at index 1..2
a     at index 2..2

That is 5 occurrences in total, so the expected output is 5.
(The start and end indices here are inclusive.)
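To make the expected count concrete, here is a minimal brute-force sketch in Java that enumerates every (start, end) pair. It relies on the observation that a window contains an anagram of the second string as a subsequence exactly when it has at least as many of each character. The class and method names are hypothetical, and the sketch assumes all characters fall in the range 0..255, like the code in the question.

```java
// Brute-force reference counter: try every start and end index.
// Hypothetical names; assumes all characters are in the range 0..255.
final class BruteForceCount {
    private static final int ALPHABET = 256;

    static long countGoodSubstrings(String s, String t) {
        // need[c] = how many times character c occurs in t
        int[] need = new int[ALPHABET];
        for (int i = 0; i < t.length(); i++) {
            need[t.charAt(i)]++;
        }
        long total = 0;
        for (int start = 0; start < s.length(); start++) {
            // have[c] = how many times c occurs in s[start..end]
            int[] have = new int[ALPHABET];
            for (int end = start; end < s.length(); end++) {
                have[s.charAt(end)]++;
                if (covers(have, need)) {
                    total++;
                }
            }
        }
        return total;
    }

    // True when the window has at least as many of every character as t.
    private static boolean covers(int[] have, int[] need) {
        for (int c = 0; c < ALPHABET; c++) {
            if (have[c] < need[c]) {
                return false;
            }
        }
        return true;
    }
}
```

Calling `BruteForceCount.countGoodSubstrings("aba", "a")` counts the five windows listed above. The running time is quadratic in the length of the first string (times the alphabet size), so this is only useful as a reference to check a faster solution against.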

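I think this problem is an application of "number of occurrences of a subsequence in a string" and "find the smallest window that contains all characters of another string".
But even after making many changes while combining the code for the two, I could not come up with a solution. Pasting my code is not very useful, because I know where I went wrong. What I want to know is how to solve this problem efficiently without a brute-force solution.
Code: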
public static void findLengthAndSequence(String str1,String str2){

    int begin=0,biginWith=0,endWith=0,count=0,minLen=Integer.MAX_VALUE,len=0;
    int l=0;

    int [] hasFound=new int[256];
    int [] toFound=new int[256];

    for(int i=0;i<str2.length();++i){           
        toFound[(int)str2.charAt(i)]++;
    }

    for(int end=0;end<str1.length();++end){
        if(toFound[(int)str1.charAt(end)]==0)
            continue;
        hasFound[(int)str1.charAt(end)]++;
        if(hasFound[(int)str1.charAt(end)]<=toFound[(int)str1.charAt(end)]){
            count++;
        }

        if(count==str2.length()){
            l++;        //add this to find number of such anagram in string
            System.out.println("l= "+l+" "+begin+" "+end); 
            while(toFound[(int)str1.charAt(begin)]==0 || hasFound[(int)str1.charAt(begin)]>toFound[(int)str1.charAt(begin)]  )
            {
                if(hasFound[(int)str1.charAt(begin)]>toFound[(int)str1.charAt(begin)]){
                    hasFound[(int)str1.charAt(begin)]-=1;                       
                }
                begin++;
            }//while
            len=end-begin+1;
            if(minLen>len){
                minLen=len;
                endWith=end;
                biginWith=begin;
            }
        }//if
    }//end

    for(int i=biginWith;i<=endWith;++i){
        System.out.print(""+str1.charAt(i));
    }
}

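This code outputs 3 for the example above, whereas the expected answer is 5. I know that once I have traversed the first string and reached its end, I can no longer check every substring.
For example, in "aba" my code checks a, ab, and aba, but once it reaches the end it does not check ba and a.
We need to count these as well, since they have different index values.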

Is there any way to check every possible substring other than the exponential-time brute-force approach?


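Checking all substrings is not exponential. A string of length n has exactly n * (n + 1) / 2 non-empty substrings. That is clearly polynomial. - kraskevich
So what time complexity are you aiming for? - kraskevich
@ILoveCoding That was a mistake, thanks. But each substring check has an additional cost, which gives O(MN) overall. I think we cannot do better than O(MN), where M and N are the string lengths. Do you have an algorithm for this problem? - Cyclotron3x3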
1 Answer

5
Here is a simple solution with O(n + m) time complexity (assuming the alphabet size is constant), where n is the length of the first string (the one whose substrings we count) and m is the length of the second string (the anagram string). I will call a substring that contains an anagram of the second string "good".
  1. Let's define count(x, y) as the number of occurrences of the character y in a string x. Then an arbitrary string s contains an anagram of a string t as a subsequence if and only if count(s, c) >= count(t, c) for all c (the proof is simple, so I will omit it).

  2. Let's define firstRight(L) as the smallest R such that the substring [L, R] is good (it is possible that there is no such R). Then firstRight(L) <= firstRight(L + 1) for all valid L (because of statement 1 and the properties of the count(x, y) function).

  3. Statement 1 implies that any string can be represented as a vector with alphabetSize elements, where the i-th element of the vector is the number of occurrences of the character i. Statement 2 implies that we can use two pointers.

  4. Pseudocode for this algorithm looks like this:

    def getCharacterVector(string s):
        result = a vector filled with zeros
        for c in s
            result[c]++
        return result
    
    // Checks that all elements of the first vector
    // are greater than or equal to the corresponding
    // elements of the second vector
    def isGreaterOrEqual(first, second)
        for i = 0 ... length(first) - 1
            if first[i] < second[i]
                return false
        return true
    
    def countSubstrings(string s, string t)
        vT = getCharacterVector(t)
        vS = a vector filled with zeros
        right = 0
        // computes firstRight(0)
        while (right < length(s) and not isGreaterOrEqual(vS, vT))
            vS[s[right]]++
            right++
        if not isGreaterOrEqual(vS, vT) // firstRight(0) is undefined
            return 0 // there are no such substrings
        res = length(s) - right + 1
        for left = 1 ... length(s) - 1
            vS[s[left - 1]]--
            // computes firstRight(left)
            while right < length(s) and vS[s[left - 1]] < vT[s[left - 1]]
                vS[s[right]]++
                right++
            if vS[s[left - 1]] < vT[s[left - 1]] // firstRight(left) is undefined
                break // we are done
            res += length(s) - right + 1
        return res
    

    The idea here is to compute the number of good substrings that start at a fixed position and end anywhere, and to use two pointers to adjust the right border efficiently. The time complexity of this implementation is O(N * ALPHABET_SIZE + M) (which is O(N + M) if we treat the alphabet size as a constant), but it is actually possible to make the firstRight(0) computation more efficient by keeping track of the "bad" positions in the vS and vT vectors and representing these vectors as hash tables, which achieves O(N + M) complexity regardless of the alphabet size.
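The pseudocode above could be translated to Java roughly as follows. This is a sketch, not a definitive implementation: the class name is hypothetical, characters are assumed to lie in the range 0..255, and the handling of an empty second string (every non-empty substring counts as good) is a convention chosen here rather than something the pseudocode specifies.

```java
// Two-pointer counter following the pseudocode; names are hypothetical.
final class GoodSubstringCounter {
    private static final int ALPHABET = 256; // assumption: chars in 0..255

    static long countSubstrings(String s, String t) {
        int n = s.length();
        if (t.isEmpty()) {
            // Convention chosen for this sketch, not taken from the pseudocode.
            return (long) n * (n + 1) / 2;
        }
        int[] vT = new int[ALPHABET];
        for (int i = 0; i < t.length(); i++) {
            vT[t.charAt(i)]++;
        }
        int[] vS = new int[ALPHABET];
        int right = 0; // exclusive end of the current window
        // Compute firstRight(0).
        while (right < n && !isGreaterOrEqual(vS, vT)) {
            vS[s.charAt(right)]++;
            right++;
        }
        if (!isGreaterOrEqual(vS, vT)) {
            return 0; // firstRight(0) is undefined: no good substrings
        }
        long res = n - right + 1;
        for (int left = 1; left < n; left++) {
            char removed = s.charAt(left - 1);
            vS[removed]--;
            // Only the removed character can have become deficient.
            while (right < n && vS[removed] < vT[removed]) {
                vS[s.charAt(right)]++;
                right++;
            }
            if (vS[removed] < vT[removed]) {
                break; // firstRight(left) is undefined: we are done
            }
            res += n - right + 1;
        }
        return res;
    }

    private static boolean isGreaterOrEqual(int[] first, int[] second) {
        for (int i = 0; i < first.length; i++) {
            if (first[i] < second[i]) {
                return false;
            }
        }
        return true;
    }
}
```

For the example in the question, `GoodSubstringCounter.countSubstrings("aba", "a")` adds 3 for left = 0, then 1 for left = 1 and 1 for left = 2.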


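Looks good. Another way to make this linear time is to keep track of how many matches have been found during the firstRight(0) computation. When vS[s[right]] first becomes equal to vT[s[right]], you can increase the number of matches by vS[s[right]]. Once the number of matches reaches the length of t, the loop can stop. - Peter de Rivaz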
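One possible reading of this suggestion, combined with the hash-table idea from the answer, is sketched below in Java. The class name and the `matched` counter are hypothetical, the count vectors are stored in `HashMap`s keyed by UTF-16 code unit, and an empty second string is again treated as matching every non-empty substring, which is an assumption of this sketch.

```java
import java.util.HashMap;
import java.util.Map;

// Variant that avoids scanning the whole alphabet; names are hypothetical.
final class MatchCountingCounter {
    static long countSubstrings(String s, String t) {
        int n = s.length();
        int m = t.length();
        if (m == 0) {
            return (long) n * (n + 1) / 2; // convention chosen for this sketch
        }
        Map<Character, Integer> vT = new HashMap<>();
        for (int i = 0; i < m; i++) {
            vT.merge(t.charAt(i), 1, Integer::sum);
        }
        Map<Character, Integer> vS = new HashMap<>();
        int right = 0;   // exclusive end of the current window
        int matched = 0; // characters of t fully covered so far
        // Compute firstRight(0) without comparing whole vectors.
        while (right < n && matched < m) {
            char c = s.charAt(right);
            int now = vS.merge(c, 1, Integer::sum);
            if (now == vT.getOrDefault(c, 0)) {
                matched += now; // this character's quota was just reached
            }
            right++;
        }
        if (matched < m) {
            return 0; // no good substring exists
        }
        long res = n - right + 1;
        for (int left = 1; left < n; left++) {
            char removed = s.charAt(left - 1);
            int need = vT.getOrDefault(removed, 0);
            int have = vS.merge(removed, -1, Integer::sum);
            // Extend the window until the removed character is covered again.
            while (right < n && have < need) {
                char c = s.charAt(right);
                int now = vS.merge(c, 1, Integer::sum);
                if (c == removed) {
                    have = now;
                }
                right++;
            }
            if (have < need) {
                break; // no good substring starts at left or later
            }
            res += n - right + 1;
        }
        return res;
    }
}
```

Each index of the first string is added to the window once and removed once, and every step costs an expected constant number of hash operations, so the expected running time is O(N + M) without a factor for the alphabet size.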
