在一个字符流中统计一个单词出现的次数

3

面试时有人问我这个问题,虽然我在数据结构和算法方面很擅长,但是这个问题我没有能够解决。不管怎样,这是一个有趣的问题,所以我想分享一下。

问题:你有一个字符流输入,并且你需要计算一个单词出现的次数。你只有一个API可以从流中读取字符,即stream.next_char(),如果没有字符则返回"\0"。

int count_occurrences(Stream stream, String word) {
// you have only one function provided from Stream class that you can use to 
// read one char at a time, no length/size etc.
// stream.next_char() - return "\0" if end
}

输入: "aabckjhabcc" 关键词: "abc" 输出: 2


“word” 是如何定义的?我的意思是,在你的示例中,输入流没有分隔符,所以我认为它不包含单词“abc”。 - Iłya Bursov
2
简单(次优)的答案:保持一个环形缓冲区,其中包含最后lengthOfTheWord个字符,并在每次调用next_char()后进行完整的字符串比较。更好答案的提示:是否有子字符串搜索算法只检查文本中递增顺序的字符,而不跳过某些字符? - j_random_hacker
3
听起来你想要一个DFA来解析字符串并计算访问最终状态的次数。 - G. Bach
1
另一种解决方法是使用Knuth-Morris-Prath算法。但是它需要占用O(n)的内存。 - Ivan Gritsenko
@IvanGritsenko;这就是我们需要找到的简单算法。@j_random_hacker提供的缓冲解决方案对我来说看起来不错。 - Rahul.B
显示剩余2条评论
5个回答

0

最简单的解决方案是使用带有多达 word.length() 个符号的缓冲区:

static int count_occurrences(final Stream stream, final String word) {
    int found = 0;

    char c;
    String tmpWord = "";
    while ((c = stream.next_char()) != 0) {
        tmpWord += c;
        if (tmpWord.length() > word.length()) {
            tmpWord = tmpWord.substring(1);
        }
        if (tmpWord.equals(word)) {
            found++;
        }
    }
    return found;
}

复杂度为O(N*M),内存占用为O(M)


2
尝试使用以下内容:word = "ababc",stream = "abababc" - obe
它能处理重叠的单词吗?例如,在“abbabba”中,它能找到“abba”的两个出现吗? - Jim Mischel

0
int count_occurrences(Stream stream, String word) {
    int occurrence = 0;
    int current_index = 0;
    int word_length = word.length();
    char[] word_chars = word.toCharArray();

    char c = stream.next_char();
    while(c != '\0') {
        if( c == word_chars[current_index] ) {
            current_index++;
            if(current_index >= word_length) {
                occurrence++;
                current_index = 0;
            }
        }
        else {
            current_index = 0;
            if( c == word_chars[current_index] ) {
                current_index++;
            }
        }
        c = stream.next_char();

    }
    return occurrence;
}

尝试使用以下代码进行测试: word = "ababc", stream = "abababc" - @obe - Iłya Bursov
确实是个好发现! - PetrosP

0
也许是这样的吗?
int count_occurrences(Stream stream, String word) {
    // you have only one function provided from Stream class that you can use to 
    // read one char at a time, no length/size etc.
    // stream.next_char() - return "\0" if end

    List<int> positions = new List<int>();

    int counter = 0;
    while (true) {
        char ch = stream.next_char();
        if (ch == '\0') return counter;

        if (ch == word.charAt(0)) {
            positions.add(0);
        }

        int i = 0;
        while (i < positions.length) {
            int pos = positions[i];

            if (word.charAt(pos) != ch) {
                positions.remove(i);
                continue;
            }

            pos++;
            if (pos == word.length()) {
                positions.remove(i);
                counter++;
                continue;
            }

            positions[i] = pos;
            i++;
        }
    }
}

我想最坏情况下的时间复杂度将是O(N*N),而内存复杂度为O(N)。 - Iłya Bursov

0

我认为这个问题与使用有限状态计算模型匹配字符串的想法有关。

可以通过使用KMP字符串匹配算法来解决此问题。

KMP算法尝试在文本字符串中查找模式字符串的匹配项,考虑到即使在某些点上发现不匹配,仍然匹配了多少前缀。

为了确定“如果我们在匹配到模式索引i之后遇到不匹配,还能匹配多少前缀”,需要提前构建一个失败函数。(请参考下面的代码以获取构建失败函数值的方法)

对于每个模式索引i,该失败函数将告诉我们,即使在索引i之后遇到不匹配,模式的多少前缀仍然可以匹配。

思路是找出模式的最长合适前缀的长度,该前缀也是其每个子字符串(由1到i索引表示,其中i从1到n)的后缀。

我使用字符串索引从1开始。

因此,任何模式的第一个字符的失败函数值为0。(即到目前为止没有匹配任何字符)。

对于后续字符,对于每个索引i = 2到n,我们看一下pattern[1...i]子串的最长合适前缀是该子串的哪个后缀。

假设我们的模式是"aac",则索引1的失配函数值为0(尚未匹配),索引2的失配函数值为1(与“aa”的最长合适前缀相同的最长合适后缀长度为1)

对于模式"ababac",索引1的失配函数值为0, 索引2的失配函数值为0,索引3的失配函数值为1(因为第三个索引'a'与第一个索引'a'相同),索引4的失配函数值为2(因为在索引1和2处的“ab”与在索引3和4处的“ab”相同),而索引5的失配函数值为3(在索引[1...3]处的“aba”与在索引[3...5]处的“aba”相同)。对于索引6,失配函数值为0。

以下是构建失配函数并使用它匹配文本(或流)的代码(C++):

/* Assuming that string indices start from 1 for both pattern and text. */
/* Firstly build the failure function. */
int s = 1;
int t = 0;  

/* n denotes the length of the pattern */
int *f = new int[n+1];
f[1] = 0;   

for (s = 1; s < n; s++) {
    while (t > 0 && pattern[t + 1] != pattern[s + 1]) {
        t = f[t];
    }
    if (pattern[t + 1] == pattern[s + 1]) {
        t++;
        f[s + 1] = t;
    }
    else {
        f[s + 1] = 0;           
    }
}

/* Now start reading characters from the stream */
int count = 0;
char current_char = stream.next_char();

/* s denotes the index of pattern upto which we have found match in text */
/* initially its 0 i.e. no character has been matched yet. */
s = 0; 
while (current_char != '\0') {

    /* IF previously, we had matched upto a certain number of
       characters, and then failed, we return s to the point
       which is the longest prefix that still might be matched.

       (spaces between string are just for clarity)
       For e.g., if pattern is              "a  b  a  b  a  a" 
       & its failure returning index is     "0  0  1  2  3  1"

       and we encounter 
       the text like :      "x  y  z  a  b  a  b  a  b  a  a" 
              indices :      1  2  3  4  5  6  7  8  9  10 11

       after matching the substring "a  b  a  b  a", starting at
       index 4 of text, we are successful upto index 8  but we fail
       at index 9, the next character at index 9 of text is 'b'
       but in our pattern which should have been 'a'.Thus, the index
       in pattern which has been matched till now is 5 ( a  b  a  b  a)
                                                         1  2  3  4  5
       Now, we see that the failure returning index at index 5 of 
       pattern is 3, which means that the text is still matched upto
       index 3 of pattern (a  b  a), not from the initial starting 
       index 4 of text, but starting from index 6 of text.

       Thus we continue searching assuming that our next starting index
       in text is 6, eventually finding the match starting from index 6
       upto index 11.    

       */
        while (s > 0 && current_char != pattern[s + 1]) {
            s = f[s];
        }
        if (current_char == pattern[s + 1]) s++; /* We test the next character after the currently
                                                    matched portion of pattern with the current 
                                                    character of text , if it matches, we increase
                                                    the size of our matched portion by 1*/
        if (s == n) {
            count++;
        }
        current_char = stream.next_char();
}

printf("Count is %d\n", count);

`

注意:此方法可帮助查找重叠模式出现的计数。例如,在流“ababa”中,单词“aba”出现了两次。

0
他们可能在寻找的是Rabin-Karp或Knuth-Morris-Pratt算法。这两种算法只需要一次遍历,开销非常小。如果模式很大,从速度上来看,它们将是明显的优胜者,因为复杂度是O(stream_length)
Rabbin-Karp算法依赖于哈希,在下一个字符中可以更新O(1)。如果哈希不太好或流非常长(哈希冲突),则可能会给出错误的结果。
Knuth-Morris-Pratt算法依赖于计算每个位置的最长前缀和后缀的长度。这需要O(n)的内存来存储这些结果,但仅此而已。
在维基百科的字符串模式匹配页面上查找更多详细信息和实现方法。

网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接