在一个字符流中统计一个单词出现的次数

Question

在一个字符流中统计一个单词出现的次数

3

面试时有人问我这个问题，虽然我在数据结构和算法方面很擅长，但是这个问题我没有能够解决。不管怎样，这是一个有趣的问题，所以我想分享一下。

问题：你有一个字符流输入，并且你需要计算一个单词出现的次数。你只有一个API可以从流中读取字符，即stream.next_char()，如果没有字符则返回"\0"。

int count_occurrences(Stream stream, String word) {
// you have only one function provided from Stream class that you can use to 
// read one char at a time, no length/size etc.
// stream.next_char() - return "\0" if end
}

输入: "aabckjhabcc" 关键词: "abc" 输出: 2

- Rahul.B

“word” 是如何定义的？我的意思是，在你的示例中，输入流没有分隔符，所以我认为它不包含单词“abc”。 - Iłya Bursov

2

简单（次优）的答案：保持一个环形缓冲区，其中包含最后lengthOfTheWord个字符，并在每次调用next_char()后进行完整的字符串比较。更好答案的提示：是否有子字符串搜索算法只检查文本中递增顺序的字符，而不跳过某些字符？ - j_random_hacker

3

听起来你想要一个DFA来解析字符串并计算访问最终状态的次数。 - G. Bach

1

另一种解决方法是使用Knuth-Morris-Prath算法。但是它需要占用O(n)的内存。 - Ivan Gritsenko

@IvanGritsenko；这就是我们需要找到的简单算法。@j_random_hacker提供的缓冲解决方案对我来说看起来不错。 - Rahul.B

显示剩余2条评论

5个回答

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- Iłya Bursov · Answer 1

最简单的解决方案是使用带有多达 word.length() 个符号的缓冲区：

static int count_occurrences(final Stream stream, final String word) {
    int found = 0;

    char c;
    String tmpWord = "";
    while ((c = stream.next_char()) != 0) {
        tmpWord += c;
        if (tmpWord.length() > word.length()) {
            tmpWord = tmpWord.substring(1);
        }
        if (tmpWord.equals(word)) {
            found++;
        }
    }
    return found;
}

复杂度为O(N*M)，内存占用为O(M)

- PetrosP · Answer 2

int count_occurrences(Stream stream, String word) {
    int occurrence = 0;
    int current_index = 0;
    int word_length = word.length();
    char[] word_chars = word.toCharArray();

    char c = stream.next_char();
    while(c != '\0') {
        if( c == word_chars[current_index] ) {
            current_index++;
            if(current_index >= word_length) {
                occurrence++;
                current_index = 0;
            }
        }
        else {
            current_index = 0;
            if( c == word_chars[current_index] ) {
                current_index++;
            }
        }
        c = stream.next_char();

    }
    return occurrence;
}

- obe · Answer 3

也许是这样的吗？

int count_occurrences(Stream stream, String word) {
    // you have only one function provided from Stream class that you can use to 
    // read one char at a time, no length/size etc.
    // stream.next_char() - return "\0" if end

    List<int> positions = new List<int>();

    int counter = 0;
    while (true) {
        char ch = stream.next_char();
        if (ch == '\0') return counter;

        if (ch == word.charAt(0)) {
            positions.add(0);
        }

        int i = 0;
        while (i < positions.length) {
            int pos = positions[i];

            if (word.charAt(pos) != ch) {
                positions.remove(i);
                continue;
            }

            pos++;
            if (pos == word.length()) {
                positions.remove(i);
                counter++;
                continue;
            }

            positions[i] = pos;
            i++;
        }
    }
}

- saurabh.kg001 · Answer 4

我认为这个问题与使用有限状态计算模型匹配字符串的想法有关。

可以通过使用KMP字符串匹配算法来解决此问题。

KMP算法尝试在文本字符串中查找模式字符串的匹配项，考虑到即使在某些点上发现不匹配，仍然匹配了多少前缀。

为了确定“如果我们在匹配到模式索引i之后遇到不匹配，还能匹配多少前缀”，需要提前构建一个失败函数。（请参考下面的代码以获取构建失败函数值的方法）

对于每个模式索引i，该失败函数将告诉我们，即使在索引i之后遇到不匹配，模式的多少前缀仍然可以匹配。

思路是找出模式的最长合适前缀的长度，该前缀也是其每个子字符串（由1到i索引表示，其中i从1到n）的后缀。

我使用字符串索引从1开始。

因此，任何模式的第一个字符的失败函数值为0。（即到目前为止没有匹配任何字符）。

对于后续字符，对于每个索引i = 2到n，我们看一下pattern[1...i]子串的最长合适前缀是该子串的哪个后缀。

假设我们的模式是"aac"，则索引1的失配函数值为0（尚未匹配），索引2的失配函数值为1（与“aa”的最长合适前缀相同的最长合适后缀长度为1）

对于模式"ababac"，索引1的失配函数值为0，索引2的失配函数值为0，索引3的失配函数值为1（因为第三个索引'a'与第一个索引'a'相同），索引4的失配函数值为2（因为在索引1和2处的“ab”与在索引3和4处的“ab”相同），而索引5的失配函数值为3（在索引[1...3]处的“aba”与在索引[3...5]处的“aba”相同）。对于索引6，失配函数值为0。

以下是构建失配函数并使用它匹配文本（或流）的代码（C++）：

/* Assuming that string indices start from 1 for both pattern and text. */
/* Firstly build the failure function. */
int s = 1;
int t = 0;  

/* n denotes the length of the pattern */
int *f = new int[n+1];
f[1] = 0;   

for (s = 1; s < n; s++) {
    while (t > 0 && pattern[t + 1] != pattern[s + 1]) {
        t = f[t];
    }
    if (pattern[t + 1] == pattern[s + 1]) {
        t++;
        f[s + 1] = t;
    }
    else {
        f[s + 1] = 0;           
    }
}

/* Now start reading characters from the stream */
int count = 0;
char current_char = stream.next_char();

/* s denotes the index of pattern upto which we have found match in text */
/* initially its 0 i.e. no character has been matched yet. */
s = 0; 
while (current_char != '\0') {

    /* IF previously, we had matched upto a certain number of
       characters, and then failed, we return s to the point
       which is the longest prefix that still might be matched.

       (spaces between string are just for clarity)
       For e.g., if pattern is              "a  b  a  b  a  a" 
       & its failure returning index is     "0  0  1  2  3  1"

       and we encounter 
       the text like :      "x  y  z  a  b  a  b  a  b  a  a" 
              indices :      1  2  3  4  5  6  7  8  9  10 11

       after matching the substring "a  b  a  b  a", starting at
       index 4 of text, we are successful upto index 8  but we fail
       at index 9, the next character at index 9 of text is 'b'
       but in our pattern which should have been 'a'.Thus, the index
       in pattern which has been matched till now is 5 ( a  b  a  b  a)
                                                         1  2  3  4  5
       Now, we see that the failure returning index at index 5 of 
       pattern is 3, which means that the text is still matched upto
       index 3 of pattern (a  b  a), not from the initial starting 
       index 4 of text, but starting from index 6 of text.

       Thus we continue searching assuming that our next starting index
       in text is 6, eventually finding the match starting from index 6
       upto index 11.    

       */
        while (s > 0 && current_char != pattern[s + 1]) {
            s = f[s];
        }
        if (current_char == pattern[s + 1]) s++; /* We test the next character after the currently
                                                    matched portion of pattern with the current 
                                                    character of text , if it matches, we increase
                                                    the size of our matched portion by 1*/
        if (s == n) {
            count++;
        }
        current_char = stream.next_char();
}

printf("Count is %d\n", count);

`

注意：此方法可帮助查找重叠模式出现的计数。例如，在流“ababa”中，单词“aba”出现了两次。

- Sorin · Answer 5

他们可能在寻找的是Rabin-Karp或Knuth-Morris-Pratt算法。这两种算法只需要一次遍历，开销非常小。如果模式很大，从速度上来看，它们将是明显的优胜者，因为复杂度是O(stream_length)。

Rabbin-Karp算法依赖于哈希，在下一个字符中可以更新O(1)。如果哈希不太好或流非常长（哈希冲突），则可能会给出错误的结果。

Knuth-Morris-Pratt算法依赖于计算每个位置的最长前缀和后缀的长度。这需要O(n)的内存来存储这些结果，但仅此而已。

在维基百科的字符串模式匹配页面上查找更多详细信息和实现方法。