在字符串中找到最常见的字符对

Question

在字符串中找到最常见的字符对

5

我已经写了以下函数。

//O(n^2)
void MostCommonPair(char * cArr , char * ch1 , char * ch2 , int * amount)
{
    int count , max = 0;
    char cCurrent , cCurrent2;
    int i = 0 , j;
    while(*(cArr + i + 1) != '\0')
    {
        cCurrent = *(cArr + i);
        cCurrent2 = *(cArr + i + 1);
        for(j = i , count = 0 ; *(cArr + j + 1) != '\0' ; j++)
        {
            if(cCurrent ==  *(cArr + j) && cCurrent2 ==  *(cArr + j + 1))
            {
                count++;
            }
        }
        if(count > max)
        {
            *ch1 = cCurrent;
            *ch2 = cCurrent2;
            max = *amount = count;
        }
        i++;
    }
}

对于以下输入：

"xdshahaalohalobscxbsbsbs"

ch1 = b ch2 = s amount = 4

但是在我看来，该函数非常低效，是否有一种方法只遍历一次字符串或将运行时间降至O(n)？

- Daniel Jakobsen Hallel

1

请注意，OP正在寻找具有最高计数的连续字符对。 - hatchet - done with SOverflow

4个回答

4

如果您想要O(n)的运行时间，可以使用哈希表（例如，Java的HashMap）。

一次迭代字符串，每次迭代一个字符 O(n)
对于每个访问的字符，向前查找一个字符（这就是你的字符对 - 只需将它们连接起来）O(1)
对于找到的每个字符对，在哈希表中首先查找它：O(1)
- 如果它尚未在哈希表中，则将其添加为字符对键，并将int 1作为值（这计算了您在字符串中看到它的次数）。O(1)
- 如果它已经在哈希表中，则将其值加1 O(1)
在查看完字符串后，检查哈希表中计数最高的字符对。 O(m)（其中m是可能的配对数量；m <= n必然）

- sampson-chen

1

是的，您可以通过保持运行计数来以大约线性时间完成此操作。

这有帮助吗？

- Useless

0

假设你所说的“常见对”是指最常见的两个连续字符

在伪代码层面，您想要

 Read the first character into the "second character" register
 while(there is data)
    store the old second character as the new first character
    read the next character as the second one
    increment the count associated with this pair
 Select the most common pair

所以你需要的是一种高效的算法来存储和计数与字符对相关的数据，并找到最常见的一个。

- dmckee --- ex-moderator kitten

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- Sergey Kalinichenko · Accepted Answer

自从char可以容纳256个值，你可以设置一个256*256的二维表格，运行一遍字符串，递增对应于字符串中每个字符对的计数器。然后，您可以浏览256x256数字的表格，选择最大计数，并通过查看它在2D数组中的位置来知道它属于哪个对。由于计数器表的大小固定为独立于字符串长度的常量值，因此该操作是O（1），即使它需要两个嵌套循环。请保留HTML标记。

int count[256][256];
memset(count, 0, sizeof(count));
const char *str = "xdshahaalohalobscxbsbsbs";
for (const char *p = str ; *(p+1) ; p++) {
    count[(int)*p][(int)*(p+1)]++;
}
int bestA = 0, bestB = 0;
for (int i = 0 ; i != 256 ; i++) {
    for (int j = 0 ; j != 256 ; j++) {
        if (count[i][j] > count[bestA][bestB]) {
            bestA = i;
            bestB = j;
        }
    }
}
printf("'%c%c' : %d times\n", bestA, bestB, count[bestA][bestB]);

这里有一个关于ideone演示的链接。

请记住，尽管这是渐进最快的解决方案（即它是O(N)，你不能使它比O(N)更快），但对于较短的字符串，性能不会很好。实际上，在大约256个字符以下的输入中，您的解决方案将毫无疑问地击败它，甚至可能更多。有许多优化可以应用于此代码，但我决定不添加它们，以便保持代码的主要思想在其最纯净和最简单的形式下清晰可见。