C语言 - 打印出现最频繁的字符串

Question

C语言 - 打印出现最频繁的字符串

3

这些天我一直在发布一些代码，因为我正在做一个练习，最后似乎完成了，但我发现它不起作用。这个练习要求输入： - N 一个整数，表示要读取的字符串数量 - K 一个整数 - N 个字符串这些字符串可以重复。输出是K个最常见的字符串，按照它们的频率排序（降序）。

例子测试集： 输入：

6
2
mickey
mouse
mickey
hello
mouse
mickey

输出:

mickey // Has freq 3
mouse // Has freq 2

我希望我能够清楚地解释这个练习，这是我的尝试。

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct _stringa {
    char* string;
    int freq;
} stringa;


int compare(const void *elem1, const void *elem2) {
    stringa *first = (stringa *)elem1;
    stringa *second = (stringa *)elem2;

    if (first->freq < second->freq) {
        return -1;
    } else if (first->freq > second->freq) {
        return 1;
    } else {
    return 0;
    }
}

int BinarySearch(stringa** array, char* string, int left, int right) {
    int middle;
    if (left==right) {
        if (strcmp(string,array[left]->string)==0) {
            return left;
        } else {
            return -1;
        }
    }
    middle = (left+right)/2;
    if ((strcmp(string,array[middle]->string)<0) || (strcmp(string,array[middle]->string)==0) ) {
        return BinarySearch(array, string, left, middle);
    } else {
        return BinarySearch(array, string, middle+1, right);
    }

}


int main (void)
{
    char value[101];
    int n = 0;
    int stop;
    scanf("%d", &n); // Number of strings
    scanf("%d", &stop); // number of the most frequent strings to print

    stringa **array = NULL;
    array = malloc ( n * sizeof (struct _stringa *) );

    int i = 0;

    for (i=0; i<n; i++) {

        array[i] = malloc (sizeof (struct _stringa));
        array[i]->string = malloc (sizeof (value)); 

        scanf("%s", value);

        int already;
        already = BinarySearch(array, value, 0, i); // With a binary search, I see if the string is present in the previous positions of the array I am occupying. If it is not present, I copy the string into the array, otherwise, I use the value of binary search (which is the position of the element in the array) and I update the frequency field


        if (already==-1) {
            strcpy(array[i]->string,value); 
            array[i]->freq = 1;
        } else {
            array[already]->freq += 1;
        }

    }


    stringa **newarray = NULL; // New struct array of strings
    newarray = malloc ( n * sizeof (struct _stringa *) );

    int k = 0;
    for (i=0; i<n; i++) { // I use this loop to copy the element that don't have a frequency == 0
        if (array[i]->freq != 0) {
            newarray[k] = malloc(sizeof(struct _stringa));
            newarray[k] = malloc(sizeof(value));
            newarray[k]->string = array[i]->string;
            newarray[k]->freq = array[i]->freq;
            k++;
        }
    }
        qsort(newarray, n, sizeof(stringa*), compare);

        i=0;
        while ((newarray[i]!= NULL) && (i<k)) {
            printf("%s ", newarray[i]->string);
            printf("%d\n", newarray[i]->freq);
            i++;
        }


// Freeing operations        

    while (--n >= 0) {
        if (array[n]->string) free (array[n]->string);
        if (array[n]) free (array[n]);
    }

    if (array) free (array);
    if (newarray) free (newarray);

    return 0;
}

感谢所有有耐心和时间阅读这段代码的人。

编辑：

我忘了补充一下它存在的问题。如果由于调试原因我不使用qsort，并且我使用以下输入作为示例： 5 2 //随机数，我仍然需要完成“打印k个字符串”的部分， hello hello hello hello hello

它会输出： hello 3 (频率) hello 2 (频率)

所以它不能正常工作。正如您在评论中建议的那样，二分搜索存在缺陷，因为它只能在有序列表上运行。我可以每次都对数组进行排序，但我认为这样做是适得其反的。如何摆脱只定位不存在于数组中的字符串的问题呢？

- Roberto

2

@EdHeal 不确定 OP 是否被允许使用 C++。 - The Paramagnetic Croissant

4

你的二分查找使用似乎有误。二分查找仅适用于已排序的列表，而这并不是当前情况。 - Lekensteyn

3

我认为你的代码在二分查找方面存在问题：首先，你将右边界设为 i，但实际上你的数组中可能没有 i 个值（也可能更少），而且你的数组没有排序，因此你的二分查找无法正常工作。 - Holt

3

@user3477950 - 这就是为什么我把它留作评论的原因。 - Ed Heal

2

@Roberto，我建议你在未来隔离出程序中不起作用的部分，并提出更具体的问题。你的代码有点太长了，风险是你的问题会变成一个琐碎的“为什么我的代码不起作用？在这里。” - 0x5C91

显示剩余9条评论

1个回答

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- Rob11311 · Accepted Answer

1

如果您想要一种高效的方法而不需要排序，可以使用哈希表。否则，只需将每个唯一的字符串放入数组中并进行线性扫描，简单可靠。

在现代硬件上，由于缓存和最小化间接引用，这种扫描实际上是快速的。对于少量项目，插入排序实际上比qsort的实践更有效。例如，查看“Tim sort”算法，它是稳定的并避免了qsort在几乎排序数据方面的性能问题，它混合了归并和插入排序以实现n Log n，在真实数据上没有极端情况。

- Rob11311

谢谢你，Rob！我觉得Timsort算法很有趣，感谢你的建议。几天前我开始使用哈希表，这是一个不错的练习。我正在使用它们创建一个数组，其中每个元素指向一个链表。所以你的建议是在这些列表中创建“频率”字段？ - Roberto

1

如果您使用标准的键/值查找函数（即使是POSIX hcreate/hsearch也可以），那么您只需查找键，如果找到则value++，否则value=1。Timsort很复杂，但相关的要点是，当n<=60时，插入排序非常快，并且是对数组进行简单线性搜索的自然扩展，因此如果您必须自己编写每一行代码，而不是打开到外部程序的管道，如“tr [:space:] '\n' | sort”，则是实现“让它工作，然后优化”策略的辅助工具。如果您喜欢哲学来证明它，请参见http://www.jwz.org/doc/worse-is-better.html。 - Rob11311

1

由于Tim排序算法非常复杂且较新（2002年），在计算机科学学位系统编程中，插入排序往往是实践中最好的算法，尽管快速排序具有O(NlogN)的优势；因为它在小量或几乎排序数据上非常快，而快速排序则表现不佳 :) 这个例子非常适合您的哈希表例程，然后您就可以进行O(1)查找，而您的算法是O(N)，而不是由于排序而变成O(NlogN)。在我完成的大多数实际程序中，需要哈希或索引查找，排序则较少出现，除了作为方便的最终用户输出时（当）排序速度不重要。 - Rob11311

哈哈哈，我猜你和德古拉有一个共同点，那就是你也得在晚上工作 :) - Roberto

很遗憾，我不得不让这些新手休息一下。 - Rob11311

显示剩余5条评论