我发现了以下的编程面试问题:
挑战1:N-gram
N-gram是从给定单词中连续N个字符组成的序列。对于单词"pilot",有三个3-gram:"pil"、"ilo"和"lot"。 对于给定的一组单词和n-gram长度, 你的任务是:
• write a function that finds the n-gram that is the most frequent one among all the words
• print the result to the standard output (stdout)
• if there are multiple n-grams having the same maximum frequency please print the one that is the smallest lexicographically (the first one according to the dictionary sorting order)
请注意,您的函数将接收以下参数:
• text
○ which is a string containing words separated by whitespaces
• ngramLength
○ which is an integer value giving the length of the n-gram
数据限制
• the length of the text string will not exceed 250,000 characters
• all words are alphanumeric (they contain only English letters a-z, A-Z and numbers 0-9)
效率限制
• your function is expected to print the result in less than 2 seconds
例子 输入文本:“aaaab a0a baaab c”
输出 aaa ngram长度:3
解释
对于上述输入,按频率排序的3元组如下:
• "aaa" with a frequency of 3
• "aab" with a frequency of 2
• "a0a" with a frequency of 1
• "baa" with a frequency of 1
如果我只有一个小时解决问题,并选择使用C语言解决:是否实现哈希表来计算N元组的频率是个好主意?因为在C库中没有哈希表的实现...
如果是,我想通过使用具有有序链表的分离链接来实现哈希表。这些实现减少了解决问题所需的时间....
这是可能的最快选项吗?
谢谢!!!