Given a corpus/text like this:
Resumption of the session
I declare resumed the session of the European Parliament adjourned on Friday 17 December 1999 , and I would like once again to wish you a happy new year in the hope that you enjoyed a pleasant festive period .
Although , as you will have seen , the dreaded ' millennium bug ' failed to materialise , still the people in a number of countries suffered a series of natural disasters that truly were dreadful .
You have requested a debate on this subject in the course of the next few days , during this part @-@ session .
In the meantime , I should like to observe a minute ' s silence , as a number of Members have requested , on behalf of all the victims concerned , particularly those of the terrible storms , in the various countries of the European Union .
I can do this to get a word-frequency dictionary:
>>> from collections import Counter
>>> word_freq = Counter()
>>> for line in text.split('\n'):
...     for word in line.split():
...         word_freq[word] += 1
...
But if the aim is an ordered dictionary that goes from highest to lowest frequency, I have to do this:
>>> from collections import OrderedDict
>>> sorted_word_freq = OrderedDict()
>>> for word, freq in word_freq.most_common():
...     sorted_word_freq[word] = freq
...
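For reference, the same loop can be written more compactly by passing the (word, freq) pairs straight to the OrderedDict constructor. This is only a sketch: sample_text is a made-up stand-in for the corpus above, and I have not checked whether it is any faster at scale.

```python
from collections import Counter, OrderedDict

# Hypothetical stand-in for the corpus; not the real data.
sample_text = "the session of the parliament\nthe session resumed"

word_freq = Counter(
    word for line in sample_text.split('\n') for word in line.split()
)

# most_common() yields (word, freq) pairs from highest to lowest count,
# and OrderedDict keeps them in that insertion order.
sorted_word_freq = OrderedDict(word_freq.most_common())
```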
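Assuming that I have 1 billion keys in the Counter object, iterating through most_common() would have the complexity of going through the corpus (non-unique instances) once and the vocabulary (unique keys) once.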
Note: Counter.most_common() calls an ad-hoc sorted(), see https://hg.python.org/cpython/file/e38470b49d3c/Lib/collections.py#l472
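To make that cost concrete: as far as I can tell from the linked source, calling most_common() with no argument amounts to a single sorted() over the items, keyed on the count. The sketch below checks that on toy counts; the variable names are mine, for illustration only.

```python
from collections import Counter
from operator import itemgetter

word_freq = Counter("a b a c b a".split())  # toy counts for illustration

# One sort over the unique keys, ordered by count from highest to lowest.
manual = sorted(word_freq.items(), key=itemgetter(1), reverse=True)

# Compare against what most_common() returns when given no argument.
same_as_most_common = (manual == word_freq.most_common())
```

So the sort itself touches only the vocabulary, not the whole corpus.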
Given this, I have seen the following code that uses numpy.argsort():
>>> import numpy as np
>>> words = list(word_freq.keys())    # list() so the keys can be indexed
>>> freqs = list(word_freq.values())  # same order as words
>>> sorted_word_index = np.argsort(freqs)  # lowest to highest
>>> sorted_word_freq_with_numpy = OrderedDict()
>>> for idx in reversed(sorted_word_index):
...     sorted_word_freq_with_numpy[words[idx]] = freqs[idx]
...
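Which is faster?
Is there any other, faster way to get such an OrderedDict from a Counter?
Other than OrderedDict, are there other Python objects that achieve the same sorted key-value pairs?
Assume that memory is not an issue. Given 120 GB of RAM, keeping 1 billion key-value pairs should not be much of a problem, assuming an average of 20 characters per key and a single integer per value.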
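As a rough sanity check of that memory assumption, here is a back-of-the-envelope sketch using sys.getsizeof. The numbers are CPython-specific and vary by version and platform, and the 20-character ASCII key and the sample count are just my assumptions from above.

```python
import sys

n_keys = 10**9                       # assumed number of unique words
key_bytes = sys.getsizeof("x" * 20)  # one 20-character ASCII str object
val_bytes = sys.getsizeof(10**6)     # one int object (a sample count)

# Rough figure: this counts object sizes only. The dict's own hash table
# adds more, while small ints are shared by CPython, which subtracts some.
estimate_gb = n_keys * (key_bytes + val_bytes) / 1e9
print(f"roughly {estimate_gb:.0f} GB for the key and value objects")
```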