How to use Python's built-in map and reduce functions to compute letter frequencies in a string

3
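I would like to compute the frequency of letters in a string using Python's built-in map and reduce functions. Could anyone offer some insight into how I might do this? This is what I have so far: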
s = "the quick brown fox jumped over the lazy dog"

# Map function
m = lambda x: (x,1)

# Reduce
# Add the two frequencies if they are the same
# else.... Not sure how to put both back in the list
# in the case where they are not the same.
r = lambda x,y: (x[0], x[1] + y[1]) if x[0] == y[0] else ????

freq = reduce(r, map(m, s))

This works great when all the letters are the same.

>>> s
'aaaaaaa'
>>> map(m, s)
[('a', 1), ('a', 1), ('a', 1), ('a', 1), ('a', 1), ('a', 1), ('a', 1)]
>>> reduce(r, map(m, s))
('a', 7)

How do I get it to work nicely when there are different letters?

4 Answers

4

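Setting aside the question about your code for a moment, I will point out that one of the usual (and fastest) ways to count things is the Counter class from the collections module. Here is an example of its use in the Python 2.7.3 interpreter: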

>>> from collections import Counter
>>> lets=Counter('aaaaabadfasdfasdfafsdff')
>>> lets
Counter({'a': 9, 'f': 6, 'd': 4, 's': 3, 'b': 1})
>>> s = "the quick brown fox jumped over the lazy dog"
>>> Counter(s)
Counter({' ': 8, 'e': 4, 'o': 4, 'd': 2, 'h': 2, 'r': 2, 'u': 2, 't': 2, 'a': 1, 'c': 1, 'b': 1, 'g': 1, 'f': 1, 'i': 1, 'k': 1, 'j': 1, 'm': 1, 'l': 1, 'n': 1, 'q': 1, 'p': 1, 'w': 1, 'v': 1, 'y': 1, 'x': 1, 'z': 1})
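As a small sketch of how such counts might be used further (assuming Python 3, where the display order differs from the 2.7.3 session above), Counter can be restricted to letters and queried with most_common. The helper name letter_counts is made up for illustration and is not part of the original answer.

```python
from collections import Counter

def letter_counts(text):
    """Count only alphabetic characters, ignoring case (hypothetical helper)."""
    return Counter(ch for ch in text.lower() if ch.isalpha())

s = "the quick brown fox jumped over the lazy dog"
counts = letter_counts(s)

# most_common(n) returns the n highest counts as (item, count) pairs.
print(counts.most_common(2))

# Looking up a key that was never seen returns 0 instead of raising KeyError.
print(counts["e"], counts["!"])
```

Filtering with isalpha() drops the spaces, which would otherwise be the most frequent "letter" in this string.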
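To use reduce, define an auxiliary function addto(oldtotal, newitem) that adds newitem to oldtotal and returns a new total. The initializer for the total is an empty dictionary, {}. Here is an interpreted example. Note that the second parameter to get() is a default value to use when the key is not yet in the dictionary.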
>>> def addto(d,x):
...     d[x] = d.get(x,0) + 1
...     return d
... 
>>> reduce (addto, s, {})
{' ': 8, 'a': 1, 'c': 1, 'b': 1, 'e': 4, 'd': 2, 'g': 1, 'f': 1, 'i': 1, 'h': 2, 'k': 1, 'j': 1, 'm': 1, 'l': 1, 'o': 4, 'n': 1, 'q': 1, 'p': 1, 'r': 2, 'u': 2, 't': 2, 'w': 1, 'v': 1, 'y': 1, 'x': 1, 'z': 1}
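For a version that uses both map and reduce, as the question asks, one possible sketch (assuming Python 3, where reduce must be imported from functools) maps each character to a one-item Counter and lets reduce add them together. This is an illustration rather than part of the original answer. It builds a new Counter at every step, so it will likely be slower than addto.

```python
from collections import Counter
from functools import reduce  # reduce is no longer a builtin in Python 3
from operator import add

s = "the quick brown fox jumped over the lazy dog"

# Map step: each character becomes a Counter holding a single count.
singles = map(Counter, s)

# Reduce step: Counter + Counter sums the counts key by key.
# The empty Counter initializer keeps reduce from failing on an empty string.
freq = reduce(add, singles, Counter())

print(freq["e"], freq[" "])
```

The key point for the original question is that the reduced value has to be a container that can hold every letter seen so far, not a single (letter, count) pair.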
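The code shown below prints the execution times for 1000 passes each of several methods. When executed on an old AMD Athlon 5000+ Linux 3.2.0-32 Ubuntu 12 system with two different strings s, it printed: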
String length is 44   Pass count is 1000
horsch1 : 0.77517914772
horsch2 : 0.778718948364
jreduce : 0.0403778553009
jcounter: 0.0699260234833
String length is 4931   Pass count is 100
horsch1 : 8.25176692009
horsch2 : 8.14318394661
jreduce : 0.260674953461
jcounter: 0.282369852066
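(The reduce method runs slightly faster than the Counter method.) The timing code follows. It uses the timeit module. In the code here, the first parameter to timeit.Timer is the code to be timed repeatedly, and the second parameter is setup code.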
import timeit
from collections import Counter
passes = 1000

m1 = lambda x: [int(ord(x) == i) for i in xrange(65,91)]

def m2(x):
    return [int(ord(x) == i) for i in xrange(65,91)]

def es1(s):
    add = lambda x,y: [x[i]+y[i] for i in xrange(len(x))]
    freq = reduce(add,map(m1, s.upper()))
    return freq

def es2(s):
    add = lambda x,y: [x[i]+y[i] for i in xrange(len(x))]
    freq = reduce(add,map(m2, s.upper()))
    return freq

def addto(d,x):
    d[x] = d.get(x,0) + 1
    return d

def jwc(s):
    return Counter(s)

def jwr(s):
    return reduce (addto, s, {})

s = "the quick brown fox jumped over the lazy dog"
print 'String length is',len(s), '  Pass count is',passes
print "horsch1 :",timeit.Timer('f(s)', 'from __main__ import s, m1,     es1 as f').timeit(passes)
print "horsch2 :",timeit.Timer('f(s)', 'from __main__ import s, m2,     es2 as f').timeit(passes)
print "jreduce :",timeit.Timer('f(s)', 'from __main__ import s, addto,  jwr as f').timeit(passes)
print "jcounter:",timeit.Timer('f(s)', 'from __main__ import s, Counter,jwc as f').timeit(passes)
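On Python 3 the script above needs a few changes: print is a function, xrange is gone, and reduce lives in functools. A minimal sketch of the same timing idea, assuming Python 3.5 or later for the globals parameter of timeit.timeit, might look like the following. The numbers it prints depend on the machine and are not comparable to the figures quoted above.

```python
import timeit
from collections import Counter
from functools import reduce

def addto(d, x):
    d[x] = d.get(x, 0) + 1
    return d

def jwr(s):
    return reduce(addto, s, {})

def jwc(s):
    return Counter(s)

s = "the quick brown fox jumped over the lazy dog"
passes = 1000

results = {}
for name, func in (("jreduce", jwr), ("jcounter", jwc)):
    # The first argument is the statement timed repeatedly;
    # globals supplies the names that statement refers to.
    results[name] = timeit.timeit("f(s)", globals={"f": func, "s": s}, number=passes)
    print(name, ":", results[name])
```

Passing globals replaces the 'from __main__ import ...' setup strings used in the Python 2 script.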

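Your addto solution is nice. I really like it. - Sakara
I was trying to do it with some dirty tricks inside the lambda - I guess thinking outside the box was the better move :) Nice solution, +1. - RocketDonkey
Out of curiosity, how does the efficiency of your addto(d,x) solution compare to the solution I wrote below? - emschorsch
@emschorsch, see the edit. You could make changes to the timed code to see where the time is going. - James Waldby - jwpat7
Wow! Thanks for illustrating how slow my method is. It was hard for me to come up with a method using map and reduce, so I thought my code was good just because it looked fairly concise. But if it is that much slower, conciseness doesn't matter. - emschorsch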

0

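ord() usually gives the ASCII code. My method calculates the frequency of the letters in a list where each index corresponds to the letter at that position in the alphabet. Since the string is converted to upper case, this method is case insensitive.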

s = "the quick brown fox jumped over the lazy dog"

# Map function: one flag per uppercase letter; ord('A') == 65, ord('Z') == 90
m = lambda x: [ord(x) == i for i in xrange(65,91)]

add = lambda x,y: [x[i]+y[i] for i in xrange(len(x))]
freq = reduce(add,map(m, s.upper()))
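A Python 3 sketch of the same idea follows, assuming string.ascii_uppercase as the alphabet and zip for the element-wise sum (neither appears in the original answer). Booleans add as 0 and 1, so no int() conversion is needed.

```python
from functools import reduce
from string import ascii_uppercase

s = "the quick brown fox jumped over the lazy dog"

def m(ch):
    # One flag per letter A..Z; True where ch is that letter.
    return [ch == letter for letter in ascii_uppercase]

def add(x, y):
    # Element-wise sum of two 26-item lists (True counts as 1).
    return [a + b for a, b in zip(x, y)]

# The initializer of 26 zeros makes the result a list of ints
# and keeps reduce from failing on an empty string.
freq = reduce(add, map(m, s.upper()), [0] * 26)

print(dict(zip(ascii_uppercase, freq)))
```

Characters outside A..Z, such as the spaces here, map to a list of all False values and so contribute nothing to the totals.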

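If you use [x == i for i in 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'] in place of [int(ord(x) == i) for i in xrange(65,91)], the running time drops by 2/3. (Also note the missing ']' on the add=... line.) - James Waldby - jwpat7
I didn't know that you could add booleans in Python and get an integer sum. Why is for i in 'ALPHABET' faster than for i in xrange(0,25)? - emschorsch
I don't know the implementation details, but I imagine that iterating over a string may involve less overhead (for example, less context to save). The int(ord(x) == i) part probably matters more. In a compiled language, int(ord(x) == i) and x == i would compile to the same low-level code, but in Python the calls to int and ord take time to execute. - James Waldby - jwpat7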

0

You can also use a defaultdict:

>>> from collections import defaultdict
>>> d = defaultdict(int)
>>> s = 'the quick brown fox jumped over the lazy dog'
>>> for i in s:
...    d[i] += 1
...
>>> for letter,count in d.iteritems():
...    print letter,count
...
  8 # number of spaces
a 1
c 1
b 1
e 4
d 2
g 1
f 1
i 1
h 2
k 1
j 1
m 1
l 1
o 4
n 1
q 1
p 1
r 2
u 2
t 2
w 1
v 1
y 1
x 1
z 1
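The session above is Python 2 (iteritems and the print statement). Under Python 3, a sketch of the same loop could use items() and print(). Sorting the items is an addition made here so that the listing comes out in a predictable order.

```python
from collections import defaultdict

s = "the quick brown fox jumped over the lazy dog"

d = defaultdict(int)  # missing keys start at int(), i.e. 0
for ch in s:
    d[ch] += 1

# Sort by character so the listing is stable across runs.
for letter, count in sorted(d.items()):
    print(repr(letter), count)
```

Printing repr(letter) makes the space character visible as ' ' instead of a blank column.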

0

You can also use the s.count method:

{x: s.count(x) for x in set(s)}
Note that I used set(s) so that the frequency of each letter in the string is computed only once. Here are the test results on my machine:
String length is 44   Pass count is 1000
horsch1  : 0.317646980286
horsch2  : 0.325616121292
jreduce  : 0.0106990337372
jcounter : 0.0142340660095
def_dict : 0.00750803947449
just_dict: 0.00737881660461
s_count  : 0.00887513160706

String length is 4400   Pass count is 100
horsch1  : 3.24123382568
horsch2  : 3.23079895973
jreduce  : 0.0944828987122
jcounter : 0.102299928665
def_dict : 0.0341360569
just_dict: 0.0643239021301
s_count  : 0.0224709510803

Here is the test code:

import timeit
from collections import Counter, defaultdict
passes = 100

m1 = lambda x: [int(ord(x) == i) for i in xrange(65,91)]

def m2(x):
    return [int(ord(x) == i) for i in xrange(65,91)]

def es1(s):
    add = lambda x,y: [x[i]+y[i] for i in xrange(len(x))]
    freq = reduce(add,map(m1, s.upper()))
    return freq

def es2(s):
    add = lambda x,y: [x[i]+y[i] for i in xrange(len(x))]
    freq = reduce(add,map(m2, s.upper()))
    return freq

def addto(d,x):
    d[x] = d.get(x,0) + 1
    return d

def jwc(s):
    return Counter(s)

def jwr(s):
    return reduce (addto, s, {})

def def_dict(s):
    d = defaultdict(int)
    for i in s:
        d[i]+=1
    return d

def just_dict(s):
    freq = {}
    for i in s:
        freq[i]=freq.get(i, 0) + 1
    return freq

def s_count(s):
    return {x: s.count(x) for x in set(s)}

s = "the quick brown fox jumped over the lazy dog"*100
print 'String length is',len(s), '  Pass count is',passes
print "horsch1  :",timeit.Timer('f(s)', 'from __main__ import s, m1,     es1 as f').timeit(passes)
print "horsch2  :",timeit.Timer('f(s)', 'from __main__ import s, m2,     es2 as f').timeit(passes)
print "jreduce  :",timeit.Timer('f(s)', 'from __main__ import s, addto,  jwr as f').timeit(passes)
print "jcounter :",timeit.Timer('f(s)', 'from __main__ import s, Counter,jwc as f').timeit(passes)
print "def_dict :",timeit.Timer('f(s)', 'from __main__ import s, defaultdict, def_dict as f').timeit(passes)
print "just_dict:",timeit.Timer('f(s)', 'from __main__ import s, just_dict as f').timeit(passes)
print "s_count  :",timeit.Timer('f(s)', 'from __main__ import s, s_count as f').timeit(passes)
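Before comparing timings it can be worth confirming that the dictionary-based approaches return the same counts. The following is a sketch for Python 3 (not part of the original benchmark) that re-declares three of the functions and checks them against Counter.

```python
from collections import Counter, defaultdict
from functools import reduce

def addto(d, x):
    d[x] = d.get(x, 0) + 1
    return d

def jwr(s):
    return reduce(addto, s, {})

def def_dict(s):
    d = defaultdict(int)
    for ch in s:
        d[ch] += 1
    return d

def s_count(s):
    # One str.count call per distinct character.
    return {x: s.count(x) for x in set(s)}

s = "the quick brown fox jumped over the lazy dog" * 100
expected = dict(Counter(s))

for func in (jwr, def_dict, s_count):
    # Convert to a plain dict so defaultdict and dict compare alike.
    assert dict(func(s)) == expected, func.__name__
print("all methods agree on", len(expected), "distinct characters")
```

The list-based es1 and es2 functions are left out of the check because they return 26 positional counts and ignore non-letters, so their results are not directly comparable.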
