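I'm trying to efficiently strip punctuation from a Unicode string. For a regular string, mystring.translate(None, string.punctuation) is clearly the fastest method. However, in Python 2.7 this code fails on Unicode strings. As the comments on this answer explain, translate can still be used, but it has to be given a dictionary. With that implementation, though, I find that translate performs dramatically worse. Here is my timing code (mostly copied from this answer):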
import re, string, timeit
import unicodedata
import sys
#String from this article www.wired.com/design/2013/12/find-the-best-of-reddit-with-this-interactive-map/
s = "For me, Reddit brings to mind Obi Wan’s enduring description of the Mos Eisley cantina: a wretched hive of scum and villainy. But, you know, one you still kinda want to hang out in occasionally. The thing is, though, Reddit isn’t some obscure dive bar in a remote corner of the universe—it’s a huge watering hole at the very center of it. The site had some 400 million unique visitors in 2012. They can’t all be Greedos. So maybe my problem is just that I’ve never been able to find the places where the decent people hang out."
su = u"For me, Reddit brings to mind Obi Wan’s enduring description of the Mos Eisley cantina: a wretched hive of scum and villainy. But, you know, one you still kinda want to hang out in occasionally. The thing is, though, Reddit isn’t some obscure dive bar in a remote corner of the universe—it’s a huge watering hole at the very center of it. The site had some 400 million unique visitors in 2012. They can’t all be Greedos. So maybe my problem is just that I’ve never been able to find the places where the decent people hang out."
exclude = set(string.punctuation)
regex = re.compile('[%s]' % re.escape(string.punctuation))
def test_set(s):
    return ''.join(ch for ch in s if ch not in exclude)
def test_re(s): # From Vinko's solution, with fix.
    return regex.sub('', s)
def test_trans(s):
    return s.translate(None, string.punctuation)
tbl = dict.fromkeys(i for i in xrange(sys.maxunicode)
                    if unicodedata.category(unichr(i)).startswith('P'))
def test_trans_unicode(su):
    return su.translate(tbl)
def test_repl(s): # From S.Lott's solution
    for c in string.punctuation:
        s = s.replace(c, "")
    return s
print "sets :",timeit.Timer('f(s)', 'from __main__ import s,test_set as f').timeit(1000000)
print "regex :",timeit.Timer('f(s)', 'from __main__ import s,test_re as f').timeit(1000000)
print "translate :",timeit.Timer('f(s)', 'from __main__ import s,test_trans as f').timeit(1000000)
print "replace :",timeit.Timer('f(s)', 'from __main__ import s,test_repl as f').timeit(1000000)
print "sets (unicode) :",timeit.Timer('f(su)', 'from __main__ import su,test_set as f').timeit(1000000)
print "regex (unicode) :",timeit.Timer('f(su)', 'from __main__ import su,test_re as f').timeit(1000000)
print "translate (unicode) :",timeit.Timer('f(su)', 'from __main__ import su,test_trans_unicode as f').timeit(1000000)
print "replace (unicode) :",timeit.Timer('f(su)', 'from __main__ import su,test_repl as f').timeit(1000000)
My results show that the Unicode implementation of translate performs terribly:
sets : 38.323941946
regex : 6.7729549408
translate : 1.27428412437
replace : 5.54967689514
sets (unicode) : 43.6268708706
regex (unicode) : 7.32343912125
translate (unicode) : 54.0041439533
replace (unicode) : 17.4450061321
My question is whether there is a faster way to implement translate for Unicode (or any other method) that outperforms the regex.
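For reference, the dictionary form of translate can be sketched as below. This is only a minimal illustration, not a measured improvement. The names ascii_punct_table and strip_ascii_punct are made up for this example, and the table covers only string.punctuation rather than every code point in a Unicode 'P' category, so it would leave characters such as the curly apostrophe in place. It is written to run on both Python 2.7 and Python 3.

```python
from __future__ import print_function
import string

# Hypothetical table: map each ASCII punctuation code point to None.
# translate() deletes any character whose code point maps to None.
ascii_punct_table = dict.fromkeys(map(ord, string.punctuation))

def strip_ascii_punct(text):
    """Remove ASCII punctuation from a Unicode string via dict-based translate."""
    return text.translate(ascii_punct_table)

sample = u"But, you know, one you still kinda want to hang out in occasionally."
print(strip_ascii_punct(sample))
```

Whether a smaller table like this changes the timings would have to be checked with the same timeit setup as above.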
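The translate builtin does indeed have very different implementations in stringobject.c and unicodeobject.c. - qwwqwwq
UnicodeEncodeError: 'ascii' codec can't encode characters in position 37-39: ordinal not in range(128). For some reason, the apostrophe Wired displays is not the same character as a single quote. In practice, this would happen to part of my dataset if I tried that approach. - Michael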
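The apostrophe issue in the last comment can be checked directly. The sketch below assumes the character in the Wired text is U+2019 (RIGHT SINGLE QUOTATION MARK), which is what appears in "Wan’s" in the test string. The helper names describe and can_encode_ascii are hypothetical. It runs on both Python 2.7 and Python 3.

```python
from __future__ import print_function
import string
import unicodedata

straight = u"'"       # ASCII apostrophe, U+0027
curly = u"\u2019"     # assumed to be the apostrophe used in the Wired text

def describe(ch):
    """Return (is it in string.punctuation?, its Unicode general category)."""
    return (ch in string.punctuation, unicodedata.category(ch))

def can_encode_ascii(text):
    """Return True if text survives an ASCII encode, False otherwise."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False

print(describe(straight), can_encode_ascii(straight))
print(describe(curly), can_encode_ascii(curly))
```

The curly apostrophe is in a 'P' category but is not in string.punctuation, so the category-based table removes it while the string.punctuation-based tests leave it alone. That means the byte-string and Unicode rows in the timings are not removing exactly the same set of characters.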