Fastest way to strip punctuation from a unicode string in Python

21

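I am trying to efficiently strip punctuation from a unicode string. With a regular string, mystring.translate(None, string.punctuation) is clearly the fastest approach. However, this code breaks on a unicode string in Python 2.7. As explained in the comments to this answer, the translate method can still be used, but it must be implemented with a dictionary. When I use this implementation, though, the performance of translate drops dramatically. Here is my timing code (copied primarily from this answer):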

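# -*- coding: utf-8 -*-
# The encoding declaration above is needed in Python 2 because the test
# strings below contain non-ASCII characters.
import re, string, timeit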
import unicodedata
import sys


#String from this article www.wired.com/design/2013/12/find-the-best-of-reddit-with-this-interactive-map/

s = "For me, Reddit brings to mind Obi Wan’s enduring description of the Mos Eisley cantina: a wretched hive of scum and villainy. But, you know, one you still kinda want to hang out in occasionally. The thing is, though, Reddit isn’t some obscure dive bar in a remote corner of the universe—it’s a huge watering hole at the very center of it. The site had some 400 million unique visitors in 2012. They can’t all be Greedos. So maybe my problem is just that I’ve never been able to find the places where the decent people hang out."
su = u"For me, Reddit brings to mind Obi Wan’s enduring description of the Mos Eisley cantina: a wretched hive of scum and villainy. But, you know, one you still kinda want to hang out in occasionally. The thing is, though, Reddit isn’t some obscure dive bar in a remote corner of the universe—it’s a huge watering hole at the very center of it. The site had some 400 million unique visitors in 2012. They can’t all be Greedos. So maybe my problem is just that I’ve never been able to find the places where the decent people hang out."


exclude = set(string.punctuation)
regex = re.compile('[%s]' % re.escape(string.punctuation))

def test_set(s):
    return ''.join(ch for ch in s if ch not in exclude)

def test_re(s):  # From Vinko's solution, with fix.
    return regex.sub('', s)

def test_trans(s):
    return s.translate(None, string.punctuation)

tbl = dict.fromkeys(i for i in xrange(sys.maxunicode)
                      if unicodedata.category(unichr(i)).startswith('P'))

def test_trans_unicode(su):
    return su.translate(tbl)

def test_repl(s):  # From S.Lott's solution
    for c in string.punctuation:
        s=s.replace(c,"")
    return s

print "sets      :",timeit.Timer('f(s)', 'from __main__ import s,test_set as f').timeit(1000000)
print "regex     :",timeit.Timer('f(s)', 'from __main__ import s,test_re as f').timeit(1000000)
print "translate :",timeit.Timer('f(s)', 'from __main__ import s,test_trans as f').timeit(1000000)
print "replace   :",timeit.Timer('f(s)', 'from __main__ import s,test_repl as f').timeit(1000000)

print "sets (unicode)      :",timeit.Timer('f(su)', 'from __main__ import su,test_set as f').timeit(1000000)
print "regex (unicode)     :",timeit.Timer('f(su)', 'from __main__ import su,test_re as f').timeit(1000000)
print "translate (unicode) :",timeit.Timer('f(su)', 'from __main__ import su,test_trans_unicode as f').timeit(1000000)
print "replace (unicode)   :",timeit.Timer('f(su)', 'from __main__ import su,test_repl as f').timeit(1000000)

As my results show, the unicode implementation of translate performs horribly:

sets      : 38.323941946
regex     : 6.7729549408
translate : 1.27428412437
replace   : 5.54967689514

sets (unicode)      : 43.6268708706
regex (unicode)     : 7.32343912125
translate (unicode) : 54.0041439533
replace (unicode)   : 17.4450061321

My question is whether there is a faster way to implement translate for unicode (or any other method) that would outperform regex.
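For readers on Python 3 (an assumption; the question itself targets Python 2.7), every str is unicode and str.translate always takes a mapping, so the dictionary form is the only one available. A minimal sketch of the table built above, using the hypothetical names PUNCT_TABLE and strip_punctuation:

```python
import sys
import unicodedata

# Map every code point whose Unicode category starts with 'P'
# (punctuation) to None, which tells str.translate to delete it.
# PUNCT_TABLE and strip_punctuation are illustrative names only.
PUNCT_TABLE = dict.fromkeys(
    i for i in range(sys.maxunicode + 1)
    if unicodedata.category(chr(i)).startswith('P')
)

def strip_punctuation(text):
    """Return text with all Unicode punctuation (category P*) removed."""
    return text.translate(PUNCT_TABLE)
```

Note that this table does not cover everything in string.punctuation: characters such as $, + and = belong to the Unicode symbol categories (S*), not the punctuation categories (P*), so this function leaves them in place.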


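A quick look at the C source shows that the translate builtin does indeed have very different implementations in stringobject.c and unicodeobject.c. - qwwqwwq
You could probably speed it up. There are a bunch of function calls that exist mainly for clarity and could be inlined. The main problem with the implementation is that the unicode character space is much denser and the number of possible replacements is much larger (your "tbl" contains 585 characters), which necessitates the mapping strategy used in "unicodeobject". Is the regex approach too slow? - beerbajay
I wasn't even considering the C implementation; I was just wondering whether any Python code could outperform the regex approach. The regex approach is serviceable and is what I currently have implemented, but it is five times slower, and I have a lot of text to process, so I figured it couldn't hurt to ask. - Michael
If you first convert the unicode string to a regular string and then use translate on the converted string, what is the total time? - Dyrborg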
1
def test_trans_unicode_convert(su): return str(su).translate(None, string.punctuation) raises an error: UnicodeEncodeError: 'ascii' codec can't encode characters in position 37-39: ordinal not in range(128). For some reason, the apostrophe that Wired uses is different from a single quote. In practice, this would happen with part of my data set if I tried this approach. - Michael
1 Answer

6
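The current test script is flawed because it does not compare like with like. For a fairer comparison, all the functions must be run with the same set of punctuation characters (i.e. either all ASCII or all unicode). When that is done, the regex and replace methods fare much worse with the full set of unicode punctuation characters. For full unicode, the "set" method looks like the best. However, if you only want to remove the ASCII punctuation characters from unicode strings, it may be best to encode, translate and decode (depending on the length of the input string). The "replace" method can also be substantially improved by doing a containment test before attempting replacements (depending on the precise make-up of the string). Here are some sample results from a re-hash of the test script: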
$ python2 test.py
running ascii punctuation test...
using byte strings...

set: 0.862006902695
re: 0.17484498024
trans: 0.0207080841064
enc_trans: 0.0206489562988
repl: 0.157525062561
in_repl: 0.213351011276

$ python2 test.py a
running ascii punctuation test...
using unicode strings...

set: 0.927773952484
re: 0.18892288208
trans: 1.58275294304
enc_trans: 0.0794939994812
repl: 0.413739919662
in_repl: 0.249747991562

$ python2 test.py u
running unicode punctuation test...
using unicode strings...

set: 0.978360176086
re: 7.97941994667
trans: 1.72471117973
enc_trans: 0.0784001350403
repl: 7.05612301826
in_repl: 3.66821289062
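The encode, translate and decode route can be sketched for Python 3 as well (an assumption; the timings above are from Python 2). The names _ASCII_PUNCT and strip_ascii_punct are hypothetical. The approach relies on the fact that every byte of a multi-byte UTF-8 sequence is 0x80 or above, so deleting ASCII punctuation bytes cannot corrupt other characters:

```python
import string

# ASCII punctuation as bytes, for use as the "delete" argument below.
_ASCII_PUNCT = string.punctuation.encode('ascii')

def strip_ascii_punct(text):
    """Remove ASCII punctuation only; non-ASCII punctuation is kept."""
    # bytes.translate(None, delete) deletes the listed byte values
    # without applying any translation table.
    return text.encode('utf-8').translate(None, _ASCII_PUNCT).decode('utf-8')
```

Because only ASCII punctuation is deleted, a curly apostrophe such as the one in "isn’t" survives. The same holds for the enc_trans figure in the unicode punctuation run above, which still removes only string.punctuation and so is not directly comparable with the other rows of that run.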

Here is the re-hashed script:

# -*- coding: utf-8 -*-

import re, string, timeit
import unicodedata
import sys


#String from this article www.wired.com/design/2013/12/find-the-best-of-reddit-with-this-interactive-map/

s = """For me, Reddit brings to mind Obi Wan’s enduring description of the Mos
Eisley cantina: a wretched hive of scum and villainy. But, you know, one you
still kinda want to hang out in occasionally. The thing is, though, Reddit
isn’t some obscure dive bar in a remote corner of the universe—it’s a huge
watering hole at the very center of it. The site had some 400 million unique
visitors in 2012. They can’t all be Greedos. So maybe my problem is just that
I’ve never been able to find the places where the decent people hang out."""

su = u"""For me, Reddit brings to mind Obi Wan’s enduring description of the
Mos Eisley cantina: a wretched hive of scum and villainy. But, you know, one
you still kinda want to hang out in occasionally. The thing is, though,
Reddit isn’t some obscure dive bar in a remote corner of the universe—it’s a
huge watering hole at the very center of it. The site had some 400 million
unique visitors in 2012. They can’t all be Greedos. So maybe my problem is
just that I’ve never been able to find the places where the decent people
hang out."""

def test_trans(s):
    return s.translate(tbl)

def test_enc_trans(s):
    s = s.encode('utf-8').translate(None, string.punctuation)
    return s.decode('utf-8')

def test_set(s): # with list comprehension fix
    return ''.join([ch for ch in s if ch not in exclude])

def test_re(s):  # From Vinko's solution, with fix.
    return regex.sub('', s)

def test_repl(s):  # From S.Lott's solution
    for c in punc:
        s = s.replace(c, "")
    return s

def test_in_repl(s):  # From S.Lott's solution, with fix
    for c in punc:
        if c in s:
            s = s.replace(c, "")
    return s

txt = 'su'
ptn = u'[%s]'

if 'u' in sys.argv[1:]:
    print 'running unicode punctuation test...'
    print 'using unicode strings...'
    punc = u''
    tbl = {}
    for i in xrange(sys.maxunicode):
        char = unichr(i)
        if unicodedata.category(char).startswith('P'):
            tbl[i] = None
            punc += char
else:
    print 'running ascii punctuation test...'
    punc = string.punctuation
    if 'a' in sys.argv[1:]:
        print 'using unicode strings...'
        punc = punc.decode()
        tbl = {ord(ch):None for ch in punc}
    else:
        print 'using byte strings...'
        txt = 's'
        ptn = '[%s]'
        def test_trans(s):
            return s.translate(None, punc)
        test_enc_trans = test_trans

exclude = set(punc)
regex = re.compile(ptn % re.escape(punc))

def time_func(func, n=10000):
    timer = timeit.Timer(
        'func(%s)' % txt,
        'from __main__ import %s, test_%s  as func' % (txt, func))
    print '%s: %s' % (func, timer.timeit(n))

print
time_func('set')
time_func('re')
time_func('trans')
time_func('enc_trans')
time_func('repl')
time_func('in_repl')
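Before timing anything, it may be worth confirming that the candidates really are compared like with like, i.e. that they return identical output for the same punctuation set. A minimal Python 3 sketch of such a check follows (an assumption; the script above is Python 2, and the names PUNC, strip_* and all_agree are hypothetical):

```python
import re
import string

PUNC = string.punctuation              # swap in a larger set to compare
EXCLUDE = set(PUNC)
REGEX = re.compile('[%s]' % re.escape(PUNC))
TABLE = {ord(ch): None for ch in PUNC}

def strip_set(s):
    # Keep only the characters that are not in the exclusion set.
    return ''.join([ch for ch in s if ch not in EXCLUDE])

def strip_re(s):
    return REGEX.sub('', s)

def strip_trans(s):
    return s.translate(TABLE)

def strip_in_repl(s):
    # Containment test first, so replace() only runs when needed.
    for c in PUNC:
        if c in s:
            s = s.replace(c, '')
    return s

def all_agree(s):
    """True if every method returns the same result for s."""
    funcs = (strip_set, strip_re, strip_trans, strip_in_repl)
    return len({f(s) for f in funcs}) == 1
```

If all_agree returns False for a sample of the real data, the methods are not removing the same characters and their timings should not be compared directly.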

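+1 @ekhumoro, I couldn't sleep :-) I was thinking about your comment and you are right: the question was not framed correctly and your answer is correct (in my opinion), so I deleted mine. - Roberto
@RobertoSánchez. Thanks! Hope it doesn't give you nightmares ;-) - ekhumoro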
