Best way to strip punctuation from a string

Question

829

It seems like there should be a simpler way than this:

import string
s = "string. With. Punctuation?"  # Sample string
# Python 2 only: str.translate(table, deletechars)
out = s.translate(string.maketrans("",""), string.punctuation)

Is there?

- Lawrence Johnston

4

I think this is simple and clear. Why would you want to change it? If you want to make it easier, just wrap what you wrote in a function. - Hannes Ovrén

3

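Well, using what feels like a side effect of str.translate to do the work seems a bit clumsy. I thought there might be something better that I had missed, similar to str.strip(chars) but working on the whole string rather than just its boundaries. - Redwood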
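To illustrate the distinction drawn in this comment, here is a minimal sketch (the sample string is made up for illustration): str.strip(chars) trims only the two ends of the string, so punctuation in the middle survives.

```python
import string

# Hypothetical sample string, chosen only for illustration.
sample = "...Hello, world!?"

# str.strip(chars) trims leading and trailing characters only;
# the comma in the middle is left alone.
trimmed = sample.strip(string.punctuation)

print(trimmed)  # Hello, world
```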

64

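It depends on what you mean by punctuation. "The temperature in the O'Reilly & Arbuthnot-Smythe server's main rack is 40.5 degrees." contains exactly one punctuation character, the second period. Take care not to change the meaning. - John Machin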
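One possible way to respect this concern, sketched under a stated assumption: treat a punctuation character as part of a token when it has a word character on both sides, and remove it otherwise. The name strip_edge_punctuation and the regex are illustrative, not taken from the thread. Note that this sketch still drops the standalone "&", which the comment would arguably keep.

```python
import re

# Assumption: punctuation is "inside a token" when it has a word character
# (letter, digit, or underscore) on both sides; everything else is removed.
EDGE_PUNCT = re.compile(r"(?<!\w)[^\w\s]|[^\w\s](?!\w)")

def strip_edge_punctuation(text):
    """Remove punctuation unless it sits between two word characters."""
    return EDGE_PUNCT.sub("", text)

sentence = ("The temperature in the O'Reilly & Arbuthnot-Smythe "
            "server's main rack is 40.5 degrees.")
cleaned = strip_edge_punctuation(sentence)
print(cleaned)
```

The apostrophes, the hyphen, and the decimal point are kept because each sits between two word characters; the final period is removed because nothing follows it.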

43

I'm surprised nobody has mentioned that string.punctuation doesn't include any non-English punctuation at all. I'm thinking of "。", "！", "？", "：", "×", "“", "”", "〟" and so on. - Clément
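A minimal sketch of one way to cover non-ASCII punctuation, assuming that "punctuation" means any character whose Unicode general category starts with "P". The function name is hypothetical. Symbols such as "×" or "$" fall under the "S" categories and are left untouched by this check, so it is not a drop-in replacement for string.punctuation.

```python
import unicodedata

def strip_unicode_punctuation(text):
    """Drop every character whose Unicode general category starts with 'P'."""
    return "".join(ch for ch in text
                   if not unicodedata.category(ch).startswith("P"))

print(strip_unicode_punctuation("你好，世界！"))      # 你好世界
print(strip_unicode_punctuation("“Hello”, world?"))  # Hello world
```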

2

@JohnMachin You forgot that ' ' is punctuation. - Wayne Werner

32 Answers

Content provided by Stack Overflow.

- David Vuong · Answer 1

This may not be the best solution, but this is how I did it.

import string
f = lambda x: ''.join([i for i in x if i not in string.punctuation])

- Tim P · Answer 2

Here is a one-liner for Python 3.5:

import string
"l*ots! o(f. p@u)n[c}t]u[a'ti\"on#$^?/".translate(str.maketrans({a:None for a in string.punctuation}))

- Dr.Tautology · Answer 3

Here is a function I wrote. It is not very efficient, but it is simple, and you can add or remove whatever punctuation you like:

def stripPunc(wordList):
    """Strips punctuation from a list of words."""
    puncList = [".", ";", ":", "!", "?", "/", "\\", ",", "#", "@", "$", "&", ")", "(", "\""]
    for punc in puncList:
        # Remove this punctuation character from every word in the list
        wordList = [word.replace(punc, '') for word in wordList]
    return wordList

- Haythem HADHAB · Answer 4

import re
s = "string. With. Punctuation?" # Sample string 
out = re.sub(r'[^a-zA-Z0-9\s]', '', s)
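One caveat worth sketching: the character class above keeps only ASCII letters and digits, so accented letters are removed along with the punctuation. The comparison below uses a made-up sample string and the `[^\w\s]` pattern that appears in a later answer; `\w` is Unicode-aware for str patterns in Python 3 and also keeps underscores.

```python
import re

# Hypothetical sample with accented letters, written with escapes so the
# precomposed code points are unambiguous: "café, naïve!"
sample = "caf\u00e9, na\u00efve!"

ascii_only = re.sub(r'[^a-zA-Z0-9\s]', '', sample)
unicode_aware = re.sub(r'[^\w\s]', '', sample)

print(ascii_only)     # caf nave
print(unicode_aware)  # café naïve
```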

- krinker · Answer 5

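As an update, I rewrote @Brian's example in Python 3 and changed it so that the regex compile step happens inside the function. My idea was to time every step needed to make the function work. Perhaps you are using distributed computing, cannot share a regex object between workers, and need to run re.compile in each worker. I was also curious to time two different ways of calling maketrans in Python 3.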

table = str.maketrans({key: None for key in string.punctuation})

vs

table = str.maketrans('', '', string.punctuation)
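As a quick sanity check, the two calls can be compared directly. In this sketch (variable names are illustrative) both produce the same mapping from code points to None, so any timing difference presumably comes from building the table rather than from the translation itself.

```python
import string

table_from_dict = str.maketrans({key: None for key in string.punctuation})
table_from_args = str.maketrans('', '', string.punctuation)

# Both calls return a dict mapping each punctuation code point to None.
print(table_from_dict == table_from_args)  # True
```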

I also added a method that uses a set, taking the intersection to reduce the number of iterations.

Here is the complete code:

import re, string, timeit

s = "string. With. Punctuation"


def test_set(s):
    exclude = set(string.punctuation)
    return ''.join(ch for ch in s if ch not in exclude)


def test_set2(s):
    _punctuation = set(string.punctuation)
    for punct in set(s).intersection(_punctuation):
        s = s.replace(punct, ' ')
    return ' '.join(s.split())


def test_re(s):  # From Vinko's solution, with fix.
    regex = re.compile('[%s]' % re.escape(string.punctuation))
    return regex.sub('', s)


def test_trans(s):
    table = str.maketrans({key: None for key in string.punctuation})
    return s.translate(table)


def test_trans2(s):
    table = str.maketrans('', '', string.punctuation)
    return s.translate(table)


def test_repl(s):  # From S.Lott's solution
    for c in string.punctuation:
        s = s.replace(c, "")
    return s


print("sets      :",timeit.Timer('f(s)', 'from __main__ import s,test_set as f').timeit(1000000))
print("sets2      :",timeit.Timer('f(s)', 'from __main__ import s,test_set2 as f').timeit(1000000))
print("regex     :",timeit.Timer('f(s)', 'from __main__ import s,test_re as f').timeit(1000000))
print("translate :",timeit.Timer('f(s)', 'from __main__ import s,test_trans as f').timeit(1000000))
print("translate2 :",timeit.Timer('f(s)', 'from __main__ import s,test_trans2 as f').timeit(1000000))
print("replace   :",timeit.Timer('f(s)', 'from __main__ import s,test_repl as f').timeit(1000000))

Here are my results:

sets      : 3.1830138750374317
sets2      : 2.189873124472797
regex     : 7.142953420989215
translate : 4.243278483860195
translate2 : 2.427158243022859
replace   : 4.579746678471565
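The timings above deliberately include building the translation table or compiling the regex on every call. When sharing objects between workers is not a constraint, a common alternative is to build them once at module level. The following is a minimal sketch of that variant; the names _PUNCT_TABLE, _PUNCT_REGEX, strip_translate, and strip_regex are hypothetical, and no timings are claimed for it.

```python
import re
import string

# Built once at import time instead of on every call.
_PUNCT_TABLE = str.maketrans('', '', string.punctuation)
_PUNCT_REGEX = re.compile('[%s]' % re.escape(string.punctuation))

def strip_translate(s):
    """Delete ASCII punctuation using the prebuilt translation table."""
    return s.translate(_PUNCT_TABLE)

def strip_regex(s):
    """Delete ASCII punctuation using the precompiled regex."""
    return _PUNCT_REGEX.sub('', s)

s = "string. With. Punctuation"
print(strip_translate(s))  # string With Punctuation
print(strip_regex(s))      # string With Punctuation
```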

- Dom Grey · Answer 6

For cases that are not very strict, a short one-liner may help:

''.join([c for c in s if c.isalnum() or c.isspace()])

- Pablo Rodriguez Bertorello · Answer 7

>>> import re
>>> s = "string. With. Punctuation?"
>>> s = re.sub(r'[^\w\s]','',s)
>>> re.split(r'\s+', s)
['string', 'With', 'Punctuation']

- aloha · Answer 8

I was looking for a really simple solution. Here is what I came up with:

import re

s = "string. With. Punctuation?"
# \W already matches whitespace, so the \s in the class is redundant
s = re.sub(r'[\W\s]', ' ', s)

print(repr(s))
# 'string  With  Punctuation '

- ngub05 · Answer 9

Here is a solution that does not need a regex.

import string

input_text = "!where??and!!or$$then:)"
# Python 3: maketrans is a static method of str, and print is a function
punctuation_replacer = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
print(' '.join(input_text.translate(punctuation_replacer).split()).strip())

Output>> where and or then

Replace punctuation with spaces
Replace multiple spaces between words with a single space
Use strip() to remove trailing spaces, if any

- Dehua Li · Answer 10

Why does none of you use this?

 ''.join(filter(str.isalnum, s))

Is it too slow?
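Speed aside, one behavioral difference is worth noting: str.isalnum is False for whitespace, so this filter removes spaces as well as punctuation. A small sketch, using the sample string from the question and the isalnum/isspace test from Dom Grey's answer for comparison:

```python
s = "string. With. Punctuation?"

# str.isalnum is False for spaces, so they are filtered out with the punctuation.
no_spaces = ''.join(filter(str.isalnum, s))
print(no_spaces)  # stringWithPunctuation

# Keeping whitespace needs an extra test, as in the isalnum/isspace answer.
with_spaces = ''.join(ch for ch in s if ch.isalnum() or ch.isspace())
print(with_spaces)  # string With Punctuation
```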