Best way to strip punctuation from a string

Question

829

It seems like there should be a simpler way than this:

import string
s = "string. With. Punctuation?" # Sample string 
out = s.translate(string.maketrans("",""), string.punctuation)

Is there?

- Lawrence Johnston

4

This looks simple and clear to me. Why do you want to change it? If you want it to be easier, just wrap what you wrote in a function. - Hannes Ovrén

3

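Well, it seemed kind of hackish to rely on what looks like a side effect of str.translate to do the work. I thought there might be something more like str.strip(chars) that works on the whole string instead of just the ends, and that I had missed it. - Redwood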

64

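It depends on what you mean by punctuation. "The temperature in the O'Reilly & Arbuthnot-Smythe server's main rack is 40.5 degrees." contains exactly one punctuation character, the second period. Take care not to change the meaning. - John Machin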

43

I'm surprised nobody has mentioned that string.punctuation doesn't include non-English punctuation at all. I'm thinking of 。, ！, ？, ：, ×, “, ”, 〟 and so on. - Clément

2

@JohnMachin You're forgetting that ' ' is punctuation. - Wayne Werner

32 Answers

200

Regular expressions are simple enough, if you know them.

import re
s = "string. With. Punctuation?"
s = re.sub(r'[^\w\s]','',s)

- Eratosthenes

4

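This replaces every character that is not a word character or whitespace with an empty string. Be careful, though: \w usually matches the underscore as well. - Matthias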
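To illustrate the underscore caveat, here is a small sketch. The sample string and variable names are made up for this example, and adding `_` as an explicit alternative is just one possible way to strip it too:

```python
import re

s = "snake_case, with. punctuation?"

# \w matches letters, digits and the underscore, so "_" survives.
kept_underscore = re.sub(r'[^\w\s]', '', s)

# Adding "_" as an explicit alternative removes it as well.
no_underscore = re.sub(r'[^\w\s]|_', '', s)

print(kept_underscore)  # snake_case with punctuation
print(no_underscore)    # snakecase with punctuation
```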

6

@SIslam I think it will work with the unicode flag set, e.g. s = re.sub(r'[^\w\s]', '', s, flags=re.UNICODE). Testing with Python 3 on Linux, it works even without the flag for the Tamil letters தமிழ். - Matthias

1

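@Matthias I tried the code with Python 3.6.5 on a Mac. The Tamil output looks a little different from the input: தமிழ் becomes தமழ. I have no knowledge of Tamil, so I'm not sure whether that is expected. - shiouming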

1

@Matthias With Unicode Bengali text it gets confused at word boundaries and produces wrong words, with or without the UNICODE flag. - hafiz031
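The Tamil result reported in the comments above can probably be explained by Unicode categories: in Python 3's re module, \w covers letters, digits and the underscore, but not combining marks (categories starting with M), so vowel signs are stripped along with punctuation. A minimal sketch to inspect this, with the word written as escapes so the code points stay visible:

```python
import re
import unicodedata

word = "\u0ba4\u0bae\u0bbf\u0bb4\u0bcd"  # the Tamil word from the comments

# Show the general category of every code point in the word.
for ch in word:
    print("U+%04X" % ord(ch), unicodedata.category(ch))

# Code points in a mark category (M*) are not matched by \w,
# so the negated character class removes them.
stripped = re.sub(r'[^\w\s]', '', word)
marks = [ch for ch in word if unicodedata.category(ch).startswith('M')]
print(len(word), len(stripped), len(marks))
```

If the marks need to survive, filtering on the punctuation category with unicodedata (as in another answer in this thread) may be a better fit than `[^\w\s]`.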

90

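For convenience, I've summarized how to strip punctuation from a string in both Python 2 and Python 3. Please refer to the other answers for detailed descriptions.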

Python 2

import string

s = "string. With. Punctuation?"
table = string.maketrans("","")
new_s = s.translate(table, string.punctuation)      # Output: string without punctuation

Python 3

import string

s = "string. With. Punctuation?"
table = str.maketrans(dict.fromkeys(string.punctuation))  # OR {key: None for key in string.punctuation}
new_s = s.translate(table)                          # Output: string without punctuation

- SparkAndShine

Interestingly, this solution (especially the OR {key: None for ...} option) lets you control what replaces the punctuation, which could be a space (for that, use key: " " instead of key: None). - Pablo
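A short sketch of the replace-with-space variant mentioned in the comment, for Python 3. The whitespace-collapsing step at the end is an optional extra and not part of the original answer:

```python
import string

s = "string. With. Punctuation?"

# Map every ASCII punctuation character to a space instead of None.
table = str.maketrans({key: " " for key in string.punctuation})
spaced = s.translate(table)

# Optional: collapse the resulting runs of whitespace.
normalized = " ".join(spaced.split())

print(repr(spaced))      # 'string  With  Punctuation '
print(repr(normalized))  # 'string With Punctuation'
```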

52

myString.translate(None, string.punctuation)

- pyrou

4

Hmm, I tried this but it doesn't work in all cases. myString.translate(string.maketrans("",""), string.punctuation) works fine, though. - Aidan Kane

12

Note that the deletechars argument is not supported for str in Python 3 or for unicode in Python 2. - agf

4

myString.translate(string.maketrans("",""), string.punctuation) will NOT work with unicode strings (found out the hard way). - Marc Maxmeister

58

TypeError: translate() takes exactly one argument (2 given) :( - Brian Tingle

3

@BrianTingle See the Python 3 code in my comment (it passes one argument). Follow the link to see Python 2 code that works with Unicode, along with its Python 3 adaptation. - jfs
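For anyone hitting the TypeError reported above, this sketch reproduces it on Python 3 and shows a one-argument call that works there. It relies on the three-argument form of str.maketrans(), whose third argument lists the characters to delete:

```python
import string

s = "string. With. Punctuation?"

# Python 3: str.translate() accepts only the table, so the
# two-argument Python 2 call fails.
try:
    s.translate(None, string.punctuation)
    error = None
except TypeError as exc:
    error = exc

# The third argument of str.maketrans() lists characters to delete.
result = s.translate(str.maketrans('', '', string.punctuation))

print(type(error).__name__)  # TypeError
print(result)                # string With Punctuation
```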

35

string.punctuation contains ASCII characters only! A more correct (but also slower) way is to use the unicodedata module:

# -*- coding: utf-8 -*-
from unicodedata import category
s = u'String — with -  «punctation »...'
s = ''.join(ch for ch in s if category(ch)[0] != 'P')
print 'stripped', s

You can generalize this and strip other types of characters as well:

''.join(ch for ch in s if category(ch)[0] not in 'SP')

It will also strip characters like ~*+§$, which may or may not be "punctuation" depending on your point of view.
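The snippets above are written for Python 2. A Python 3 sketch of the same idea could look like the following; the helper name strip_categories and its majors parameter are invented for this example:

```python
from unicodedata import category

def strip_categories(text, majors="P"):
    """Drop characters whose major Unicode category is in `majors`."""
    return ''.join(ch for ch in text if category(ch)[0] not in majors)

s = "String \u2014 with - \u00abpunctuation\u00bb... ~*+\u00a7$"

print(strip_categories(s))        # punctuation (P*) only
print(strip_categories(s, "PS"))  # punctuation and symbols (P*, S*)
```

Passing "PS" corresponds to the generalized version, which also drops symbols such as ~, + and $.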

- Björn Lindqvist

4

You could also use regex.sub(ur"\p{P}+", "", text). - jfs

1

Unfortunately, characters like ~ are not part of the punctuation category. You need to test for the symbols category as well. - C.J. Jackson

34

Not necessarily simpler, but a different approach if you are more familiar with the re family.

import re, string
s = "string. With. Punctuation?" # Sample string 
out = re.sub('[%s]' % re.escape(string.punctuation), '', s)

- Vinko Vrsalovic

1

This works because string.punctuation contains the sequence ,-. in proper, ascending, gap-free ASCII order. Python gets this right, but when you try to use a subset of string.punctuation, the surprise "-" can be a show-stopper. - S.Lott

2

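Actually, it's still wrong. The sequence "\]" is treated as an escape (which coincidentally leaves the "]" unclosed and so sidesteps another failure), but it leaves \ unescaped. You should use re.escape(string.punctuation) to prevent this. - Brian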

1

Yes, I omitted it because it worked for the example and kept things simple, but you're right that it should be included. - Vinko Vrsalovic
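The "-" pitfall discussed in these comments can be shown with a small sketch. The subset "+-." and the sample text are hypothetical, chosen only to make the accidental range visible:

```python
import re

subset = "+-."     # hypothetical subset of string.punctuation
text = "1,000 + 2.5 - 3"

# Unescaped, "+-." inside [...] is the range "+" through ".",
# which also covers the comma.
naive = re.sub('[%s]' % subset, '', text)

# re.escape() makes "-" literal, so only the three listed
# characters are removed.
safe = re.sub('[%s]' % re.escape(subset), '', text)

print(naive)  # 1000  25  3
print(safe)   # 1,000  25  3
```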

32

I usually use something like this:

>>> s = "string. With. Punctuation?" # Sample string
>>> import string
>>> for c in string.punctuation:
...     s= s.replace(c,"")
...
>>> s
'string With Punctuation'

- S.Lott

2

An ugly one-liner: reduce(lambda s,c: s.replace(c, ''), string.punctuation, s). - jfs
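On Python 3 the one-liner needs an import, because reduce() lives in functools there. A minimal sketch (the lambda parameter is renamed to acc here only to avoid shadowing s):

```python
import string
from functools import reduce  # reduce() is not a builtin in Python 3

s = "string. With. Punctuation?"

# Fold str.replace over every ASCII punctuation character.
out = reduce(lambda acc, c: acc.replace(c, ''), string.punctuation, s)

print(out)  # string With Punctuation
```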

1

Great, but it doesn't remove some punctuation, such as longer dashes. - Vladimir Stazhilov

16

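For Python 3 str or Python 2 unicode values, str.translate() only takes a dictionary; code points (integers) are looked up in that mapping, and anything mapped to None is removed.

So, to remove (some?) punctuation, use: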

import string

remove_punct_map = dict.fromkeys(map(ord, string.punctuation))
s.translate(remove_punct_map)

The dict.fromkeys() class method makes it trivial to create the mapping, setting every value to None based on the sequence of keys.
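To make the shape of that mapping concrete, here is a tiny Python 3 sketch using just two characters; small_map is a name made up for this example:

```python
# A tiny map for two characters, to keep the output readable.
small_map = dict.fromkeys(map(ord, "!?"))
print(small_map)  # {33: None, 63: None}

# Code points mapped to None are deleted by str.translate().
cleaned = "Really?! Yes!".translate(small_map)
print(cleaned)  # Really Yes
```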

To remove all punctuation, not just ASCII punctuation, your table needs to be a little bigger. See J.F. Sebastian's answer (Python 3 version):

import unicodedata
import sys

remove_punct_map = dict.fromkeys(i for i in range(sys.maxunicode)
                                 if unicodedata.category(chr(i)).startswith('P'))
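A usage sketch for the table above, on Python 3. The sample string is made up, and the range is extended by one here so that sys.maxunicode itself is also checked, which is a small deviation from the snippet above:

```python
import sys
import unicodedata

# Every code point whose category starts with "P" maps to None.
# range() stops before its end, so add 1 to include sys.maxunicode.
remove_punct_map = dict.fromkeys(
    i for i in range(sys.maxunicode + 1)
    if unicodedata.category(chr(i)).startswith('P'))

s = "\u00abHello\u00bb, world \u2014 ok\u2026"
cleaned = s.translate(remove_punct_map)
print(cleaned)  # Hello world  ok
```

Building the table walks over more than a million code points, so it is worth creating it once and reusing it.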

- Martijn Pieters

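string.punctuation is not enough to support Unicode. See my answer. - jfs

@J.F.Sebastian: Indeed, my answer just used the same characters as the top-voted one. I've added a Python 3 version of your table. - Martijn Pieters

The top-voted answer works only for ASCII strings. Your answer explicitly claims Unicode support. - jfs

1

@J.F.Sebastian: It works for Unicode strings. It strips ASCII punctuation. I never claimed it strips all punctuation. :-) The point was to provide the correct technique for unicode objects versus Python 2 str objects. - Martijn Pieters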

15

string.punctuation misses loads of punctuation marks that are commonly used in the real world. How about a solution that works for non-ASCII punctuation?

import regex
s = u"string. With. Some・Really Weird、Non？ASCII。 「（Punctuation）」?"
remove = regex.compile(ur'[\p{C}|\p{M}|\p{P}|\p{S}|\p{Z}]+', regex.UNICODE)
remove.sub(u" ", s).strip()

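Personally, I believe this is the best way to remove punctuation from a string in Python because:

It removes all Unicode punctuation.
It's easy to modify, e.g. you can drop \p{S} if you want to remove punctuation but keep symbols like $.
You can be very specific about what you keep and what you remove, e.g. \p{Pd} removes only dashes.
This regex also normalizes whitespace: it maps tabs, carriage returns and other oddities to nice single spaces.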

This uses Unicode character properties, which you can read more about on Wikipedia.
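For readers who cannot install the third-party regex module, the same idea can be roughly approximated with the standard library on Python 3. This is only a sketch under that assumption, not the code from this answer; the helper name replace_runs and the REMOVE constant are invented here:

```python
import unicodedata

REMOVE = set("CMPSZ")  # major categories removed by the pattern above

def replace_runs(text, majors=REMOVE):
    """Replace each run of characters in `majors` with one space."""
    out = []
    in_run = False
    for ch in text:
        if unicodedata.category(ch)[0] in majors:
            if not in_run:
                out.append(" ")
            in_run = True
        else:
            out.append(ch)
            in_run = False
    return "".join(out).strip()

s = ("string. With. Some\u30fbReally Weird\u3001Non\uff1fASCII\u3002 "
     "\u300c\uff08Punctuation\uff09\u300d?")
print(replace_runs(s))
print(replace_runs("cost: $5", set("CMPZ")))  # keeps symbols such as "$"
```

Leaving "S" out of the set mirrors the suggestion to drop \p{S} when symbols like $ should be kept.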

- Zach

1

This line doesn't actually work: remove = regex.compile(ur'[\p{C}|\p{M}|\p{P}|\p{S}|\p{Z}]+', regex.UNICODE). - John Stud

@JohnStud It breaks on Python 3 because all strings are Unicode by default now. Remove the "u" prefixes on lines 2, 3 and 4 and it runs fine. - Joel Wigton

12

I haven't seen this answer yet. Just use a regex; it removes every character that is not a word character (\w), a digit (\d) or whitespace (\s):

import re
s = "string. With. Punctuation?" # Sample string 
out = re.sub(r'[^\w\d\s]+', '', s)

- Blairg23

3

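\d is already included in \w, so it is redundant. - blhsing

Are digits considered a subset of word characters? I thought a word character was any character that could form a real word, e.g. a-zA-Z? - Blairg23

Yes, in regex a "word" character includes letters, digits and the underscore. See the description of \w in the documentation: https://docs.python.org/3/library/re.html - blhsing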
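A quick way to check the point about \d and \w yourself, as a small Python 3 sketch with a made-up sample string:

```python
import re
import string

# Every ASCII digit, and the underscore, is matched by \w.
assert all(re.fullmatch(r'\w', d) for d in string.digits)
assert re.fullmatch(r'\w', '_')

s = "Room 101, floor_2: ready?"
with_d = re.sub(r'[^\w\d\s]+', '', s)
without_d = re.sub(r'[^\w\s]+', '', s)

print(with_d == without_d)  # True
print(without_d)            # Room 101 floor_2 ready
```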

- Brian · Accepted Answer

From an efficiency perspective, you're not going to beat

s.translate(None, string.punctuation)

For newer versions of Python (Python 3), use the following code:

s.translate(str.maketrans('', '', string.punctuation))

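It performs raw string operations in C with a lookup table. Not much will beat that, short of writing your own C code.

If speed isn't a worry, another option is: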

exclude = set(string.punctuation)
s = ''.join(ch for ch in s if ch not in exclude)

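This is faster than calling s.replace for each character, but it won't perform as well as non-pure-Python approaches such as regexes or string.translate, as the timings below show. For this type of problem, doing the work at as low a level as possible pays off.

Timing code: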

import re, string, timeit

s = "string. With. Punctuation"
exclude = set(string.punctuation)
table = string.maketrans("","")
regex = re.compile('[%s]' % re.escape(string.punctuation))

def test_set(s):
    return ''.join(ch for ch in s if ch not in exclude)

def test_re(s):  # From Vinko's solution, with fix.
    return regex.sub('', s)

def test_trans(s):
    return s.translate(table, string.punctuation)

def test_repl(s):  # From S.Lott's solution
    for c in string.punctuation:
        s=s.replace(c,"")
    return s

print "sets      :",timeit.Timer('f(s)', 'from __main__ import s,test_set as f').timeit(1000000)
print "regex     :",timeit.Timer('f(s)', 'from __main__ import s,test_re as f').timeit(1000000)
print "translate :",timeit.Timer('f(s)', 'from __main__ import s,test_trans as f').timeit(1000000)
print "replace   :",timeit.Timer('f(s)', 'from __main__ import s,test_repl as f').timeit(1000000)

This gives the following results:

sets      : 19.8566138744
regex     : 6.86155414581
translate : 2.12455511093
replace   : 28.4436721802