Python: how to remove punctuation from a Unicode string, except for the apostrophe

Question

I found several related threads, which led me to this solution:

sentence=re.sub(ur"[^\P{P}'|-]+",'',sentence)

This is supposed to strip all punctuation except ', but the problem is that it also removes everything else from the sentence.

Example:

>>> sentence="warhol's art used many types of media, including hand drawing, painting, printmaking, photography, silk screening, sculpture, film, and music."
>>> sentence=re.sub(ur"[^\P{P}']+",'',sentence)
>>> print sentence
'

Of course, I want to keep the sentence itself, just without punctuation, and with "warhol's" left as it is.

"warhol's art used many types of media including hand drawing painting printmaking photography silk screening sculpture film and music"
"austro-hungarian empire"

Edit:

I also tried using

tbl = dict.fromkeys(i for i in xrange(sys.maxunicode)
    if unicodedata.category(unichr(i)).startswith('P')) 
sentence = sentence.translate(tbl)

But this strips every punctuation mark, the apostrophe included.
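
One way to adapt that table so it spares chosen characters, sketched here in Python 3 (so `range` and `chr` replace `xrange` and `unichr`). The `KEEP` set and the `strip_punctuation` name are illustrative assumptions, chosen to match the desired output above, not something taken from the thread.

```python
import sys
import unicodedata

# Characters to preserve. This set is an assumption based on the
# desired output above (apostrophe and hyphen).
KEEP = {"'", "-"}

# Map every code point whose Unicode category starts with "P" to None,
# except the ones in KEEP. str.translate deletes keys mapped to None.
PUNCT_TABLE = dict.fromkeys(
    i for i in range(sys.maxunicode + 1)
    if unicodedata.category(chr(i)).startswith("P") and chr(i) not in KEEP
)

def strip_punctuation(text):
    """Remove Unicode punctuation from text, keeping characters in KEEP."""
    return text.translate(PUNCT_TABLE)

print(strip_punctuation("warhol's art, film, and music."))
print(strip_punctuation("austro-hungarian empire!"))
```

The typographic apostrophe ’ (U+2019) is also in a punctuation category (Pf), so it would need to be added to `KEEP` if the input may contain it.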

- KameeCoding

Here (https://dev59.com/NGEi5IYBdhLWcg3whcgg) it says this should remove all punctuation except hyphens. - KameeCoding

Oops, you're right; I'm not very familiar with the constructs of the new regex module. - Martijn Pieters
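
A likely explanation for the result in the question, offered as an assumption rather than something the thread states: `\P{P}` is Unicode-property syntax from the third-party `regex` module. Python 2's standard `re` module does not support it and treats the unknown escape `\P` as a literal `P`, so `[^\P{P}']` ends up meaning "any character other than `P`, `{`, `}` or `'`". The sketch below uses that equivalent class in Python 3 (where `\P` in `re` is an error instead) to reproduce the output; the shortened sentence and the `effective_pattern` name are illustrative.

```python
import re

sentence = "warhol's art used many types of media, including hand drawing."

# What Python 2's re effectively saw: a negated class made of the
# literal characters P, {, } and '.
effective_pattern = r"[^P{}']+"

# Every character other than those four literals is removed,
# which leaves only the apostrophe.
print(re.sub(effective_pattern, "", sentence))  # prints: '
```

With the `regex` package installed, passing the original pattern to `regex.sub` instead of `re.sub` should behave as the linked solution describes.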

1 Answer

- C.B. · Accepted Answer

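Specify everything you do not want removed, such as \w, \d, \s, and so on. That is what the ^ operator does inside square brackets: it matches anything except the characters listed.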

>>> import re
>>> sentence="warhol's art used many types of media, including hand drawing, painting, printmaking, photography, silk screening, sculpture, film, and music."
>>> print re.sub(ur"[^\w\d'\s]+",'',sentence)
warhol's art used many types of media including hand drawing painting printmaking photography silk screening sculpture film and music
>>>
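
The snippet above is Python 2. A Python 3 sketch of the same idea follows, with two changes that are assumptions on our part: the hyphen is added to the kept characters so that "austro-hungarian" survives as the question requests, and `\d` is dropped because `\w` already covers digits. The names `NOT_KEPT` and `strip_punctuation` are illustrative.

```python
import re

# Negated class: match anything that is not a word character, whitespace,
# an apostrophe, or a hyphen. The hyphen sits last so it is read literally.
NOT_KEPT = re.compile(r"[^\w\s'-]+")

def strip_punctuation(text):
    """Delete every run of characters that is not explicitly kept."""
    return NOT_KEPT.sub("", text)

print(strip_punctuation("warhol's art, film, and music."))
print(strip_punctuation("austro-hungarian empire"))
print(strip_punctuation("café, naïve; déjà-vu!"))
```

In Python 3, `str` patterns are Unicode-aware by default, so accented letters are kept. In Python 2 the equivalent needs a `unicode` input string and the `re.UNICODE` flag, otherwise `\w` matches ASCII only. Note that `\w` also matches the underscore, so `_` is not removed, and that this approach deletes symbols such as `$` or `+` as well, not only punctuation.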