Python: how to remove punctuation from a Unicode string, except the apostrophe

13

I found several related threads and came across this solution:

sentence=re.sub(ur"[^\P{P}'|-]+",'',sentence)

This is supposed to remove all punctuation except "'", but the problem is that it also removes everything else from the sentence.

Example:

>>> sentence="warhol's art used many types of media, including hand drawing, painting, printmaking, photography, silk screening, sculpture, film, and music."
>>> sentence=re.sub(ur"[^\P{P}']+",'',sentence)
>>> print sentence
'
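
A likely explanation for this output, offered as an assumption and not something stated in the thread: the standard `re` module has no `\p{...}` / `\P{...}` Unicode property classes, and Python 2's `re` reads the unknown escape `\P` inside a character class as a literal `P`. The class then excludes only `P`, `{`, `}` and `'`, so everything else is deleted. The sketch below reproduces that behaviour in Python 3 by writing the class out literally (Python 3 rejects `\P` as a bad escape, so the original pattern cannot be run unchanged); `literal_class` is a name introduced only for this illustration.

```python
import re

sentence = "warhol's art used many types of media, including hand drawing."

# Assumed equivalent of [^\P{P}'] under Python 2's re module:
# \P is taken as a literal "P", so the only characters kept are P, {, } and '.
literal_class = re.compile(r"[^P{}']+")

# Everything except the apostrophe matches the negated class and is removed.
print(literal_class.sub("", sentence))
```

Property classes such as `\P{P}` are a feature of the third-party `regex` module, which is presumably what the original solution was written for.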

Of course, I want to keep the sentence, just without punctuation, and with "warhol's" left as it is:

"warhol's art used many types of media including hand drawing painting printmaking photography silk screening sculpture film and music"
"austro-hungarian empire"

Edit:

I also tried using

tbl = dict.fromkeys(i for i in xrange(sys.maxunicode)
    if unicodedata.category(unichr(i)).startswith('P')) 
sentence = sentence.translate(tbl)

but this removes all punctuation, including the apostrophe.
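
One way to keep the `translate` approach while sparing selected characters is to leave them out of the table. This is a minimal sketch in Python 3 (the thread itself uses Python 2, where `range`/`chr` would be `xrange`/`unichr`); `KEEP` and `strip_punctuation` are hypothetical names chosen for this example.

```python
import sys
import unicodedata

# Hypothetical set of punctuation characters that should survive.
KEEP = {"'", "-"}

# Map every code point in a Unicode punctuation category (P*) to None,
# skipping the characters listed in KEEP.
table = {
    cp: None
    for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)).startswith("P") and chr(cp) not in KEEP
}

def strip_punctuation(text):
    """Delete punctuation from text, except the characters in KEEP."""
    return text.translate(table)

print(strip_punctuation("warhol's art, austro-hungarian empire."))
```

Building the table scans the whole Unicode range once, so it is worth constructing it a single time and reusing it.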


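It says here (https://dev59.com/NGEi5IYBdhLWcg3whcgg) that it should remove all punctuation except the hyphen. - KameeCoding
Oops, you're right; I'm not very familiar with the new regex module constructs. - Martijn Pieters
1 Answer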

17

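Specify all the elements you do not want removed, such as \w, \d, \s and so on. That is what the ^ operator means inside square brackets: match anything except the listed characters.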

>>> import re
>>> sentence="warhol's art used many types of media, including hand drawing, painting, printmaking, photography, silk screening, sculpture, film, and music."
>>> print re.sub(ur"[^\w\d'\s]+",'',sentence)
warhol's art used many types of media including hand drawing painting printmaking photography silk screening sculpture film and music
>>> 
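
The same idea can be sketched for Python 3, where the `ur""` prefix no longer exists and `\w` is Unicode-aware by default for `str` patterns (in Python 2 the `re.UNICODE` flag would be needed for non-ASCII letters). The names `NOT_KEPT` and `strip_punctuation_re` are assumptions made for this example, and the hyphen is added to the class as one more exception.

```python
import re

# Keep word characters, whitespace, apostrophes and hyphens; remove the rest.
# The hyphen is placed last in the class so it is read literally, not as a range.
NOT_KEPT = re.compile(r"[^\w\s'-]+")

def strip_punctuation_re(text):
    """Delete every character that is not a word char, whitespace, ' or -."""
    return NOT_KEPT.sub("", text)

print(strip_punctuation_re("warhol's art, film, and music."))
print(strip_punctuation_re("austro-hungarian empire?"))
```

Note that this whitelist is not exactly "punctuation only": `\w` includes the underscore, so `_` is kept, while symbols such as `$` are removed even though they are not punctuation.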

This works for the apostrophe; how do I add more exceptions, such as a dash or a question mark? - KameeCoding
just add \- to the ur".. - C.B.
