I found several related questions and came across this solution:
sentence=re.sub(ur"[^\P{P}'|-]+",'',sentence)
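This is supposed to strip all punctuation except the apostrophe ('), but the problem is that it also strips everything else from the sentence.
Example: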
>>> sentence="warhol's art used many types of media, including hand drawing, painting, printmaking, photography, silk screening, sculpture, film, and music."
>>> sentence=re.sub(ur"[^\P{P}']+",'',sentence)
>>> print sentence
'
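One possible explanation, offered as a sketch rather than a confirmed diagnosis: the standard-library re module has no \p{..} / \P{..} Unicode property classes, and in Python 2 an unknown escape such as \P is read as the plain letter P. Under that assumption the pattern behaves like a negated class of the literal characters P, {, } and ', so everything else is deleted. The snippet below is written for Python 3 and spells that literal class out by hand; the name literal_equivalent is made up for illustration.

```python
import re

sentence = ("warhol's art used many types of media, including hand drawing, "
            "painting, printmaking, photography, silk screening, sculpture, "
            "film, and music.")

# Assumption: Python 2's re treats \P as a plain "P", so [^\P{P}'] acts
# like this negated class (anything except P, {, } and ').
literal_equivalent = r"[^P{}']+"

# Every run of characters outside that small set is removed.
result = re.sub(literal_equivalent, "", sentence)
print(result)  # prints a single apostrophe
```

The sentence contains no uppercase P and no braces, which would account for the apostrophe being the only character left.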
Of course, what I want is the sentence without its punctuation, but with "warhol's" left intact:
"warhol's art used many types of media including hand drawing painting printmaking photography silk screening sculpture film and music"
"austro-hungarian empire"
Edit:
I also tried:
tbl = dict.fromkeys(i for i in xrange(sys.maxunicode)
                    if unicodedata.category(unichr(i)).startswith('P'))
sentence = sentence.translate(tbl)
But that strips all punctuation, including the apostrophe and hyphen I want to keep.
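The translation-table idea can be adapted by leaving the wanted characters out of the table. The following is a minimal sketch, assuming Python 3 (range and chr instead of xrange and unichr) and assuming that only the ASCII apostrophe and hyphen-minus should survive; KEEP and strip_punctuation are hypothetical names chosen for this example.

```python
import sys
import unicodedata

# Hypothetical set of punctuation characters to preserve.
KEEP = {"'", "-"}

# Map every code point in a Unicode punctuation category (P*) to None,
# skipping the characters we want to keep.
tbl = dict.fromkeys(
    i for i in range(sys.maxunicode + 1)
    if unicodedata.category(chr(i)).startswith("P") and chr(i) not in KEEP
)

def strip_punctuation(text):
    """Delete all punctuation from text except the characters in KEEP."""
    return text.translate(tbl)

print(strip_punctuation("warhol's art, film, and music."))
print(strip_punctuation("austro-hungarian empire"))
```

Building the table walks the whole Unicode range once, so it is worth creating it a single time and reusing it. Typographic variants such as the curly apostrophe (U+2019) are also punctuation and would still be removed unless they are added to KEEP.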
I'm not very familiar with the constructs of the regex module. - Martijn Pieters