It seems like there should be a simpler way than this:
import string
s = "string. With. Punctuation?" # Sample string
out = s.translate(string.maketrans("",""), string.punctuation)
Is there?
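For readers on Python 3: `string.maketrans` no longer exists there and `str.translate` accepts a single table argument, so the snippet above only runs on Python 2. A minimal sketch of the Python 3 spelling, assuming only the ASCII characters in `string.punctuation` need to be removed (the helper name `strip_punctuation` is made up for illustration):

```python
import string

# Build the table once; the third argument lists the characters to delete.
_DELETE_PUNCT = str.maketrans("", "", string.punctuation)

def strip_punctuation(text):
    """Return text with every character in string.punctuation removed."""
    return text.translate(_DELETE_PUNCT)

s = "string. With. Punctuation?"  # Sample string
out = strip_punctuation(s)
print(out)  # string With Punctuation
```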
import string
f = lambda x: ''.join([i for i in x if i not in string.punctuation])
Does string.punctuation include all possible Unicode punctuation? - ingyhere
Here is a one-liner for Python 3.5:
import string
"l*ots! o(f. p@u)n[c}t]u[a'ti\"on#$^?/".translate(str.maketrans({a:None for a in string.punctuation}))
Here is a function I wrote. It is not very efficient, but it is simple, and you can add or remove any punctuation you like:
def stripPunc(wordList):
    """Strips punctuation from list of words"""
    puncList = [".",";",":","!","?","/","\\",",","#","@","$","&",")","(","\""]
    for punc in puncList:
        # Rebuild the list once per punctuation mark.
        wordList = [word.replace(punc, '') for word in wordList]
    return wordList
puncList could just be a string. - Nic
import re
s = "string. With. Punctuation?" # Sample string
out = re.sub(r'[^a-zA-Z0-9\s]', '', s)
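As an update, I rewrote @Brian's example in Python 3 and changed it so that the regex compile step happens inside the function. My idea was to time every step needed for the function to work. Perhaps you are using distributed computing, cannot share regex objects between workers, and need the re.compile step in each worker. I was also curious to time two different ways of calling maketrans in Python 3: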
table = str.maketrans({key: None for key in string.punctuation})
vs
table = str.maketrans('', '', string.punctuation)
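The two spellings should produce the same translation table, because `str.maketrans` converts single-character dict keys to code points. A quick check, offered as a sketch (the names `table_dict`, `table_args`, and `sample` are illustrative only):

```python
import string

table_dict = str.maketrans({key: None for key in string.punctuation})
table_args = str.maketrans('', '', string.punctuation)

# Each table maps the code point of a punctuation character to None (delete).
print(table_dict == table_args)  # True

sample = "string. With. Punctuation?"
print(sample.translate(table_dict))  # string With Punctuation
print(sample.translate(table_args))  # string With Punctuation
```

If that holds, any timing gap between the two presumably comes from building the table rather than from `translate` itself.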
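I also added a method that uses a set, taking advantage of the intersection function to reduce the number of iterations.
Here is the complete code: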
import re, string, timeit

s = "string. With. Punctuation"

def test_set(s):
    exclude = set(string.punctuation)
    return ''.join(ch for ch in s if ch not in exclude)

def test_set2(s):
    _punctuation = set(string.punctuation)
    for punct in set(s).intersection(_punctuation):
        s = s.replace(punct, ' ')
    return ' '.join(s.split())

def test_re(s):  # From Vinko's solution, with fix.
    regex = re.compile('[%s]' % re.escape(string.punctuation))
    return regex.sub('', s)

def test_trans(s):
    table = str.maketrans({key: None for key in string.punctuation})
    return s.translate(table)

def test_trans2(s):
    table = str.maketrans('', '', string.punctuation)
    return s.translate(table)

def test_repl(s):  # From S.Lott's solution
    for c in string.punctuation:
        s = s.replace(c, "")
    return s

print("sets :", timeit.Timer('f(s)', 'from __main__ import s,test_set as f').timeit(1000000))
print("sets2 :", timeit.Timer('f(s)', 'from __main__ import s,test_set2 as f').timeit(1000000))
print("regex :", timeit.Timer('f(s)', 'from __main__ import s,test_re as f').timeit(1000000))
print("translate :", timeit.Timer('f(s)', 'from __main__ import s,test_trans as f').timeit(1000000))
print("translate2 :", timeit.Timer('f(s)', 'from __main__ import s,test_trans2 as f').timeit(1000000))
print("replace :", timeit.Timer('f(s)', 'from __main__ import s,test_repl as f').timeit(1000000))
sets : 3.1830138750374317
sets2 : 2.189873124472797
regex : 7.142953420989215
translate : 4.243278483860195
translate2 : 2.427158243022859
replace : 4.579746678471565
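The functions timed above deliberately rebuild the regex or table on every call. When a single process handles many strings, that setup can instead be done once at module level. A sketch under that assumption (the names `_PUNCT_RE`, `_PUNCT_TABLE`, `strip_re`, and `strip_trans` are hypothetical, and no timings are claimed for this variant):

```python
import re
import string

# Built once at import time instead of on every call.
_PUNCT_RE = re.compile('[%s]' % re.escape(string.punctuation))
_PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def strip_re(s):
    """Remove ASCII punctuation with the precompiled regex."""
    return _PUNCT_RE.sub('', s)

def strip_trans(s):
    """Remove ASCII punctuation with the prebuilt translation table."""
    return s.translate(_PUNCT_TABLE)

s = "string. With. Punctuation"
print(strip_re(s))     # string With Punctuation
print(strip_trans(s))  # string With Punctuation
```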
For cases where you do not need to be very strict, a short one-liner may help:
''.join([c for c in s if c.isalnum() or c.isspace()])
>>> import re
>>> s = "string. With. Punctuation?"
>>> s = re.sub(r'[^\w\s]','',s)
>>> re.split(r'\s+', s)
['string', 'With', 'Punctuation']
I was looking for a really simple solution. Here is what I came up with:
import re
s = "string. With. Punctuation?"
s = re.sub(r'[\W\s]', ' ', s)
print(repr(s))  # repr makes the doubled and trailing spaces visible
'string  With  Punctuation '
Here is a solution that does not need regular expressions.
import string
input_text = "!where??and!!or$$then:)"
# Python 3: str.maketrans replaces Python 2's string.maketrans
punctuation_replacer = str.maketrans(string.punctuation, ' '*len(string.punctuation))
print(' '.join(input_text.translate(punctuation_replacer).split()).strip())
Output>> where and or then
Why is nobody using this?
''.join(filter(str.isalnum, s))
Is it too slow?
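Speed aside, one thing to keep in mind with this approach is that `str.isalnum` is false for whitespace, so the words end up glued together. A small sketch of the difference, with a variant that keeps whitespace (the helper name `keep` is hypothetical):

```python
s = "string. With. Punctuation?"

# filter(str.isalnum, ...) drops spaces along with punctuation.
glued = ''.join(filter(str.isalnum, s))
print(glued)  # stringWithPunctuation

def keep(ch):
    """True for characters to retain: letters, digits, and whitespace."""
    return ch.isalnum() or ch.isspace()

spaced = ''.join(filter(keep, s))
print(spaced)  # string With Punctuation
```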
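"The temperature in the O'Reilly & Arbuthnot-Smythe server's main rack is 40.5 degrees." contains one punctuation mark, namely the second period. Take care not to change the meaning. - John Machin
string.punctuation does not include non-English punctuation at all. I am thinking of “。”, “!”, “?”, “:”, “×”, ““”, “””, 〟 and so on. - Clément
' ' is punctuation. - Wayne Werner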
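On the non-English punctuation raised in the comments: one possible approach, not taken from any answer in this thread, is to test each character's Unicode general category with the standard `unicodedata` module instead of relying on `string.punctuation`. A minimal sketch, assuming "punctuation" means any category starting with `P` (the function name `strip_unicode_punctuation` is made up for illustration):

```python
import unicodedata

def strip_unicode_punctuation(text):
    """Drop every character whose Unicode general category starts with 'P'."""
    return ''.join(
        ch for ch in text
        if not unicodedata.category(ch).startswith('P')
    )

print(strip_unicode_punctuation("你好。世界！“引用”"))  # 你好世界引用
```

Note that this definition differs from `string.punctuation`: characters such as `×`, `$`, and `+` are classified as symbols (categories starting with `S`), so they are left in place unless those categories are excluded as well.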