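I have an example string: happy t00 go 129.129
and I want to keep only the spaces and letters. The best I have come up with so far, which is fairly efficient, is:
print(re.sub(r"\d", "", 'happy t00 go 129.129'.replace('.', '')))
But it only works for my example string. How can I remove every character except letters and spaces?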
whitelist = set('abcdefghijklmnopqrstuvwxyz ABCDEFGHIJKLMNOPQRSTUVWXYZ')
myStr = "happy t00 go 129.129$%^&*("
answer = ''.join(filter(whitelist.__contains__, myStr))
Output:
>>> answer
'happy t go '
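Gronk: Timing it in a loop with `python -m timeit -n 100 -s`, I found this to be 0.0029 microseconds faster than Joel's answer.
Joel Cornett: `Timer('re.sub(r"[^a-zA-Z ]+", "", myStr)', '''import re ... myStr = 'happy t00 go 129.129' * 10''').timeit(number=1000)` returns `0.011039972305297852`. My point is that, for a sample size of 100, 0.0029 microseconds is definitely within normal variation.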
Use a negated character set (the complement of the set of characters to keep):
re.sub(r'[^a-zA-Z ]+', '', 'happy t00 go 129.129')
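As a minimal sketch of how the negated class behaves, the same pattern can be precompiled and wrapped in a small helper. The name `keep_letters_and_spaces` is hypothetical and not part of the answer. Note that `[^a-zA-Z ]` matches anything that is not an ASCII letter or a space, so accented letters such as `é` are removed too.

```python
import re

# Same pattern as in the answer above, compiled once for reuse.
_NOT_LETTER_OR_SPACE = re.compile(r'[^a-zA-Z ]+')

def keep_letters_and_spaces(text):
    """Remove every character that is not an ASCII letter or a space.

    Hypothetical helper name, used here for illustration only.
    """
    return _NOT_LETTER_OR_SPACE.sub('', text)

print(repr(keep_letters_and_spaces('happy t00 go 129.129')))  # 'happy t go '
```

The `+` quantifier is optional for correctness; it only lets the regex engine remove a run of unwanted characters in one substitution instead of one per character.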
A slight improvement on inspectorG4dget's approach: import the letters from the string module and use a generator expression:
from string import ascii_letters
allowed = set(ascii_letters + ' ')
myStr = 'happy t00 go 129.129'
answer = ''.join(l for l in myStr if l in allowed)
answer
# >>> 'happy t go '
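If letters outside ASCII should be kept as well (an assumption that goes beyond the original question), one possible variation is to test each character with `str.isalpha()` instead of a fixed whitelist. The function name below is hypothetical.

```python
def keep_unicode_letters_and_spaces(text):
    """Keep characters that Python classifies as letters, plus plain spaces.

    Hypothetical variation: str.isalpha() is True for non-ASCII letters
    such as 'é', unlike the ascii_letters whitelist.
    """
    return ''.join(ch for ch in text if ch.isalpha() or ch == ' ')

print(repr(keep_unicode_letters_and_spaces('café t00 go 129.129')))  # 'café t go '
```

For purely ASCII input this gives the same result as the whitelist version.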
(I made myStr longer and precompiled the regex to make things more interesting.)
import re
from string import ascii_letters
myStr = 'happy t00 go 129.129'*20
allowed = set(ascii_letters + ' ')
# Generator
%timeit answer = ''.join(l for l in myStr if l in allowed)
# filter/__contains__
%timeit answer = ''.join(filter(allowed.__contains__, myStr))
# Regex
pat = re.compile(r'[^a-zA-Z ]+')
%timeit answer = re.sub(pat, '', myStr)
53 µs ± 6.43 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
43.3 µs ± 7.48 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
26 µs ± 509 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
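The `%timeit` magic only works inside IPython or Jupyter. As a rough sketch, the same comparison can be run as a plain script with the standard `timeit` module. The function names, the `results` dictionary, and the `number` and `repeat` values are arbitrary choices made for illustration, and the absolute timings will differ from machine to machine.

```python
import re
import timeit
from string import ascii_letters

my_str = 'happy t00 go 129.129' * 20
allowed = set(ascii_letters + ' ')
pat = re.compile(r'[^a-zA-Z ]+')

def with_generator():
    return ''.join(ch for ch in my_str if ch in allowed)

def with_filter():
    return ''.join(filter(allowed.__contains__, my_str))

def with_regex():
    return pat.sub('', my_str)

# Sanity check: the three approaches must produce the same string.
assert with_generator() == with_filter() == with_regex()

number = 500   # calls per timing run (arbitrary)
results = {}
for func in (with_generator, with_filter, with_regex):
    # repeat() returns one total time per run; the minimum is the least noisy.
    runs = timeit.repeat(func, number=number, repeat=5)
    results[func.__name__] = min(runs) / number * 1e6  # microseconds per call
    print(f'{func.__name__}: {results[func.__name__]:.1f} µs per call')
```

Repeating the measurement and looking at the spread between runs helps to judge whether a small difference is real or just noise.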