You can combine the blacklist into a single regular expression:
import re
blacklist = re.compile('|'.join([re.escape(word) for word in B]))
Then filter out any word that the pattern matches:
C = [word for word in A if not blacklist.search(word)]
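The words in the pattern are escaped (so that metacharacters such as . are treated as literal characters rather than as metacharacters) and joined into a series of | alternatives: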
>>> '|'.join([re.escape(word) for word in B])
'XXX|BBB'
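To see why the escaping matters, here is a minimal sketch using a hypothetical blacklist entry, a.c, that is not part of the example above. Without re.escape the dot would match any character:

```python
import re

# Hypothetical blacklist entry containing a regex metacharacter.
raw = "a.c"

unescaped = re.compile(raw)             # '.' matches any character
escaped = re.compile(re.escape(raw))    # '.' matches only a literal dot

loose_hit = unescaped.search("abc") is not None     # '.' matched 'b'
strict_hit = escaped.search("abc") is not None      # no literal dot in "abc"
literal_hit = escaped.search("xa.cx") is not None   # literal "a.c" is present

print(loose_hit, strict_hit, literal_hit)
```

The escaped pattern behaves like a plain substring test, which is what a blacklist of literal words needs.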
Demo:
>>> import re
>>> A = [ 'cat', 'doXXXg', 'monkey', 'hoBBBrse', 'fish', 'snake']
>>> B = ['XXX', 'BBB']
>>> blacklist = re.compile('|'.join([re.escape(word) for word in B]))
>>> [word for word in A if not blacklist.search(word)]
['cat', 'monkey', 'fish', 'snake']
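One edge case worth guarding against: if the blacklist is empty, the joined pattern is the empty string, which matches every input, so every word would be filtered out. The following is a sketch of a wrapper that handles this; the name filter_blacklisted is made up for illustration.

```python
import re

def filter_blacklisted(words, banned):
    """Return the words that contain none of the banned substrings."""
    if not banned:
        # An empty alternation compiles to '', which matches every string,
        # so without this guard every word would be dropped.
        return list(words)
    pattern = re.compile('|'.join(re.escape(b) for b in banned))
    return [w for w in words if not pattern.search(w)]

A = ['cat', 'doXXXg', 'monkey', 'hoBBBrse', 'fish', 'snake']
kept = filter_blacklisted(A, ['XXX', 'BBB'])
untouched = filter_blacklisted(A, [])
print(kept)
print(untouched)
```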
This should be faster than an explicit substring membership test with any(), especially as the number of words in your blacklist grows:
>>> import string, random, timeit
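>>> def regex_filter(words, blacklist):
...     # blacklist is a compiled pattern here
...     return [word for word in words if not blacklist.search(word)]
...
>>> def any_filter(words, blacklist):
...     # blacklist is a plain list of substrings here
...     return [word for word in words if not any(bad in word for bad in blacklist)]
...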
>>> words = [''.join([random.choice(string.ascii_letters) for _ in range(random.randint(3, 20))])
...          for _ in range(1000)]
>>> blacklist = [''.join([random.choice(string.ascii_letters) for _ in range(random.randint(2, 5))])
...              for _ in range(10)]
>>> timeit.timeit('any_filter(words, blacklist)', 'from __main__ import any_filter, words, blacklist', number=100000)
0.36232495307922363
>>> timeit.timeit('regex_filter(words, blacklist)', "from __main__ import re, regex_filter, words, blacklist; blacklist = re.compile('|'.join([re.escape(word) for word in blacklist]))", number=100000)
0.2499098777770996
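The test above checks 10 random short blacklist words (2-5 characters) against a list of 1000 random words (3-20 characters long). In the run shown, the regular expression took about 0.25 s against 0.36 s for any(), roughly 30% less time.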
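For anyone who wants to repeat the comparison as a standalone script, here is a hedged sketch. The helper random_word, the fixed seed, and the repeat count of 200 are my own choices rather than part of the session above, and the relative timings may differ considerably across Python versions and inputs, so no expected numbers are given.

```python
import random
import re
import string
import timeit

def regex_filter(words, pattern):
    # pattern is a compiled regex of escaped alternatives
    return [w for w in words if not pattern.search(w)]

def any_filter(words, banned):
    # banned is a plain list of substrings
    return [w for w in words if not any(b in w for b in banned)]

def random_word(lo, hi):
    # Hypothetical helper: random ASCII letters, length between lo and hi
    length = random.randint(lo, hi)
    return ''.join(random.choice(string.ascii_letters) for _ in range(length))

random.seed(0)  # fixed seed so repeated runs use the same data
words = [random_word(3, 20) for _ in range(1000)]
banned = [random_word(2, 5) for _ in range(10)]
pattern = re.compile('|'.join(re.escape(b) for b in banned))

# Both approaches should keep exactly the same words.
same_result = regex_filter(words, pattern) == any_filter(words, banned)

t_any = timeit.timeit(lambda: any_filter(words, banned), number=200)
t_regex = timeit.timeit(lambda: regex_filter(words, pattern), number=200)
print("same result:", same_result)
print("any(): %.3fs  regex: %.3fs" % (t_any, t_regex))
```

Checking that both filters return the same list before timing them guards against measuring two functions that do different work.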