Replace all words from a word list with another string in Python

10

I have a user-entered string and want to search it and replace any occurrence of a word from a word list with my replacement string.

import re

prohibitedWords = ["MVGame","Kappa","DatSheffy","DansGame","BrainSlug","SwiftRage","Kreygasm","ArsonNoSexy","GingerPower","Poooound","TooSpicy"]


# word[1] contains the user entered message
themessage = str(word[1])    
# would like to implement a foreach loop here but not sure how to do it in python
for themessage in prohibitedwords:
    themessage =  re.sub(prohibitedWords, "(I'm an idiot)", themessage)

print themessage

The code above doesn't work, and I'm sure I don't understand how for loops work in Python.


You should try looking at the Python implementation of SpamBayes; it may be more scalable. - dusual
4 Answers

39
You can do this with a single call to sub:
big_regex = re.compile('|'.join(map(re.escape, prohibitedWords)))
the_message = big_regex.sub("repl-string", str(word[1]))

Example:

>>> import re
>>> prohibitedWords = ['Some', 'Random', 'Words']
>>> big_regex = re.compile('|'.join(map(re.escape, prohibitedWords)))
>>> the_message = big_regex.sub("<replaced>", 'this message contains Some really Random Words')
>>> the_message
'this message contains <replaced> really <replaced> <replaced>'

Note that using str.replace can lead to subtle bugs, because each pass also rewrites text produced by an earlier pass:

>>> words = ['random', 'words']
>>> text = 'a sample message with random words'
>>> for word in words:
...     text = text.replace(word, 'swords')
... 
>>> text
'a sample message with sswords swords'

Using re.sub gives the correct result:

>>> big_regex = re.compile('|'.join(map(re.escape, words)))
>>> big_regex.sub("swords", 'a sample message with random words')
'a sample message with swords swords'

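As thg435 pointed out, if you want to replace whole words rather than every matching substring, you can add word boundaries to the regular expression: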
big_regex = re.compile(r'\b%s\b' % r'\b|\b'.join(map(re.escape, words)))

This replaces 'random' in 'random words' but not in 'pseudorandom words'.
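As a minimal sketch of the word-boundary variant, the pattern can be wrapped in a small helper. The function name `replace_words` and its parameters are hypothetical, chosen only for illustration; the non-capturing group `(?:...)` is an equivalent way of applying `\b` to every alternative at once.

```python
import re

def replace_words(message, words, replacement):
    # Hypothetical helper, not part of the original answer.
    # Escape each word so regex metacharacters are matched literally.
    alternatives = '|'.join(map(re.escape, words))
    # \b on both sides restricts matches to whole words.
    pattern = re.compile(r'\b(?:%s)\b' % alternatives)
    return pattern.sub(replacement, message)

words = ['random', 'words']
whole = replace_words('random words', words, 'swords')
partial = replace_words('pseudorandom words', words, 'swords')
print(whole)    # swords swords
print(partial)  # pseudorandom swords
```

Note that `\b` only marks a boundary between a word character and a non-word character, so this sketch assumes the prohibited words begin and end with letters, digits, or underscores.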

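If you need to replace a lot of words, you'll have to break it up. - DSM
You might want to add \b to the expression to avoid replacing "tail" in "retailers". - georg
When I use this code I get a weird duplicated string (the entire line gets printed twice). - Zac
@Zac It works fine for me. Can you edit your answer and show what you are doing and the output you get? - Bakuriu
You're right, but I think this is what the OP asked for, replacing words rather than characters, which is why I came here. Thanks for your answer! - sergiuz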

6

Try this:

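prohibitedWords = ["MVGame","Kappa","DatSheffy","DansGame","BrainSlug","SwiftRage","Kreygasm","ArsonNoSexy","GingerPower","Poooound","TooSpicy"]

# word[1] contains the user-entered message (as in the question)
themessage = str(word[1])
# use a distinct loop variable so the `word` list is not shadowed
for banned in prohibitedWords:
    themessage = themessage.replace(banned, "(I'm an idiot)")

print(themessage)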

This is fragile: as Bakuriu explains, it breaks easily when a prohibited word is a substring of another string. - Adam
1
@codesparkle That doesn't mean it's wrong; you always choose your option based on the specific conditions. - Artsiom Rudzenka

1

Building on Bakuriu's answer,

a simpler way to use re.sub is as follows.

import re

words = ['random', 'words']
text = 'a sample message with random words'

new_sentence = re.sub("random|words", "swords", text)

The result is 'a sample message with swords swords'.
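The pattern above is typed out by hand, so it has to be kept in sync with the `words` list. A possible variation, sketched here under the assumption that the list is the single source of truth, builds the alternation from the list and escapes each entry, as in Bakuriu's answer. The name `pattern` is introduced only for this illustration.

```python
import re

words = ['random', 'words']
text = 'a sample message with random words'

# Build "random|words" from the list; re.escape keeps any regex
# metacharacters in the words from being interpreted as syntax.
pattern = '|'.join(re.escape(w) for w in words)
new_sentence = re.sub(pattern, 'swords', text)
print(new_sentence)  # a sample message with swords swords
```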

0

Code:

prohibitedWords =["MVGame","Kappa","DatSheffy","DansGame",
                  "BrainSlug","SwiftRage","Kreygasm",
                  "ArsonNoSexy","GingerPower","Poooound","TooSpicy"]
themessage = 'Brain'   
self_criticism = '(I`m an idiot)'
final_message = [i.replace(themessage, self_criticism) for i in prohibitedWords]
print(final_message)

Result:

['MVGame', 'Kappa', 'DatSheffy', 'DansGame', '(I`m an idiot)Slug', 'SwiftRage',
'Kreygasm', 'ArsonNoSexy', 'GingerPower', 'Poooound','TooSpicy']
