Loop through a list of strings and remove all banned words from each string item

3

I have the following list:

dirtylist = ["lemons zested", "grated cheddar cheese", "carrots, thinly chopped"]

Here is a list of words that I want removed from each string item in the list:
bannedWord = ['grated', 'zested', 'thinly', 'chopped', ',']

The resulting list I am trying to produce is:
cleaner_list = ["lemons", "cheddar cheese", "carrots"]

So far I have not managed to achieve this. My attempt is shown below:
import re

dirtylist = ["lemons zested", "grated cheddar cheese", "carrots, thinly chopped"]
cleaner_list = []
    
def RemoveBannedWords(ing):
    pattern = re.compile("\\b(grated|zested|thinly|chopped)\\W", re.I)
    return pattern.sub("", ing)
    
for ing in dirtylist:
    cleaner_ing = RemoveBannedWords(ing)
    cleaner_list.append(cleaner_ing)
    
print(cleaner_list)

This returns:
['lemons zested', 'cheddar cheese', 'carrots, chopped']

I have also tried the following:

import re

dirtylist = ["lemons zested", "grated cheddar cheese", "carrots, thinly chopped"]
cleaner_list = []

bannedWord = ['grated', 'zested', 'thinly', 'chopped']
re_banned_words = re.compile(r"\b(" + "|".join(bannedWord) + ")\\W", re.I)

def remove_words(ing):
    global re_banned_words
    return re_banned_words.sub("", ing)

for ing in dirtylist:
    cleaner_ing = remove_words(ing)
    cleaner_list.append(cleaner_ing)
  
print(cleaner_list)

This returns:

['lemons zested', 'cheddar cheese', 'carrots, chopped']

I am a bit lost and not sure where I went wrong. Any help is much appreciated.


Try simplifying it by exploring set; it would be much clearer... The question is, why is "," a banned word? - Daniel Hao
4 Answers

2
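A few issues:
  • The final \W in your regex requires a non-word character after the banned word, so the match fails when the banned word is the last word of the input string. You can use \b there, just as you do at the start of the regex.

  • Since you want to remove commas, you need to add the comma as an alternative. Make sure not to put it inside the same capture group, otherwise the trailing \\b would require the comma to be followed by a word character (a letter, digit, or underscore). So it should go as an alternative at the very end (or the very start) of your regex.

  • You probably want to call .strip() after removing the banned words, to get rid of any leftover whitespace.

So: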
def RemoveBannedWords(ing):
    pattern = re.compile("\\b(grated|zested|thinly|chopped)\\b|,", re.I)
    return pattern.sub("", ing).strip()
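
To make the difference between the two trailing anchors concrete, here is a minimal sketch that runs both patterns over the list from the question. The names with_W, with_b, before, and after are only illustrative and do not appear in the original post.

```python
import re

# Trailing \W needs one more (non-word) character after the banned word,
# so a banned word at the very end of the string is never matched.
with_W = re.compile(r"\b(grated|zested|thinly|chopped)\W", re.I)

# Trailing \b also matches at the end of the string; the comma is a
# separate alternative outside the group.
with_b = re.compile(r"\b(grated|zested|thinly|chopped)\b|,", re.I)

dirtylist = ["lemons zested", "grated cheddar cheese", "carrots, thinly chopped"]

before = [with_W.sub("", s) for s in dirtylist]
after = [with_b.sub("", s).strip() for s in dirtylist]

print(before)  # ['lemons zested', 'cheddar cheese', 'carrots, chopped']
print(after)   # ['lemons', 'cheddar cheese', 'carrots']
```

The first result reproduces the output reported in the question; the second matches the desired cleaner_list.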

0
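def clearList(dirtyList, bannedWords, splitChar):
    clean = []
    for dirty in dirtyList:
        # Split on splitChar and strip any comma attached to a word
        words = [w.strip(",") for w in dirty.split(splitChar)]
        # Keep only the non-empty words that are not banned
        kept = [w for w in words if w and w not in bannedWords]
        clean.append(splitChar.join(kept))

    return clean

dirtyList is the list you want to clean

bannedWords is the list of words you do not want

splitChar is the character between words (" ")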


0

The following code seems to work (a simple nested loop)

dirtylist = ["lemons zested", "grated cheddar cheese", "carrots, thinly chopped"]
bannedWords = ['grated', 'zested', 'thinly', 'chopped', ',']
result = []
for words in dirtylist:
    temp = words
    for bannedWord in bannedWords:
        temp = temp.replace(bannedWord, '')
    result.append(temp.strip())
print(result)

Output

['lemons', 'cheddar cheese', 'carrots']
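
One caveat worth keeping in mind: str.replace removes substrings, not whole words, so a banned word embedded in a longer word is removed as well. The sketch below contrasts that with a word-boundary regex; the input "unzested lemons" and the helper names clean_with_replace and clean_with_regex are hypothetical, chosen only to illustrate the point.

```python
import re

bannedWords = ['grated', 'zested', 'thinly', 'chopped', ',']

def clean_with_replace(text):
    # Same approach as the nested loop: plain substring replacement
    for banned in bannedWords:
        text = text.replace(banned, '')
    return text.strip()

# Whole-word alternative; re.escape guards entries containing regex metacharacters
words = [re.escape(w) for w in bannedWords if w != ',']
pattern = re.compile(r"\b(?:" + "|".join(words) + r")\b|,", re.I)

def clean_with_regex(text):
    return pattern.sub('', text).strip()

sample = "unzested lemons"  # hypothetical input, not from the question
print(clean_with_replace(sample))  # un lemons
print(clean_with_regex(sample))    # unzested lemons
```

For the three strings in the question both helpers give the same result, so the simpler loop is fine as long as banned words never occur inside other words.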

0
I would drop "," from the bannedWord list and use str.strip to remove it instead:
import re

dirtylist = [
    "lemons zested",
    "grated cheddar cheese",
    "carrots, thinly chopped",
]

bannedWord = ["grated", "zested", "thinly", "chopped"]

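# The non-capturing group makes both \b anchors apply to every alternative,
# not only to the first and last word
pat = re.compile(
    r"\b(?:" + "|".join(re.escape(w) for w in bannedWord) + r")\b", flags=re.I
)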

for w in dirtylist:
    print("{:<30} {}".format(w, pat.sub("", w).strip(" ,")))

Output:

lemons zested                  lemons
grated cheddar cheese          cheddar cheese
carrots, thinly chopped        carrots
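
Since the question asks for a list rather than printed rows, here is a small sketch that collects the results into cleaner_list. It also collapses the double space left behind when a banned word sits in the middle of a string, a case the question's data does not contain. The clean helper and the "cheddar grated cheese" input are assumptions added for illustration.

```python
import re

bannedWord = ["grated", "zested", "thinly", "chopped"]

pat = re.compile(
    r"\b(?:" + "|".join(re.escape(w) for w in bannedWord) + r")\b", flags=re.I
)

def clean(item):
    # split() with no arguments drops empty pieces, so re-joining
    # collapses runs of whitespace left by removed words
    collapsed = " ".join(pat.sub("", item).split())
    # Trim spaces and commas left at either end
    return collapsed.strip(" ,")

dirtylist = ["lemons zested", "grated cheddar cheese", "carrots, thinly chopped"]
cleaner_list = [clean(item) for item in dirtylist]
print(cleaner_list)  # ['lemons', 'cheddar cheese', 'carrots']

# Hypothetical input with a banned word in the middle
print(clean("cheddar grated cheese"))  # cheddar cheese
```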
