Python: bypass re.finditer matches when the search word is inside a defined expression

Question

I have a list of words (find_list) that I want to find in a text, and a list of expressions containing those words (scape_list) that I want to skip whenever they appear in the text.

With the following code, I can find all of the words in the text:

import re

find_list = ['name', 'small']
scape_list = ['small software', 'company name']

text = "My name is Klaus and my middle name is Smith. I work for a small company. The company name is Small Software. Small Software sells Software Name."

final_list = []

for word in find_list:
    # match the word only when a non-word character sits on both sides of it
    s = r'\W{}\W'.format(word)
    matches = re.finditer(s, text, (re.MULTILINE | re.IGNORECASE))

    for word_ in matches:
        final_list.append(word_.group(0))

The desired final list is:

[' name ', ' name ', ' Name.', ' small ']

Is there a way to bypass the expressions listed in scape_list and get this final_list as the result?
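
For reference, running the snippet above as posted seems to return every occurrence, including the ones inside scape_list expressions, so the list shown is the target rather than the current output. A minimal reproduction (the name all_matches is only illustrative, and re.escape is added as a precaution, it is not in the original code):

```python
import re

find_list = ['name', 'small']
text = ("My name is Klaus and my middle name is Smith. I work for a small company. "
        "The company name is Small Software. Small Software sells Software Name.")

all_matches = []
for word in find_list:
    # same pattern as in the question: the word with one non-word character on each side
    pattern = r'\W{}\W'.format(re.escape(word))
    for match in re.finditer(pattern, text, re.IGNORECASE):
        all_matches.append(match.group(0))

print(all_matches)
# [' name ', ' name ', ' name ', ' Name.', ' small ', ' Small ', ' Small ']
```

The third ' name ' and both ' Small ' entries come from "company name" and "Small Software", which are the ones that should be skipped.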

find_list and scape_list are updated all the time, so I think a regular expression is a good approach.

- Thabra

Do you need to remove duplicates? - Ahmed Yousif

No, that is just a coincidence. - Thabra

I will change the example. - Thabra

1 Answer

Content provided by Stack Overflow.

- ggaurav · Accepted Answer

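You can use a regex that also captures the word before and the word after each find_list word, and then check that neither combination is in scape_list. I have added comments where I changed the code. (If scape_list may grow large, it is better to change it to a set.)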

import re

find_list = ['name', 'small']
scape_list = ['small software', 'company name']

text = "My name is Klaus and my middle name is Smith. I work for a small company. The company name is Small Software. Small Software sells Software Name."

final_list = []

for word in find_list:
    s = r'(\w*\W)({})(\W\w*)'.format(word) # change the regex to capture adjacent words
    matches = re.finditer(s, text, (re.MULTILINE | re.IGNORECASE))

    for word_ in matches:
        # group(1) is the preceding word, group(2) the search word, group(3) the following word
        if ((word_.group(1) + word_.group(2)).strip().lower() not in scape_list
            and (word_.group(2) + word_.group(3)).strip().lower() not in scape_list): # added this condition
            final_list.append(word_.group(2)) # changed here

print(final_list)
# ['name', 'name', 'Name', 'small']
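
One limitation worth noting: the pattern above needs a non-word character on both sides of the word, so a word at the very start or end of the text is missed, and only one neighbouring word is checked, so scape_list expressions longer than two words would slip through. A possible alternative, sketched here under the assumption that every entry starts and ends with a word character, is to put the scape_list expressions first in a single alternation so that they are consumed without being captured. The helper name find_words is hypothetical, and results come back in text order rather than grouped by word.

```python
import re

def find_words(text, find_list, scape_list):
    """Return find_list words in text that are not part of a scape_list expression."""
    if not find_list:
        return []
    words = "|".join(re.escape(w) for w in find_list)
    alternatives = [r"\b({})\b".format(words)]
    if scape_list:
        # Longest expressions first so they win over shorter overlapping ones.
        phrases = sorted(scape_list, key=len, reverse=True)
        skipped = "|".join(re.escape(p) for p in phrases)
        # Placed first and left uncaptured: the match is consumed, group 1 stays None.
        alternatives.insert(0, r"\b(?:{})\b".format(skipped))
    pattern = re.compile("|".join(alternatives), re.IGNORECASE)
    return [m.group(1) for m in pattern.finditer(text) if m.group(1)]

text = ("My name is Klaus and my middle name is Smith. I work for a small company. "
        "The company name is Small Software. Small Software sells Software Name.")

print(find_words(text, ['name', 'small'], ['small software', 'company name']))
# ['name', 'name', 'small', 'Name']
```

Because everything is compiled into one pattern, the text is scanned once no matter how many entries the two lists hold, which may help if they keep growing.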