Python regex to match a pattern that is not surrounded by double quotes

Question

3

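I'm not very familiar with regular expressions, so I need your help with the following problem, which looks a bit tricky.

Suppose I have the following string: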

string = 'keyword1 keyword2 title:hello title:world "title:quoted" keyword3'

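How can I use a regular expression to extract title:hello and title:world, remove them from the original string, and keep "title:quoted", which is surrounded by double quotes?

I have seen this similar SO answer, and here is what I came up with: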

import re

string = 'keyword1 keyword2 title:hello title:world "title:quoted" keyword3'

def replace(m):
    if m.group(1) is None:
        return m.group()

    return m.group().replace(m.group(1), "")

regex = r'\"[^\"]title:[^\s]+\"|([^\"]*)'
cleaned_string = re.sub(regex, replace, string)

assert cleaned_string == 'keyword1 keyword2 "title:quoted" keyword3'

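Of course, it doesn't work, which doesn't surprise me, because regular expressions are baffling to me.

Thanks for your help!

Final solution

Thanks for your answers. Here is the final solution, which fits my needs: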

import re
matches = []

def replace(m):
    matches.append(m.group())
    return ""

string = 'keyword1 keyword2 title:hello title:world "title:quoted" keyword3'
regex = r'(?<!")title:[^\s]+(?!")'
cleaned_string = re.sub(regex, replace, string)

# remove extra whitespace
cleaned_string = ' '.join(cleaned_string.split())

assert cleaned_string == 'keyword1 keyword2 "title:quoted" keyword3'
assert matches[0] == "title:hello"
assert matches[1] == "title:world"

- Agate

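So you want to match [keyword1, keyword2, title:hello, title:world, keyword3]? - PepperoniPizza

Actually, I want to match [title:hello, title:world] and remove them from the string. - Agate

There is a very simple regex for this, similar to the question about a regex pattern to match except when... Give me a moment to write up an answer. :) - zx81

OK, FYI, I have added an explanation of the regex and an online demo. - zx81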

4 Answers

3

This situation sounds very similar to "regex: match a pattern, except when...".

We can solve it with a beautifully simple regex:

"[^"]*"|(\btitle:\S+)

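The left side of the alternation | matches complete "double-quoted strings". We will ignore these matches. The right side matches and captures your title:hello strings in Group 1, and we know they are the right ones because they were not matched by the expression on the left.

This program shows how to use the regex (see the results at the bottom of the online demo):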

import re
subject = 'keyword1 keyword2 title:hello title:world "title:quoted" keyword3'
regex = re.compile(r'"[^"]*"|(\btitle:\S+)')
def myreplacement(m):
    if m.group(1):
        return ""
    else:
        return m.group(0)
replaced = regex.sub(myreplacement, subject)
print(replaced)
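
The question also asks for the removed tokens themselves, not only the cleaned string. A minimal sketch of how the same alternation could do both, assuming whitespace may be collapsed afterwards; `extract_and_remove` is a hypothetical helper name, not part of the original answer:

```python
import re

# Left branch: a complete double-quoted string (matched, then kept as is).
# Right branch: an unquoted title:... token, captured in group 1.
PATTERN = re.compile(r'"[^"]*"|(\btitle:\S+)')

def extract_and_remove(text):
    """Return (cleaned_text, removed_tokens) for unquoted title:... tokens."""
    removed = []

    def replacement(m):
        if m.group(1):                 # right branch: an unquoted token
            removed.append(m.group(1))
            return ""
        return m.group(0)              # left branch: keep the quoted string

    cleaned = PATTERN.sub(replacement, text)
    # Collapse the gaps left by removed tokens. Note that this also collapses
    # repeated whitespace inside quoted strings.
    return " ".join(cleaned.split()), removed

subject = 'keyword1 keyword2 title:hello title:world "title:quoted" keyword3'
cleaned, removed = extract_and_remove(subject)
print(cleaned)   # keyword1 keyword2 "title:quoted" keyword3
print(removed)   # ['title:hello', 'title:world']
```

Collecting inside the replacement callback keeps everything in a single pass over the string.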

Reference

How to match (or replace) a pattern except in situations s1, s2, s3...

- zx81

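Thank you for the explanation, it was very useful to me. I did not mark your answer as accepted because I used @alecxe's answer, but your approach seems to work as well. I read the answer you linked, and it is really great work! - Agate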

1

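re.sub(r'[^"]title:\w+', "", string)
keyword1 keyword2 "title:quoted" keyword3

This replaces every substring made of title: followed by one or more word characters (\w+) with an empty string, as long as the character just before it is not a double quote.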

- Padraic Cunningham

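Sorry, this answer doesn't work: it outputs 'keyword1 keyword2 "" keyword3'. - Agate

It works on the example you provided. What do you want it to do? - Padraic Cunningham

It didn't work on the example I provided. The expected output is keyword1 keyword2 "title:quoted" keyword3, but with your snippet I get 'keyword1 keyword2 "" keyword3'. - Agate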
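
One property of the pattern in this answer, as written above, may be worth illustrating: [^"] consumes the character before title:, so a token at the very start of the string has no preceding character to match and is skipped. The sketch below compares it with a negative lookbehind, which checks the previous character without consuming it. The input string and variable names are made up for illustration:

```python
import re

consuming = re.compile(r'[^"]title:\w+')     # needs (and eats) one preceding character
lookbehind = re.compile(r'(?<!")title:\w+')  # only inspects the preceding character

s = 'title:first keyword1 title:hello "title:quoted"'

def squeeze(text):
    """Collapse runs of whitespace so the two results are easy to compare."""
    return " ".join(text.split())

with_consuming = squeeze(consuming.sub("", s))
with_lookbehind = squeeze(lookbehind.sub("", s))

print(with_consuming)   # title:first keyword1 "title:quoted"
print(with_lookbehind)  # keyword1 "title:quoted"
```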

0

A bit brute-force, but it works in all cases and without catastrophic backtracking:

import re

string = r'''keyword1 keyword2 title:hello title:world "title:quoted"title:foo
       "abcd \" title:bar"title:foobar keyword3 keywordtitle:keyword
       "non balanced quote title:foobar'''

pattern = re.compile(
    r'''(?:
            (      # other content
                (?:(?=(
                    " (?:(?=([^\\"]+|\\.))\3)* (?:"|$) # quoted content
                  |
                    [^t"]+             # all that is not a "t" or a quote
                  |
                    \Bt                # "t" preceded by word characters
                  |
                    t (?!itle:[a-z]+)  # "t" not followed by "itle:" + letters 
                )  )\2)+
            )
          |     # OR
            (?<!") # not preceded by a double quote
        )
        (?:\btitle:[a-z]+)?''',
    re.VERBOSE)

print(re.sub(pattern, r'\1', string))

- Casimir et Hippolyte

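Thanks for your answer, but it is a bit too long for my needs. Also, even though I used title:something in the question, it could be something else, such as url:domain.com or content:text. So matching on the first letter of the word title doesn't really suit me. - Agate

@EliotBerriot: I understand, but you can easily build the pattern from the words you want (extracting the first letter of the word and using placeholders is not difficult). If a pattern is long, don't assume that it will be slow. - Casimir et Hippolyte

You are absolutely right, but it is not as convenient as the other answers. - Agate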
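
The comments above mention that the prefix could be url: or content: instead of title:. A rough sketch of building a pattern from a list of prefixes, using the simpler quoted-string alternation rather than the long pattern in this answer. `build_pattern` and `strip_tokens` are hypothetical helper names, and \S+ is assumed for the value so that url:domain.com is matched in full:

```python
import re

def build_pattern(prefixes):
    """Compile a regex for unquoted prefix:value tokens (illustrative helper)."""
    # re.escape guards against prefixes that contain regex metacharacters.
    alternatives = "|".join(re.escape(p) for p in prefixes)
    # Quoted strings are matched first and skipped; tokens land in group 1.
    return re.compile(r'"[^"]*"|(\b(?:%s):\S+)' % alternatives)

def strip_tokens(text, prefixes):
    """Return (cleaned_text, found_tokens) for the given prefixes."""
    pattern = build_pattern(prefixes)
    found = []

    def replacement(m):
        if m.group(1):
            found.append(m.group(1))
            return ""
        return m.group(0)

    cleaned = " ".join(pattern.sub(replacement, text).split())
    return cleaned, found

text = 'keyword1 url:domain.com "content:quoted" content:text title:hello'
cleaned, found = strip_tokens(text, ["title", "url", "content"])
print(cleaned)  # keyword1 "content:quoted"
print(found)    # ['url:domain.com', 'content:text', 'title:hello']
```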

- alecxe · Accepted Answer

You can check for word boundaries (\b):

>>> s = 'keyword1 keyword2 title:hello title:world "title:quoted" keyword3'
>>> re.sub(r'\btitle:\w+\b', '', s, re.I)
'keyword1 keyword2   "title:quoted" keyword3'
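
One detail is worth checking in the snippet above. The signature is re.sub(pattern, repl, string, count=0, flags=0), so a fourth positional argument is taken as count, not flags. Since re.I has the integer value 2, only the first two matches are replaced. The sketch below is not part of the original answer. It reuses the same string and pattern to show what happens once the flag is passed by keyword, because \b also matches between a double quote and the letter t:

```python
import re

s = 'keyword1 keyword2 title:hello title:world "title:quoted" keyword3'
pattern = r'\btitle:\w+\b'

# Equivalent to passing re.I as the fourth positional argument:
# count=2, so the substitution stops after the first two matches.
limited = re.sub(pattern, '', s, count=int(re.I))

# Flag passed by keyword: every match is replaced, including the quoted one.
all_matches = re.sub(pattern, '', s, flags=re.I)

print(repr(limited))      # 'keyword1 keyword2   "title:quoted" keyword3'
print(repr(all_matches))  # 'keyword1 keyword2   "" keyword3'
```

So the word-boundary version keeps the quoted token here only because of the accidental count limit. The lookaround version does not depend on it.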

Or you can use a negative lookbehind and a negative lookahead to check that there are no quotes around title:\w+:

>>> re.sub(r'(?<!")title:\w+(?!")', '', s)
'keyword1 keyword2   "title:quoted" keyword3'
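
The lookarounds only inspect the characters immediately next to the token. As a hedged illustration, here is a made-up input in which the token sits in the middle of a quoted phrase, compared against the quoted-string alternation from another answer in this thread. `drop_unquoted` is a hypothetical callback name:

```python
import re

lookaround = re.compile(r'(?<!")title:\w+(?!")')
alternation = re.compile(r'"[^"]*"|(\btitle:\w+)')

def drop_unquoted(m):
    """Remove tokens captured in group 1; keep whole quoted strings."""
    return "" if m.group(1) else m.group(0)

def squeeze(text):
    """Collapse runs of whitespace for easier comparison."""
    return " ".join(text.split())

s = 'keyword1 "see title:inner here" title:outer'

via_lookaround = squeeze(lookaround.sub("", s))
via_alternation = squeeze(alternation.sub(drop_unquoted, s))

print(via_lookaround)   # keyword1 "see here"
print(via_alternation)  # keyword1 "see title:inner here"
```

If tokens can only appear directly between the quotes, as in the question's example, the lookaround version is enough.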