Regex: don't match a character if it is inside quotes

Question

12

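Disclaimer: I have read this answer many times here on SO, and I know I should not use regex to parse HTML. This question is only meant to broaden my knowledge of regex.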

Say I have this string:

some text <tag link="fo>o"> other text

I want to match the whole tag, but if I use <[^>]+> it only matches <tag link="fo>.
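To make the problem concrete, here is a minimal sketch that reproduces it. The variable names (`input`, `naive`, `naiveMatch`) are just for illustration.

```javascript
// Sample string from the question.
const input = 'some text <tag link="fo>o"> other text';

// The naive pattern: "<", then one or more non-">" characters, then ">".
// It knows nothing about quotes, so it stops at the first ">" it sees.
const naive = /<[^>]+>/;

const naiveMatch = input.match(naive);
console.log(naiveMatch[0]); // <tag link="fo>
```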

How can I make sure that a > inside quotes is ignored?

I could easily write a parser with a while loop to do this, but I want to know how to do it with a regex.
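For comparison, the loop-based approach mentioned above might look roughly like this. It is only a sketch: `findTagEnd` is a hypothetical helper name, and it assumes attribute values are delimited by ' or " with no escaping.

```javascript
// Hypothetical helper: returns the index of the ">" that closes the tag
// opening at `start`, skipping any ">" inside '...' or "...".
// Returns -1 if no closing ">" is found.
function findTagEnd(text, start) {
  let quote = null; // the quote character we are currently inside, if any
  let i = start + 1;
  while (i < text.length) {
    const ch = text[i];
    if (quote !== null) {
      if (ch === quote) quote = null; // closing quote
    } else if (ch === '"' || ch === "'") {
      quote = ch; // opening quote
    } else if (ch === '>') {
      return i; // first ">" outside quotes
    }
    i++;
  }
  return -1;
}

const sample = 'some text <tag link="fo>o"> other text';
const open = sample.indexOf('<');
const close = findTagEnd(sample, open);
console.log(sample.slice(open, close + 1)); // <tag link="fo>o">
```

The regex answers below encode the same idea: treat a quoted run as a single unit, and only accept a > that appears outside such a run.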

- steve

4 Answers

1

This is a slight improvement on Vasili Syrakis' answer. It handles "…" and '…' completely separately, and does not use the *? quantifier.

Regular expression

<[^'">]*(("[^"]*"|'[^']*')[^'">]*)*>

Demo

http://regex101.com/r/jO1oQ1

Explanation

<                    # start of HTML tag
    [^'">]*          #   any non-single, non-double quote or greater than
    (                #   outer group
        (            #     inner group
            "[^"]*"  #       "..."
        |            #      or
            '[^']*'  #       '...'
        )            #
        [^'">]*      #   any non-single, non-double quote or greater than
    )*               #   zero or more of outer group
>                    # end of HTML tag

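This version is slightly better than Vasili's in that single quotes are allowed inside "..." and double quotes are allowed inside '...', and in that a (malformed) tag such as <a href='> will not be matched.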

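It is slightly worse than Vasili's solution in that the groups are captured. If you do not want the capture groups, replace ( with (?: everywhere. (Using plain ( only makes the regex shorter and a little more readable.)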
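Here is a small sketch of the non-capturing form in JavaScript. The names `tagRe` and `cases` are illustrative, and the second and third test strings are examples added here to exercise the two claims above; they are not from the original answer.

```javascript
// Non-capturing form of the regex above: every "(" replaced with "(?:".
const tagRe = /<[^'">]*(?:(?:"[^"]*"|'[^']*')[^'">]*)*>/g;

const cases = [
  'some text <tag link="fo>o"> other text', // ">" inside double quotes
  `<a title="it's > ok">`,                  // single quote inside "..."
  "<a href='>",                             // unterminated quote
];

for (const c of cases) {
  // With the g flag, match() returns an array of matches, or null.
  console.log(c.match(tagRe));
}
// [ '<tag link="fo>o">' ]
// [ `<a title="it's > ok">` ]
// null
```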

- zrajm

0

(<.+?>[^<]+>)|(<.+?>)

You can write two regular expressions and combine them with "|", for example:

(<.+?>[^<]+>)   #will match  some text <tag link="fo>o"> other text
(<.+?>)         #will match  some text <tag link="foo"> other text

If the first alternative matches, the second one is not tried, so make sure to put the special case first.
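A quick sketch of how the alternation order plays out on the two strings above. The variable names (`combined`, `special`, `plain`) are illustrative.

```javascript
// Special case first, general case second.
const combined = /(<.+?>[^<]+>)|(<.+?>)/;

const special = 'some text <tag link="fo>o"> other text'.match(combined);
const plain = 'some text <tag link="foo"> other text'.match(combined);

// Group 1 is set when the first alternative matched, group 2 when the second did.
console.log(special[1]); // <tag link="fo>o">
console.log(plain[2]);   // <tag link="foo">
```

Note that the first alternative assumes the stray > is followed by the real closing > before any other <. On inputs that do not fit that shape it can match more than a single tag, so treat it as a heuristic for this particular string.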

- 宏杰李

0

If you want it to cope with escaped double quotes, try this:

/>(?=((?:[^"\\]|\\.)*"([^"\\]|\\.)*")*([^"\\]|\\.)*$)/g

For example:

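// Assumes `xml` holds the input string being scanned.
const gtExp = />(?=((?:[^"\\]|\\.)*"([^"\\]|\\.)*")*([^"\\]|\\.)*$)/g;
// Index of the next ">" outside double quotes, searching from gtExp.lastIndex; -1 if none.
const nextGtMatch = () => {
    const m = gtExp.exec(xml);
    return m ? m.index : -1;
};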

If you are parsing a larger chunk of XML, you will need to set .lastIndex so the search starts at the current position.

gtExp.lastIndex = xmlIndex;
const attrEndIndex = nextGtMatch(); // the end of the tag's attributes
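Putting the pieces together, a self-contained sketch might look like the following. The input string and the names `doc`, `gtOutsideQuotes`, `tagStart`, `hit`, and `tagEnd` are assumptions for illustration. The lookahead only accepts a > when an even number of unescaped double quotes follows it up to the end of the input, so it assumes the double quotes in the rest of the input are balanced, and it does not treat single quotes specially.

```javascript
const doc = 'some text <tag link="fo>o"> other text';

// Matches a ">" only if the rest of the input contains an even number of
// unescaped double quotes, i.e. the ">" is not inside a "..." run.
const gtOutsideQuotes = />(?=((?:[^"\\]|\\.)*"([^"\\]|\\.)*")*([^"\\]|\\.)*$)/g;

const tagStart = doc.indexOf('<');
gtOutsideQuotes.lastIndex = tagStart; // begin the search at the tag itself
const hit = gtOutsideQuotes.exec(doc);
const tagEnd = hit ? hit.index : -1;

console.log(doc.slice(tagStart, tagEnd + 1)); // <tag link="fo>o">
```

Because the lookahead scans to the end of the input for every candidate >, this can get slow on large documents.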

- qel

- Vasili Syrakis · Accepted Answer

Regular expression:

<[^>]*?(?:(?:('|")[^'"]*?\1)[^>]*?)*>

Online demo:

http://regex101.com/r/yX5xS8

Full explanation:

I know this regex can be a headache to look at, so here is my explanation:

<                      # Open HTML tags
    [^>]*?             # Lazy Negated character class for closing HTML tag
    (?:                # Open Outside Non-Capture group
        (?:            # Open Inside Non-Capture group
            ('|")      # Capture group for quotes, backreference group 1
            [^'"]*?    # Lazy Negated character class for quotes
            \1         # Backreference 1
        )              # Close Inside Non-Capture group
        [^>]*?         # Lazy Negated character class for closing HTML tag
    )*                 # Close Outside Non-Capture group
>                      # Close HTML tags
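As a usage sketch, the regex above can be applied in JavaScript like this. The sample string adds a second, plain tag to the one from the question, and the names `quotedTag`, `text`, and `tags` are illustrative.

```javascript
const quotedTag = /<[^>]*?(?:(?:('|")[^'"]*?\1)[^>]*?)*>/g;

const text = 'some text <tag link="fo>o"> other <b> text';

// matchAll yields one match object per tag; m[0] is the full matched tag.
const tags = Array.from(text.matchAll(quotedTag), (m) => m[0]);
console.log(tags); // [ '<tag link="fo>o">', '<b>' ]
```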