Variable-length lookbehind in regular expressions

Question

4

My regular expression is as follows:

(?<![\s]*?(\"|&quot;)")WORD(?![\s]*?(\"|&quot;))

As you can see, I'm trying to match every occurrence of WORD except those inside "quotation marks". So...

WORD <- Find this
"WORD" <- Don't find this
"   WORD   " <- Also don't find this, even though not touching against marks
&quot;WORD&quot;  <- Don't find this (I check both &quot; and " so it still works after htmlspecialchars)

I believe my regex would be perfect, if only I weren't getting the following error:

Compilation failed: lookbehind assertion is not fixed length

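Given the restrictions on lookbehind, is there another way to achieve what I'm after?

If you can think of any other approach, please let me know.

Many thanks,

Matthew

P.S. The WORD part will actually contain John Gruber's URL detector.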

- mrmrw

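@Qtax raises a good question: are you trying to replace the detected word with something else? - FrankieTheKneeMan

Also, I think the regex you were trying to build is more like: (?<!("|&quot;)\s*)WORD(?![\s]*(\"|&quot;)). There's no need to make the whitespace match lazy, and you had flipped the whitespace and quote terms in the lookbehind. - FrankieTheKneeMan

2 Answers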

1

I'd suggest removing the quoted strings first, then searching what remains.

$noSubs = preg_replace('/(["\']|&quot;)(\\\\\1|(?!\1).)*\1/', '', $target);
$n = preg_match_all('/\bWORD\b/', $noSubs, $matches);

The regex I used above to strip the quoted strings treats &quot;, " and ' as separate string delimiters. For any single delimiter, your regex would look more like this:

/"(\\"|[^"])*"/

So, if you want to treat &quot; as equivalent to ":

/("|&quot;)(\\("|&quot;)|(?!&quot;)[^"])*("|&quot;)/i

If you also want to handle single-quoted strings (assuming there are no words with apostrophes):

/("|&quot;)(\\("|&quot;)|(?!&quot;)[^"])*("|&quot;)|'(\\'|[^'])*'/i

Be careful with the escaping when you put these into PHP strings.
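As a rough sketch of what that escaping looks like in practice (the sample inputs and variable names are illustrative, modelled on the question rather than taken from it): in a single-quoted PHP string, `\\` produces one backslash, so the regex-level `\\` has to be typed as `\\\\`.

```php
<?php
// Intended regex: /("|&quot;)(\\("|&quot;)|(?!&quot;)[^"])*("|&quot;)/i
// Inside single quotes, '\\\\' yields the two characters \\,
// which PCRE then reads as one literal backslash.
$quoted = '/("|&quot;)(\\\\("|&quot;)|(?!&quot;)[^"])*("|&quot;)/i';

// Illustrative inputs, one per case described in the question.
$samples = [
    'WORD',
    '"WORD"',
    '"   WORD   "',
    '&quot;WORD&quot;',
];

$counts = [];
foreach ($samples as $sample) {
    // Strip the quoted strings first, then look for WORD in what is left.
    $stripped = preg_replace($quoted, '', $sample);
    $counts[] = preg_match_all('/\bWORD\b/', $stripped);
}
```

Only the first, unquoted sample should still contain WORD after stripping; the other three are removed whole.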

Edit

Qtax mentioned that you may be trying to replace the matched WORD. In that case, you can easily tokenize the string with this regex:

/("|&quot;)(\\("|&quot;)|(?!&quot;)[^"])*("|&quot;)|((?!"|&quot;).)+/i

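This splits the text into quoted strings and unquoted segments. Run the replacement only on the unquoted segments, building up a new string as you go.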

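//The "s" modifier lets "." match newlines, so unquoted text spanning several lines is not dropped.
$tokenizer = '/("|&quot;)(\\\\("|&quot;)|(?!&quot;)[^"])*("|&quot;)|((?!"|&quot;).)+/is';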
$hasQuote = '/"|&quot;/i';
$word = '/\bWORD\b/';
$replacement = 'REPLACEMENT';
$n = preg_match_all($tokenizer, $target, $matches, PREG_SET_ORDER);
$newStr = '';
if ($n === false) {
    /* Print error Message */
    die();
}
foreach($matches as $match){
    if(preg_match($hasQuote, $match[0])){
        //If it has a quote, it's a quoted string.
        $newStr .= $match[0];
    } else {
        //Otherwise, run the replace.
        $newStr .= preg_replace($word, $replacement, $match[0]);
    }
}

//Now $newStr has your replaced String.  Return it from your function, or print it to
//your page.
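
The same idea can also be sketched with preg_replace_callback instead of an explicit loop. This is a variant, not the code from the answer: the function name replaceOutsideQuotes and the sample input are hypothetical. One difference worth noting is that preg_replace_callback leaves any text the tokenizer does not match (for example a stray, unbalanced quote) in place rather than dropping it.

```php
<?php
// Hypothetical helper: applies the regex $word => $replacement everywhere
// except inside "..." or &quot;...&quot; strings.
function replaceOutsideQuotes(string $target, string $word, string $replacement): string
{
    // Tokenizer pattern from the answer, with the "s" modifier so that "."
    // also matches newlines.
    $tokenizer = '/("|&quot;)(\\\\("|&quot;)|(?!&quot;)[^"])*("|&quot;)|((?!"|&quot;).)+/is';

    $result = preg_replace_callback(
        $tokenizer,
        function (array $match) use ($word, $replacement) {
            // A token containing a quote delimiter is a quoted string: keep it.
            if (preg_match('/"|&quot;/i', $match[0])) {
                return $match[0];
            }
            // Otherwise it is unquoted text: run the replacement on it.
            return preg_replace($word, $replacement, $match[0]);
        },
        $target
    );

    if ($result === null) {
        throw new RuntimeException('Tokenizing failed, PCRE error code ' . preg_last_error());
    }
    return $result;
}

$out = replaceOutsideQuotes('WORD "WORD" &quot; WORD &quot; WORD', '/\bWORD\b/', 'X');
```

With this input, only the first and last WORD sit outside quotes, so only those two are replaced.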

- FrankieTheKneeMan

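That's one way to do it, but it won't work if you want to replace them with something else. By the way, I think '\\\1' has the value \\1, which PCRE interprets as a literal backslash followed by a 1. Perhaps you meant '(?:\\\\.|(?!\1).)*'. - Qtax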

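I don't think so - in PHP, single-quoted strings don't allow escape sequences other than \'. But I'm really not sure. Let me check the docs... (time passes)... You're right. http://www.php.net/manual/en/language.types.string.php#language.types.string.syntax.single covers both '\'' and '\\'. I'll update it right away. - FrankieTheKneeMan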
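
The escaping rule the two comments settle on can be checked directly. A small sketch, with illustrative variable names: in single-quoted PHP strings only \\ and \' are escape sequences, so three backslashes before the 1 collapse to two characters of backslash, while five collapse to three.

```php
<?php
// Only \\ and \' are escapes in single-quoted strings; any other
// backslash is kept as written.
$threeSlashes = '\\\1';    // \\ -> \, then \1 is kept: the characters \ \ 1
$fiveSlashes  = '\\\\\1';  // \\ -> \, \\ -> \, then \1: the characters \ \ \ 1

$lenThree = strlen($threeSlashes);
$lenFive  = strlen($fiveSlashes);

// As a regex, \\1 is a literal backslash followed by the digit 1,
// not a backreference. The subject here is the two characters \1.
$literal = preg_match('/^' . $threeSlashes . '$/', '\\' . '1');

// As a regex, \\\1 is a literal backslash followed by backreference 1.
// The subject here is the three characters "\".
$backref = preg_match('/^(")' . $fiveSlashes . '$/', '"\\"');
```

Both preg_match calls should succeed, which is consistent with the five-backslash form being the one that matches an escaped delimiter.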

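@Qtax - Do you mean if you want to replace the detected words with something else? You could easily build a more procedural approach for that. Personally, I'd break the string into "quoted" and "unquoted" substrings, then run the replacement on the "unquoted" bits. If that's the use case, let us know, mrmrw. - FrankieTheKneeMan

Awesome, thanks very much. I'm testing this in my project now, and I'll let you know how it goes once it's up and running. - mrmrw

Thank you very much for your answer - I think in this case I'll go with Tim's answer because, although I have a lot of experience with HTML/CSS/JS/PHP, I'm still fairly new to regular expressions, and Tim's answer is probably easier for me to follow. But again, thank you, thank you, thank you for your answer. This is an amazing place and you are an amazing person. I intend to use my spare time to give back to stackoverflow as often as I can. So thanks again. - mrmrw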

- Tim Pietzcker · Accepted Answer

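I'd suggest a different approach. It works as long as the quotes are correctly balanced, because then an odd number of quotes after the current position tells you that you are inside a quoted string, which makes the lookbehind part unnecessary: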

if (preg_match(
'/WORD             # Match WORD
(?!                # unless it\'s possible to match the following here:
 (?:               # a string of characters
  (?!&quot;)       # that contains neither &quot;
  [^"]             # nor "
 )*                # (any length),
 ("|&quot;)        # followed by either " or &quot; (remember which in \1)
 (?:               # Then match
  (?:(?!\1).)*\1   # any string except our quote char(s), followed by that quote char(s)
  (?:(?!\1).)*\1   # twice,
 )*                # repeated any number of times --> even number
 (?:(?!\1).)*      # followed only by strings that don\'t contain our quote char(s)
 $                 # until the end of the string
)                  # End of lookahead/sx', 
$subject))