PHP preg_match_all limit

Question

regex, preg-match, preg-match-all, php

4

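I'm using preg_match_all with a very long pattern.

When I run the code, I get this error:

Warning: preg_match_all(): Compilation failed: regular expression is too large at offset 707830

After searching, I found a suggested solution: increase the pcre.backtrack_limit and pcre.recursion_limit values in php.ini.

However, even after increasing those values and restarting Apache, I still get the same error. My PHP version is 5.3.8.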

- Ahmad

9

Please post the regex you are using. - Narendra Yadala

3 Answers

7

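Increasing the PCRE backtrack and recursion limits may fix the problem, but it will still fail once your data grows past the new limits. (It does not scale to larger inputs.)

Example: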

<?php 
// essential for huge PCREs
ini_set("pcre.backtrack_limit", "23001337");
ini_set("pcre.recursion_limit", "23001337");
// imagine your PCRE here...
?>

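To really solve the underlying problem, you must optimize your expression and (if possible) split a complex expression into "parts", moving some of the logic into PHP. I hope the example conveys the idea: instead of trying to find the substructure directly with a single PCRE, it demonstrates a more "iterative" approach that digs deeper into the structure using PHP. Example: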

<?php
$html = file_get_contents("huge_input.html");

// first find all tables, and work on those later
$res = preg_match_all("!<table.*>(?P<content>.*)</table>!isU", $html, $table_matches);

if ($res) foreach($table_matches['content'] as $table_match) {  

    // now find all cells in each table that was found earlier ..
    $res = preg_match_all("!<td.*>(?P<content>.*)</td>!isU", $table_match, $cell_matches);

    if ($res) foreach($cell_matches['content'] as $cell_match) {

        // imagine going deeper and deeper into the structure here...
        echo "found a table cell! content: ", $cell_match;

    }    
}

- Kaii

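Actually, in my case the pattern itself is very long. I have a list of blocked sites separated by |, for example sex.com | porn.com | bad.com. Your solution looks good. After I split the pattern into smaller parts, it works fine :) Thanks Kaii - Ahmad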
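The splitting Ahmad describes could look roughly like the following. This is only a sketch, not his actual code: the helper names `buildChunkedPatterns` and `findBlockedSites` and the chunk size are assumptions made for illustration, and the type hints assume PHP 7 or later (the asker was on 5.3.8).

```php
<?php
// Hypothetical helper: turn one long block list into several small patterns,
// each an alternation of at most $chunkSize literal site names.
function buildChunkedPatterns(array $blockedSites, int $chunkSize = 500): array
{
    $patterns = [];
    foreach (array_chunk($blockedSites, $chunkSize) as $chunk) {
        // Escape regex metacharacters (such as ".") and the "/" delimiter.
        $quoted = array_map(function ($site) {
            return preg_quote($site, '/');
        }, $chunk);
        $patterns[] = '/(?:' . implode('|', $quoted) . ')/i';
    }
    return $patterns;
}

// Hypothetical helper: run every small pattern and collect all matches.
function findBlockedSites(string $text, array $patterns): array
{
    $found = [];
    foreach ($patterns as $pattern) {
        if (preg_match_all($pattern, $text, $matches)) {
            $found = array_merge($found, $matches[0]);
        }
    }
    return $found;
}

$blocked  = ['sex.com', 'porn.com', 'bad.com'];
$patterns = buildChunkedPatterns($blocked, 2); // tiny chunk size for the demo
print_r(findBlockedSites('visit bad.com or good.com', $patterns));
```

The chunk size has to be small enough that each compiled pattern stays under the PCRE limit; the right value depends on the length of the entries and on how PCRE was built.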

4

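I am writing this answer because I ran into the same problem. As Alan Moore pointed out, adjusting the backtrack and recursion limits does not solve it.

The error occurs when a needle (the pattern) exceeds the maximum needle size allowed by the underlying pcre library. It is not caused by PHP but by the pcre library itself. It is error message #20, which is defined here: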

https://github.com/php/.../pcre_compile.c#L477

On failure, PHP simply prints the error text it receives from the pcre library.
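Because the text arrives as an ordinary PHP warning, it can be captured instead of printed. A minimal sketch, assuming PHP 7 or later and no custom error handler installed; `pcreCompileError` is a hypothetical helper name, not part of PHP:

```php
<?php
// Hypothetical helper: returns the warning text PHP emitted while trying to
// use $pattern, or null when the pattern could be compiled and run.
function pcreCompileError(string $pattern): ?string
{
    error_clear_last();
    // preg_match() returns false on failure; "@" hides the warning,
    // but error_get_last() still records it.
    if (@preg_match($pattern, '') !== false) {
        return null;
    }
    $err = error_get_last();
    return $err !== null ? $err['message'] : 'unknown PCRE error';
}

var_dump(pcreCompileError('/abc/')); // a valid pattern
var_dump(pcreCompileError('/(/'));   // an invalid pattern: unbalanced parenthesis
```

Calling the helper with an oversized pattern should return the "regular expression is too large" text from the question, although the exact size at which that happens depends on the pcre build.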

In my case, the error appeared when I tried to use previously captured fragments as needles and they were larger than 32 kbytes.

This is easy to test with the following simple script, run from the php cli.

<?php
// This script demonstrates the above error and dumps an info
// when the needle is too long or with 64k iterations.

$expand=$needle="_^b_";
while( ! preg_match( $needle, "Stack Exchange Demo Text" ) )
{
    // Die after 64 kbytes of accumulated chunk needle
    // Adjust to 32k for a better illustration
    if ( strlen($expand) > 1024*64 ) die();

    if ( $expand == "_^b_" ) $expand = "";
    $expand .= "a";
    $needle = '_^'.$expand.'_ism'; // build the needle from the growing $expand string

    echo strlen($needle)."\n";

}
?>

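To fix this error, either reduce the size of the resulting search pattern or, if everything needs to be captured, use multiple preg_match calls with the additional offset parameter.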

<?php
    if ( 
        preg_match( 
            '/'.preg_quote( 
                    substr( $big_chunk, 0, 20*1024 ), // 1st 20k chars
                    '/' // also escape the "/" delimiter
                ) 
                .'.*?'. 
                preg_quote( 
                    substr( $big_chunk, -5 ), // last 5
                    '/'
                ) 
            .'/', 
            $subject 
        ) 
    ) { 
        // do stuff
    }

    // Attempt to match all needles in the text
    if ( preg_match( 
            $needle_of_1st_32kbytes_chunk, 
            $subj, $matches, $flags = 0, 
            $offset = 32*1024*0 // Offset -> 0
        )
        && preg_match( 
            $needle_of_2nd_32kbytes_chunk, 
            $subj, $matches, $flags = 0, 
            $offset = 32*1024*1 // Offset -> 32k
        )
        // && ... as many preg matches as needed
    ) {
        // do stuff
    }

    // it would be nicer to put the needles in a foreach loop iterating
    // over the existing chunks 
?>
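The foreach loop suggested in the closing comment might be sketched like this. It is an illustration under assumptions, not derRaphael's code: `containsChunksInOrder` is a hypothetical helper, the 20 kbyte chunk size is arbitrary, and the function only checks that the chunks occur one after another in the subject (it does not require them to be adjacent). It assumes PHP 7 or later.

```php
<?php
// Hypothetical helper: true when every chunk of $bigChunk is found in
// $subject, each one starting at or after the end of the previous match.
function containsChunksInOrder(string $bigChunk, string $subject, int $chunkSize = 20480): bool
{
    $offset = 0;
    foreach (str_split($bigChunk, $chunkSize) as $chunk) {
        // Each chunk becomes its own, much smaller, literal pattern.
        $pattern = '/' . preg_quote($chunk, '/') . '/';
        if (!preg_match($pattern, $subject, $m, PREG_OFFSET_CAPTURE, $offset)) {
            return false; // no match, or the pattern failed to compile
        }
        // Continue searching right after the end of this match.
        $offset = $m[0][1] + strlen($m[0][0]);
    }
    return true;
}

var_dump(containsChunksInOrder('hello world', 'say hello world now', 4));
var_dump(containsChunksInOrder('abcd', 'cdab', 2));
```

For purely literal needles like these, a plain string search such as strpos would avoid the pattern size limit altogether; the regex form is kept here only to mirror the answer.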

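You get the idea.

Although this answer is somewhat late, I hope it still helps people who run into this problem without finding a good explanation.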

- derRaphael

- Alan Moore · Accepted Answer

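This error is not about the performance of the regex, it is about the regex itself. Changing pcre.backtrack_limit and pcre.recursion_limit has no effect, because the regex never gets a chance to run. The problem is that the regex is too big, and the solution is to make it smaller, much smaller.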
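For a block list like the one in the question, one way to make the regex "much smaller" is to stop putting the list into the pattern at all. The sketch below swaps in a different technique from anything shown in the thread: a short, fixed pattern pulls out host-like tokens and a PHP array lookup decides whether each one is blocked. The function name `findBlockedHosts` and the host pattern are assumptions for illustration, and the code assumes PHP 7 or later.

```php
<?php
// Hypothetical helper: find host-like tokens with a small fixed regex,
// then check each one against the block list with an array lookup.
function findBlockedHosts(string $text, array $blockedSites): array
{
    // Keys of this array act as a set of lower-cased blocked names.
    $blockedSet = array_flip(array_map('strtolower', $blockedSites));

    // Simplified host pattern (an assumption, not a full hostname grammar).
    preg_match_all('/\b(?:[a-z0-9-]+\.)+[a-z]{2,}\b/i', $text, $matches);

    $found = [];
    foreach ($matches[0] as $host) {
        if (isset($blockedSet[strtolower($host)])) {
            $found[] = $host;
        }
    }
    return $found;
}

print_r(findBlockedHosts('Visit bad.com, not example.org', ['sex.com', 'porn.com', 'bad.com']));
```

The pattern size no longer depends on the length of the list. Note that this version compares whole hosts only, so www.bad.com would not be caught by an entry for bad.com without extra handling.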