Converting indentation with preg_replace (no callback)

22

I have some XML chunk returned by DOMDocument::saveXML(). It is already nicely indented, with two spaces per level, like so:

<?xml version="1.0"?>
<root>
  <error>
    <a>eee</a>
    <b>sd</b>
  </error>
</root>

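Since DOMDocument cannot be configured (AFAIK) to use a different indentation character, I thought I could run a regular expression over the output and change the indentation by replacing every pair of leading spaces with a tab. This can be done with a callback function (Demo):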

<code>$xml_string = $doc->saveXML();
function callback($m)
{
    $spaces = strlen($m[0]);
    $tabs = $spaces / 2;
    return str_repeat("\t", $tabs);
}
$xml_string = preg_replace_callback('/^(?:[ ]{2})+/um', 'callback', $xml_string);
</code>

I'm now wondering whether this can be done without a callback function (and without the e modifier (EVAL)). Any regex experts with an idea?

2 Answers

24
You can use \G:
preg_replace('/^  |\G  /m', "\t", $string);
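
To see why this only touches leading whitespace: the first alternative starts a match at the beginning of a line, and \G can only continue at the position where the previous match ended. Here is a small check on a made-up sample (the input string is an assumption for illustration, not part of the benchmark data):

```php
<?php
// Hypothetical sample input: one level, two levels, and a double space
// inside element text that must stay untouched.
$in = "<root>\n  <a>x  y</a>\n    <b/>\n";

// "^  " matches the first pair of spaces on a line (with /m).
// "\G  " matches the next pair only if it starts exactly where the
// previous match ended, so the chain stops at the first non-space.
$out = preg_replace('/^  |\G  /m', "\t", $in);

// Leading pairs become tabs; the pair between "x" and "y" is preserved.
var_dump($out === "<root>\n\t<a>x  y</a>\n\t\t<b/>\n");
```

An odd number of leading spaces leaves one trailing space after the tabs, since only complete pairs are replaced.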

I ran some benchmarks and got the following results on Win32 with PHP 5.2 and 5.4:

>php -v
PHP 5.2.17 (cli) (built: Jan  6 2011 17:28:41)
Copyright (c) 1997-2010 The PHP Group
Zend Engine v2.2.0, Copyright (c) 1998-2010 Zend Technologies

>php -n test.php
XML length: 21100
Iterations: 1000
callback: 2.3627231121063
\G:       1.4221360683441
while:    3.0971200466156
/e:       7.8781840801239


>php -v
PHP 5.4.0 (cli) (built: Feb 29 2012 19:06:50)
Copyright (c) 1997-2012 The PHP Group
Zend Engine v2.4.0, Copyright (c) 1998-2012 Zend Technologies

>php -n test.php
XML length: 21100
Iterations: 1000
callback: 1.3771259784698
\G:       1.4414191246033
while:    2.7389969825745
/e:       5.5516891479492

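Surprisingly, the callback is faster than \G in PHP 5.4 (although that seems to depend on the data; \G is faster in some other cases).
For \G, /^  |\G  /m is used, which is a bit faster than /(?:^|\G)  /m. /(?>^|\G)  /m is even slower than /(?:^|\G)  /m. The /u, /S and /X switches did not noticeably affect \G performance.
The while replace is fastest when the depth is low (up to about 4 indentations, 8 spaces, in my test), but it gets slower as the depth increases.
The following code was used: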
<?php

$base_iter = 1000;

$xml_string = str_repeat(<<<_STR_
<?xml version="1.0"?>
<root>
  <error>
    <a>  eee  </a>
    <b>  sd    </b>         
    <c>
            deep
                deeper  still
                    deepest  !
    </c>
  </error>
</root>
_STR_
, 100);


//*** while ***

$re = '%# Match leading spaces following leading tabs.
    ^                     # Anchor to start of line.
    (\t*)                 # $1: Preserve any/all leading tabs.
    [ ]{2}                # Match two spaces.
    %mx';

function conv_indent_while($xml_string) {
    global $re;

    while(preg_match($re, $xml_string))
        $xml_string = preg_replace($re, "$1\t", $xml_string);

    return $xml_string;
}


//*** \G ****

function conv_indent_g($string){
    return preg_replace('/^  |\G  /m', "\t", $string);
}


//*** callback ***

function callback($m)
{
    $spaces = strlen($m[0]);
    $tabs = $spaces / 2;
    return str_repeat("\t", $tabs);
}
function conv_indent_callback($str){
    return preg_replace_callback('/^(?:[ ]{2})+/m', 'callback', $str);
}


//*** callback /e *** 

function conv_indent_e($str){
    return preg_replace('/^(?:  )+/me', 'str_repeat("\t", strlen("$0")/2)', $str);
}



//*** tests

function test2() {
    global $base_iter;
    global $xml_string;
    $t = microtime(true);

    for($i = 0; $i < $base_iter; ++$i){
        $s = conv_indent_while($xml_string);
        if(strlen($s) >= strlen($xml_string))
            exit("strlen invalid 2");
    }

    return (microtime(true) - $t);
}

function test1() {
    global $base_iter;
    global $xml_string;
    $t = microtime(true);

    for($i = 0; $i < $base_iter; ++$i){
        $s = conv_indent_g($xml_string);
        if(strlen($s) >= strlen($xml_string))
            exit("strlen invalid 1");
    }

    return (microtime(true) - $t);
}

function test0(){
    global $base_iter;
    global $xml_string;
    $t = microtime(true);

    for($i = 0; $i < $base_iter; ++$i){     
        $s = conv_indent_callback($xml_string);
        if(strlen($s) >= strlen($xml_string))
            exit("strlen invalid 0");
    }

    return (microtime(true) - $t);
}


function test3(){
    global $base_iter;
    global $xml_string;
    $t = microtime(true);

    for($i = 0; $i < $base_iter; ++$i){     
        $s = conv_indent_e($xml_string);
        if(strlen($s) >= strlen($xml_string))
            exit("strlen invalid 02");
    }

    return (microtime(true) - $t);
}



echo 'XML length: ' . strlen($xml_string) . "\n";
echo 'Iterations: ' . $base_iter . "\n";

echo 'callback: ' . test0() . "\n";
echo '\G:       ' . test1() . "\n";
echo 'while:    ' . test2() . "\n";
echo '/e:       ' . test3() . "\n";


?>

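This doesn't do it (Demo), but I'll look up what \G stands for. - hakre
@hakre, that's weird, it works fine in Perl. \G should match at the end of the previous match. - Qtax
According to PCRE, \G in Perl differs from PCRE because PCRE needs an offset. @NikiC: Does it depend on the PCRE library? It looks like newer versions store the offset of the last match and use it. - hakre
At least the version difference might explain this: Viper: 6.6 06-Feb-2006; Codepad.org: 3.9 02-Jan-2002 - I think I can safely ignore it; otherwise, if the feature is supported, this makes a nice test case. Thanks for your quick answer! - hakre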
3
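Yes, this is the best answer. My initial test was run against a non-representative data set. I have corrected my answer and provided the benchmark script I used. Thanks again for the \G example! - ridgerunner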

5
The following simplistic solution first comes to mind:
$xml_string = str_replace('  ', "\t", $xml_string);
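
A quick check of what that does to spaces that are not indentation (the sample string is made up for illustration):

```php
<?php
// Hypothetical sample: one indented element whose text contains a double space.
$in  = "  <a>x  y</a>\n";
$out = str_replace('  ', "\t", $in);

// Both the leading pair and the pair inside the element text become tabs.
var_dump($out === "\t<a>x\ty</a>\n");
```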

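But I assume you want to limit the replacement to leading whitespace only. For that case, your current solution looks pretty clean to me. That said, you can do it without a callback or the e modifier, but you need to run the replacement repeatedly in a loop to get the job done, like so: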

$re = '%# Match leading spaces following leading tabs.
    ^                     # Anchor to start of line.
    (\t*)                 # $1: Preserve any/all leading tabs.
    [ ]{2}                # Match two spaces.
    %umx';
while(preg_match($re, $xml_string))
    $xml_string = preg_replace($re, "$1\t", $xml_string);

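Surprisingly, my testing shows this to be nearly twice as fast as the callback method. (I would have guessed the opposite.)
Note that Qtax has an elegant solution that works just fine (I gave it my +1). However, my benchmarks show it to be slower than the original callback method. I think this is because the expression /(?:^|\G)  /um does not allow the regex engine to take advantage of the "anchor at the beginning of the pattern" internal optimization. The RE engine is forced to test the pattern against each and every position in the target string. With a pattern that begins with the ^ anchor, the RE engine only needs to check at the beginning of each line, which lets it match much faster.
Excellent question! +1
Addendum/Correction:
I must apologize, because the performance statements I made above are wrong. I ran the regexes against only one (non-representative) test file, which had mostly tabs in its leading whitespace. When tested against a more realistic file with lots of leading spaces, my looping method above performs noticeably slower than the other two methods.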
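
Part of the cost of the while loop above is that every pass scans the string twice: once in preg_match() and once in preg_replace(). A minimal sketch of a variant that lets preg_replace() report whether it changed anything, through its optional $count argument, so each pass scans only once. The function name is made up for illustration, and I have not benchmarked this variant:

```php
<?php
// Sketch only: same pattern as the looping method, one regex call per pass.
// tabify_leading_spaces_count() is a hypothetical name, not part of test.php.
function tabify_leading_spaces_count($xml_string) {
    $re = '%^(\t*)[ ]{2}%m';
    do {
        // $count receives the number of replacements made in this pass.
        $xml_string = preg_replace($re, "$1\t", $xml_string, -1, $count);
    } while ($count > 0); // Stop once a pass finds no leading space pair.
    return $xml_string;
}

var_dump(tabify_leading_spaces_count("<r>\n    <b/>\n") === "<r>\n\t\t<b/>\n");
```

It still needs one pass per indentation level plus a final pass that finds nothing, so deep indentation remains its weak spot.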
In case anyone is interested, here is the benchmark script I used to measure the performance of each regex:
<?php // test.php 20120308_1200
require_once('inc/benchmark.inc.php');

// -------------------------------------------------------
// Test 1: Recursive method. (ridgerunner)
function tabify_leading_spaces_1($xml_string) {
    $re = '%# Match leading spaces following leading tabs.
        ^                     # Anchor to start of line.
        (\t*)                 # $1: Any/all leading tabs.
        [ ]{2}                # Match two spaces.
        %umx';
    while(preg_match($re, $xml_string))
        $xml_string = preg_replace($re, "$1\t", $xml_string);
    return $xml_string;
}

// -------------------------------------------------------
// Test 2: Original callback method. (hakre)
function tabify_leading_spaces_2($xml_string) {
    return preg_replace_callback('/^(?:[ ]{2})+/um', '_callback', $xml_string);
}
function _callback($m) {
    $spaces = strlen($m[0]);
    $tabs = $spaces / 2;
    return str_repeat("\t", $tabs);
}

// -------------------------------------------------------
// Test 3: Qtax's elegantly simple \G method. (Qtax)
function tabify_leading_spaces_3($xml_string) {
    return preg_replace('/(?:^|\G)  /um', "\t", $xml_string);
}

// -------------------------------------------------------
// Verify we get the same results from all methods.
$data = file_get_contents('testdata.txt');
$data1 = tabify_leading_spaces_1($data);
$data2 = tabify_leading_spaces_2($data);
$data3 = tabify_leading_spaces_3($data);
if ($data1 == $data2 && $data2 == $data3) {
    echo ("GOOD: Same results.\n");
} else {
    exit("BAD: Different results.\n");
}
// Measure and print the function execution times.
$time1 = benchmark_12('tabify_leading_spaces_1', $data, 2, true);
$time2 = benchmark_12('tabify_leading_spaces_2', $data, 2, true);
$time3 = benchmark_12('tabify_leading_spaces_3', $data, 2, true);
?>

The script above uses the following handy little benchmarking function that I wrote some time ago:

benchmark.inc.php

<?php // benchmark.inc.php
/*----------------------------------------------------------------------------
 function benchmark_12($funcname, $p1, $reptime = 1.0, $verbose = false, $p2 = NULL) {}
    By: Jeff Roberson
    Created:        2010-03-17
    Last edited:    2012-03-08

Discussion:
    This function measures the time required to execute a given function by
    calling it as many times as possible within an allowed period == $reptime.
    A first pass determines a rough measurement of function execution time
    by increasing the $nreps count by a factor of 10 - (i.e. 1, 10, 100, ...),
    until an $nreps value is found which takes more than 0.01 secs to finish.
    A second pass uses the value determined in the first pass to compute the
    number of reps that can be performed within the allotted $reptime seconds.
    The second pass then measures the time required to call the function the
    computed number of times (which should take about $reptime seconds). The
    average function execution time is then computed by dividing the total
    measured elapsed time by the number of reps performed in that time, and
    then all the pertinent values are returned to the caller in an array.

    Note that this function is limited to measuring only those functions
    having either one or two arguments that are passed by value and
    not by reference. This is why the name of this function ends with "12".
    Variations of this function can be easily cloned which can have more
    than two parameters.

Parameters:
    $funcname:  String containing name of function to be measured. The
                function to be measured must take one or two parameters.
    $p1:        First argument to be passed to $funcname function.
    $reptime    Target number of seconds allowed for benchmark test.
                (float) (Default=1.0)
    $verbose    Boolean value determines if results are printed.
                (bool) (Default=false)
    $p2:        Second (optional) argument to be passed to $funcname function.
Return value:
    $result[]   Array containing measured and computed values:
    $result['funcname']     : $funcname - Name of function measured.
    $result['msg']          : $msg - String with formatted results.
    $result['nreps']        : $nreps - Number of function calls made.
    $result['time_total']   : $time - Seconds to call function $nreps times.
    $result['time_func']    : $t_func - Seconds to call function once.
    $result['result']       : $result - Last value returned by function.

Variables:
    $time:      Float epoch time (secs since 1/1/1970) or benchmark elapsed secs.
    $i:         Integer loop counter.
    $nreps      Number of times function called in benchmark measurement loops.

----------------------------------------------------------------------------*/
function benchmark_12($funcname, $p1, $reptime = 1.0, $verbose = false, $p2 = NULL) {
    if (!function_exists($funcname)) {
        exit("\n[benchmark1] Error: function \"{$funcname}()\" does not exist.\n");
    }
    if (!isset($p2)) { // Case 1: function takes one parameter ($p1).
    // Pass 1: Measure order of magnitude number of calls needed to exceed 10 milliseconds.
        for ($time = 0.0, $n = 1; $time < 0.01; $n *= 10) { // Exponentially increase $nreps.
            $time = microtime(true);            // Mark start time. (sec since 1970).
            for ($i = 0; $i < $n; ++$i) {       // Loop $n times. ($n = 1, 10, 100...)
                $result = ($funcname($p1));     // Call the function over and over...
            }
            $time = microtime(true) - $time;    // Mark stop time. Compute elapsed secs.
            $nreps = $n;                        // Number of reps just measured.
        }
        $t_func = $time / $nreps;               // Function execution time in sec (rough).
    // Pass 2: Measure time required to perform $nreps function calls (in about $reptime sec).
        if ($t_func < $reptime) {               // If pass 1 time was not pathetically slow...
            $nreps = (int)($reptime / $t_func); // Figure $nreps calls to add up to $reptime.
            $time = microtime(true);            // Mark start time. (sec since 1970).
            for ($i = 0; $i < $nreps; ++$i) {   // Loop $nreps times (should take $reptime).
                $result = ($funcname($p1));     // Call the function over and over...
            }
            $time = microtime(true) - $time;    // Mark stop time. Compute elapsed secs.
            $t_func = $time / $nreps;           // Average function execution time in sec.
        }
    } else { // Case 2: function takes two parameters ($p1 and $p2).
    // Pass 1: Measure order of magnitude number of calls needed to exceed 10 milliseconds.
        for ($time = 0.0, $n = 1; $time < 0.01; $n *= 10) { // Exponentially increase $nreps.
            $time = microtime(true);            // Mark start time. (sec since 1970).
            for ($i = 0; $i < $n; ++$i) {       // Loop $n times. ($n = 1, 10, 100...)
                $result = ($funcname($p1, $p2));     // Call the function over and over...
            }
            $time = microtime(true) - $time;    // Mark stop time. Compute elapsed secs.
            $nreps = $n;                        // Number of reps just measured.
        }
        $t_func = $time / $nreps;               // Function execution time in sec (rough).
    // Pass 2: Measure time required to perform $nreps function calls (in about $reptime sec).
        if ($t_func < $reptime) {               // If pass 1 time was not pathetically slow...
            $nreps = (int)($reptime / $t_func); // Figure $nreps calls to add up to $reptime.
            $time = microtime(true);            // Mark start time. (sec since 1970).
            for ($i = 0; $i < $nreps; ++$i) {   // Loop $nreps times (should take $reptime).
                $result = ($funcname($p1, $p2));     // Call the function over and over...
            }
            $time = microtime(true) - $time;    // Mark stop time. Compute elapsed secs.
            $t_func = $time / $nreps;           // Average function execution time in sec.
        }
    }
    $msg = sprintf("%s() Nreps:%7d  Time:%7.3f s  Function time: %.6f sec\n",
            $funcname, $nreps, $time, $t_func);
    if ($verbose) echo($msg);
    return array('funcname' => $funcname, 'msg' => $msg, 'nreps' => $nreps,
        'time_total' => $time, 'time_func' => $t_func, 'result' => $result);
}
?>

When I run test.php with the code in benchmark.inc.php, these are the results I get:

GOOD: Same results.
tabify_leading_spaces_1() Nreps:   1756  Time:  2.041 s  Function time: 0.001162 sec
tabify_leading_spaces_2() Nreps:   1738  Time:  1.886 s  Function time: 0.001085 sec
tabify_leading_spaces_3() Nreps:   2161  Time:  2.044 s  Function time: 0.000946 sec

Bottom line: I recommend using Qtax's method.
Thanks, Qtax!

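I can hardly believe that \G would be slower than this. For every depth of tabs to replace, you have to read through the string twice. That should be much slower than a single pass over the string. - Qtax
I'd love to see your benchmark script so it is easier to review. Tweaking it should be helpful. I'd also like to add some bounty once the current bounty period is over. I haven't done any extensive testing so far. - hakre
I added some benchmarks to my answer: \G is fastest in 5.2, the callback is fastest in 5.4, and this solution is fastest if the indentation depth is low. - Qtax
@Qtax - You are right. The test file I ran against was not representative. See the updated answer. - ridgerunner
@hakre - I have updated my answer and provided the benchmark code used for testing. - ridgerunner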
