Parsing non-node, intermittent XML values with regular expressions

Question

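This is a question about regular expressions.

If I have a series of XML nodes, I want to use a regular expression to extract the text values contained at the same level as my current node. For example, given the following: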

<top-node>
    Hi
    <second-node>
        Hello
        <inner-node>
        </inner-node>
    </second-node>
    Hey
    <third-node>
       Foo
    </third-node>
    Bar
</top-node>

I would like to retrieve an array like this:

array(
    1 => 'Hi',
    2 => 'Hey',
    3 => 'Bar'
)

I know I can start with the following:

$inside = preg_match('~<(\S+).*?>(?P<inside>(.|\s)*)</\1>~', $original_text);

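That retrieves the text without the top-level node. However, the next step is a bit beyond my regex skills.

Edit: actually, that preg_match only seems to work when $original_text is all on one line. I also think I could use preg_split with a very similar regular expression to retrieve what I'm looking for; it just doesn't work across multiple lines.

Note: I appreciate and will respond to any requests for clarification. However, my question is very specific and I mean exactly what I am asking, so please don't give answers like "go use SimpleXML". Thanks for all your help.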

- MirroredFate

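Thanks. That was a mistake; it should be "Hi". I'll fix it. - MirroredFate

Some (related) humor: https://dev59.com/X3I-5IYBdhLWcg3wq6do#1732454, https://dev59.com/nWoy5IYBdhLWcg3wZtNC, http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html - GitaarLAB

Haha, the first link made me laugh hard. - MirroredFate

Are you still continuing your quest, or are you ready to start on a parser? - GitaarLAB

No, I'd rather not use an extra library for what should be a relatively simple task. - MirroredFate

You say "don't give answers like 'go use SimpleXML'", but that is the answer. - Andy Lester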

2 Answers

Building on your own idea of using preg_split, I came up with the following:

$raw="<top-node>
    Hi
    <second-node>
        Hello
        <inner-node>
        </inner-node>
    </second-node>
    Hey
    <third-node>
       Foo
    </third-node>
    Bar
</top-node>";

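// Match one element: group 1 is the tag name, group 2 its content (the s flag lets . span newlines)
$reg='~<(\S+).*?>(.*?)</\1>~s';
preg_match_all($reg, $raw, $res);
// Replace each child element inside the top node with chr(31), then split on that character
$res = explode(chr(31), preg_replace($reg, chr(31), $res[2][0]));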

Note that chr(31) is the "unit separator" character.

Test the resulting array with:

echo ("<xmp>start\n" . print_r($res, true) . "\nfin</xmp>");

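That seems to work for a single node and gives you the array you want, but it may have all sorts of problems. You will probably also want to trim the returned values. Edit: Denomales' answer is probably better.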
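For illustration, here is a minimal sketch of the same idea that calls preg_split directly and trims the pieces, as suggested above. The function name topLevelText is hypothetical, and the sketch assumes the input holds a single outer element whose children do not contain same-named nested tags.

```php
<?php
// Hypothetical helper: returns the text pieces that sit directly inside the outer element.
function topLevelText(string $xml): array
{
    // Same pattern as above: group 1 = tag name, group 2 = element content.
    $reg = '~<(\S+).*?>(.*?)</\1>~s';

    // Grab the content of the outer element first.
    if (!preg_match($reg, $xml, $outer)) {
        return [];
    }

    // Split that content wherever a child element occurs.
    $pieces = preg_split($reg, $outer[2]);
    if ($pieces === false) {
        return [];
    }

    // Trim whitespace and drop pieces that end up empty.
    $pieces = array_filter(array_map('trim', $pieces), 'strlen');

    return array_values($pieces);
}

$raw = "<top-node>
    Hi
    <second-node>
        Hello
        <inner-node>
        </inner-node>
    </second-node>
    Hey
    <third-node>
       Foo
    </third-node>
    Bar
</top-node>";

print_r(topLevelText($raw));
```

With the sample input this should yield the three values Hi, Hey and Bar, re-indexed from 0 rather than 1.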

- GitaarLAB

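This is basically the conclusion I reached after asking the question. Unfortunately, I ran into a problem: if the string I'm matching exceeds a certain length, it stops working properly. - MirroredFate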

- Ro Yo Mi · Accepted Answer

Description

This regular expression will capture the first-level text.

(?:[\s\r\n]*<([^>\s]+)\s?(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*>.*?<\/\1>)?[\s\r\n]*\K(?!\Z)(?:(?![\s\r\n]*(?:<|\Z)).)*

Expanded

(?:[\s\r\n]*<([^>\s]+)\s?(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*>.*?<\/\1>)?   # match any open tags until the close tags if they exist
[\s\r\n]*    # match any leading spaces or new line characters 
\K           # reset the capture and only capture the desired substring which follows
(?!\Z)       # validate substring is not the end of the string, this prevents the phantom empty array value at the end
(?:(?![\s\r\n]*(?:<|\Z)).)*    # capture the text inside the current substring, this expression is self limiting and will stop when it sees whitespace ahead followed by end of string or a new tag

Example

Sample text

This assumes you have already removed the outer top-level tags.

Hi
<second-node>
    Hello
    <inner-node>
    </inner-node>
</second-node>
Hey
<third-node>
   Foo
</third-node>
Bar

Capture groups

0: the actual matched text
1: the name of the child tag, which is back-referenced within the regular expression

[0] => Array
    (
        [0] => Hi
        [1] => Hey
        [2] => Bar
    )

[1] => Array
    (
        [0] => 
        [1] => second-node
        [2] => third-node
    )
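As a rough sketch, the sample above might be reproduced in PHP with preg_match_all. The s modifier is an assumption on my part (the answer does not state which flags were used); it lets .*? cross the line breaks inside child elements. The variable names are only illustrative.

```php
<?php
// The pattern from the Description, placed in a nowdoc so no quotes need escaping.
// Assumption: the s (dotall) modifier, so that . also matches newlines.
$pattern = <<<'REGEX'
~(?:[\s\r\n]*<([^>\s]+)\s?(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*>.*?<\/\1>)?[\s\r\n]*\K(?!\Z)(?:(?![\s\r\n]*(?:<|\Z)).)*~s
REGEX;

// The sample text, with the outer top-level tags already removed.
$text = <<<'XML'
Hi
<second-node>
    Hello
    <inner-node>
    </inner-node>
</second-node>
Hey
<third-node>
   Foo
</third-node>
Bar
XML;

// $matches[0] holds the first-level text, $matches[1] the preceding child tag names.
preg_match_all($pattern, $text, $matches);
print_r($matches);
```

Because of \K, everything matched before it (the skipped child element and the leading whitespace) is dropped from group 0, which is why only the bare text shows up there.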

Disclaimer

This solution will trip up on nested structures, such as:

Hi
<second-node>
    Hello
    <second-node>
    </second-node>
    This string will be found
</second-node>
Hey
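A small sketch of how this limitation could be observed, again assuming the s modifier and using illustrative variable names. The lazy .*? stops at the first closing </second-node>, which belongs to the inner element, so the text that follows it is treated as first-level text.

```php
<?php
// Same pattern as in the Description; the s modifier is an assumption.
$pattern = <<<'REGEX'
~(?:[\s\r\n]*<([^>\s]+)\s?(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*>.*?<\/\1>)?[\s\r\n]*\K(?!\Z)(?:(?![\s\r\n]*(?:<|\Z)).)*~s
REGEX;

// Nested input: an element that contains a child with the same tag name.
$nested = <<<'XML'
Hi
<second-node>
    Hello
    <second-node>
    </second-node>
    This string will be found
</second-node>
Hey
XML;

preg_match_all($pattern, $nested, $matches);

// The inner text leaks into the results even though it is not at the first level.
$leaked = in_array('This string will be found', $matches[0], true);
var_dump($leaked);
```

If same-named nesting can occur in your input, a regular expression of this shape cannot tell the inner closing tag from the outer one.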