XPath expression: selecting the elements between A HREF="expr" tags

Question

5

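I haven't found a clear way to select all the nodes that sit between two anchors (<a></a> tag pairs) in an HTML file.

The first anchor has the following format:

<a href="file://START..."></a>

The second anchor:

<a href="file://END..."></a>

I have verified that both can be selected using starts-with (note that I am using HTML Agility Pack):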

HtmlNode n0 = html.DocumentNode.SelectSingleNode("//a[starts-with(@href,'file://START')]");
HtmlNode n1 = html.DocumentNode.SelectSingleNode("//a[starts-with(@href,'file://END')]");

With this in mind, and with my amateur XPath skills, I wrote the following expression to get all the tags between the two anchors:

html.DocumentNode.SelectNodes("//*[not(following-sibling::a[starts-with(@href,'file://START0')]) and not (preceding-sibling::a[starts-with(@href,'file://END0')])]");

This seems to work, but it selects the whole HTML document!

I need it to work on the following HTML fragment:

<html>
...

<a href="file://START0"></a>
<p>First nodes</p>
<p>First nodes
    <span>X</span>
</p>
<p>First nodes</p>
<a href="file://END0"></a>

...
</html>

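The goal is to remove both anchors and the three P elements (including, of course, the inner SPAN).

Is there a way to do this?

I don't know whether XPath 2.0 offers a better way to achieve this.

*EDIT (special case!)*

I should also handle the following case:

"Select the tags between X and X', where X is <p><a href="file://..."></a></p>"

So instead of: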

<a href="file://START..."></a>
<!-- xhtml to be extracted -->
<a href="file://END..."></a>

I should also handle this:

<p>
  <a href="file://START..."></a>
</p>
<!-- xhtml to be extracted -->

<p>
  <a href="file://END..."></a>
</p>

Thanks a lot, again.

- Hernán

1

Good question, +1. See my answer, which includes two solutions (XPath 1.0 and XPath 2.0), an explanation, and their verification with XSLT as the XPath host. - Dimitre Novatchev

2 Answers

2

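"I have added a special case that needs to be handled."

To handle this special case you can work in the same way, that is, by using the Kayessian formula (and also XPath Visualizer ;-)). The intersecting node-sets change as follows:

Intersecting node-set C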

    "//p[.//a[starts-with(@href,'file://START')]]
         /following-sibling::node()"

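All following siblings of the p element that contains the START anchor.

Intersecting node-set D

"./following-sibling::p[.//a[starts-with(@href,'file://END')]]
    /preceding-sibling::node()"

All preceding siblings of the p element that contains the END anchor, where that p is itself a following sibling of the current node.

Now you can perform the intersection: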

C ∩ D

that is:

    "//p[.//a[starts-with(@href,'file://START')]]
            /following-sibling::node()[
            count(.| ./following-sibling::p
                     [.//a[starts-with(@href,'file://END')]]
                       /preceding-sibling::node())
            =
            count(./following-sibling::p
                   [.//a[starts-with(@href,'file://END')]]
                     /preceding-sibling::node())
            ]"
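As a rough check of the C ∩ D expression above, here is a minimal sketch that evaluates it with Java's built-in javax.xml.xpath (XPath 1.0) rather than HTML Agility Pack. The class name, the helper selectElementNames, and the sample markup with div elements are assumptions made up for illustration; only the XPath string comes from this answer.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class WrappedAnchors {
    // Hypothetical sample: both anchors are wrapped in <p> elements.
    static final String XML =
        "<html>"
        + "<p><a href='file://START0'></a></p>"
        + "<div>one</div>"
        + "<div>two</div>"
        + "<p><a href='file://END0'></a></p>"
        + "<div>after</div>"
        + "</html>";

    // The C ∩ D expression from the answer, written on one logical line.
    static final String BETWEEN =
        "//p[.//a[starts-with(@href,'file://START')]]/following-sibling::node()["
        + "count(. | ./following-sibling::p[.//a[starts-with(@href,'file://END')]]"
        + "/preceding-sibling::node())"
        + " = "
        + "count(./following-sibling::p[.//a[starts-with(@href,'file://END')]]"
        + "/preceding-sibling::node())]";

    // Illustrative helper: evaluates an XPath 1.0 expression and returns
    // the names of the selected element nodes (text nodes are skipped).
    static List<String> selectElementNames(String xml, String xpath) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList hits = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(xpath, doc, XPathConstants.NODESET);
            List<String> names = new ArrayList<>();
            for (int i = 0; i < hits.getLength(); i++) {
                Node n = hits.item(i);
                if (n.getNodeType() == Node.ELEMENT_NODE) {
                    names.add(n.getNodeName());
                }
            }
            return names;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Only the two div elements between the wrapping p elements are selected.
        System.out.println(selectElementNames(XML, BETWEEN)); // prints [p-free list] [div, div]
    }
}
```

The p that wraps the END anchor and the trailing div are excluded because neither has a following sibling p containing an END anchor, so node-set D is empty for them and the two counts differ.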

If you need to manage both cases at once, you can combine the intersecting node-sets as:

(A ∩ B) ∪ (C ∩ D)

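where:

the union ∪ must be written with the XPath union operator |;
node-sets A and B are the ones shown in @Dimitre's answer;
node-sets C and D are the ones shown in my answer.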
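The combined selection (A ∩ B) ∪ (C ∩ D) could be sketched as follows, again with Java's built-in XPath 1.0 engine standing in for HTML Agility Pack. The class name, the helper, and the sample document (one bare-anchor pair and one p-wrapped pair) are hypothetical; the two sub-expressions are the ones given in the two answers, joined with |.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class BothAnchorStyles {
    // Hypothetical sample: the first div uses bare anchors, the second wraps them in p.
    static final String XML =
        "<html>"
        + "<div><a href='file://START0'></a><p>bare</p><a href='file://END0'></a></div>"
        + "<div><p><a href='file://START1'></a></p><ul>wrapped</ul>"
        + "<p><a href='file://END1'></a></p></div>"
        + "</html>";

    // A ∩ B: siblings between bare START and END anchors.
    static final String A_AND_B =
        "//a[starts-with(@href,'file://START')]/following-sibling::node()["
        + "count(. | //a[starts-with(@href,'file://END')]/preceding-sibling::node())"
        + " = count(//a[starts-with(@href,'file://END')]/preceding-sibling::node())]";

    // C ∩ D: siblings between p elements that wrap the anchors.
    static final String C_AND_D =
        "//p[.//a[starts-with(@href,'file://START')]]/following-sibling::node()["
        + "count(. | ./following-sibling::p[.//a[starts-with(@href,'file://END')]]"
        + "/preceding-sibling::node())"
        + " = count(./following-sibling::p[.//a[starts-with(@href,'file://END')]]"
        + "/preceding-sibling::node())]";

    // The union is expressed with the XPath | operator.
    static final String UNION = "(" + A_AND_B + ") | (" + C_AND_D + ")";

    // Illustrative helper: names of the element nodes selected by an expression.
    static List<String> selectElementNames(String xml, String xpath) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList hits = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(xpath, doc, XPathConstants.NODESET);
            List<String> names = new ArrayList<>();
            for (int i = 0; i < hits.getLength(); i++) {
                Node n = hits.item(i);
                if (n.getNodeType() == Node.ELEMENT_NODE) {
                    names.add(n.getNodeName());
                }
            }
            return names;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Expected to contain the p from the bare pair and the ul from the wrapped pair.
        System.out.println(selectElementNames(XML, UNION));
    }
}
```

Note that the A ∩ B part uses absolute paths, so with several START/END pairs in one document it intersects against the preceding siblings of every END anchor; whether that is acceptable depends on the real input.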

- Emiliano Poggi

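Really great, thanks a lot! XPath 2.0 seems to make these set operations much easier; unfortunately, there is no 2.0 support in .NET 3! - Hernán

It seems you could write $n1 intersect $n2 in place of $n1[count(.|$n2)=count($n2)]. Either way, selecting the node-sets is the tricky part. - Emiliano Poggi

@Dimitre: That means I'm definitely a good guy! :D Thanks - Emiliano Poggi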


- Dimitre Novatchev · Accepted Answer

Use this XPath 1.0 expression:

//a[starts-with(@href,'file://START')]/following-sibling::node()
     [count(.| //a[starts-with(@href,'file://END')]/preceding-sibling::node())
     =
      count(//a[starts-with(@href,'file://END')]/preceding-sibling::node())
     ]

Or use this XPath 2.0 expression:

    //a[starts-with(@href,'file://START')]/following-sibling::node()
  intersect
    //a[starts-with(@href,'file://END')]/preceding-sibling::node()

The XPath 2.0 expression uses the XPath 2.0 intersect operator.

The XPath 1.0 expression uses the Kayessian formula (after @Michael Kay) for the intersection of two node-sets:

$ns1[count(.|$ns2) = count($ns2)]
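To make the formula concrete, the sketch below builds the intersection expression from two node-set expressions and evaluates it with Java's built-in javax.xml.xpath (XPath 1.0). The class name and the helpers intersect and selectElementNames are assumptions for illustration, not part of any library; the plain string substitution also assumes that ns2 is an absolute path, so that it means the same thing inside the predicate.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class KayessianIntersection {
    // Sample modelled on the fragment from the question.
    static final String XML =
        "<html>..."
        + "<a href='file://START0'></a>"
        + "<p>First nodes</p>"
        + "<p>First nodes <span>X</span></p>"
        + "<p>First nodes</p>"
        + "<a href='file://END0'></a>..."
        + "</html>";

    static final String NS1 =
        "//a[starts-with(@href,'file://START')]/following-sibling::node()";
    static final String NS2 =
        "//a[starts-with(@href,'file://END')]/preceding-sibling::node()";

    // Hypothetical helper: spells out $ns1[count(.|$ns2) = count($ns2)].
    // A node of ns1 is kept only if adding it to ns2 does not grow ns2,
    // which is the case exactly when it is already a member of ns2.
    static String intersect(String ns1, String ns2) {
        return ns1 + "[count(. | " + ns2 + ") = count(" + ns2 + ")]";
    }

    // Illustrative helper: names of the element nodes selected by an expression.
    static List<String> selectElementNames(String xml, String xpath) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList hits = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(xpath, doc, XPathConstants.NODESET);
            List<String> names = new ArrayList<>();
            for (int i = 0; i < hits.getLength(); i++) {
                Node n = hits.item(i);
                if (n.getNodeType() == Node.ELEMENT_NODE) {
                    names.add(n.getNodeName());
                }
            }
            return names;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The three p elements lie after START and before END.
        System.out.println(selectElementNames(XML, intersect(NS1, NS2))); // prints [p, p, p]
    }
}
```

The END anchor itself follows START but is not among its own preceding siblings, so the union in the predicate has one more node than ns2 and the anchor is filtered out.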

XSLT-based verification:

This XSLT 1.0 transformation:

<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>
 <xsl:strip-space elements="*"/>

 <xsl:template match="/">
  <xsl:copy-of select=
  "    //a[starts-with(@href,'file://START')]/following-sibling::node()
         [count(.| //a[starts-with(@href,'file://END')]/preceding-sibling::node())
         =
          count(//a[starts-with(@href,'file://END')]/preceding-sibling::node())
         ]
  "/>
 </xsl:template>
</xsl:stylesheet>

when applied to the provided XML document:

<html>...
    <a href="file://START0"></a>
    <p>First nodes</p>
    <p>First nodes    
        <span>X</span>
    </p>
    <p>First nodes</p>
    <a href="file://END0"></a>...
</html>

produces the wanted, correct result:

<p>First nodes</p>
<p>First nodes    
        <span>X</span>
</p>
<p>First nodes</p>

This XSLT 2.0 transformation:

<xsl:stylesheet version="2.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>
 <xsl:strip-space elements="*"/>

 <xsl:template match="/">
  <xsl:copy-of select=
  " //a[starts-with(@href,'file://START')]/following-sibling::node()
   intersect
    //a[starts-with(@href,'file://END')]/preceding-sibling::node()
  "/>
 </xsl:template>
</xsl:stylesheet>

when applied to the same XML document (above), again produces exactly the wanted result.