Finding consecutive sibling nodes with XPath

Question

4

For an XPath expert, these should be easy points! :)

Document structure:

<tokens>
  <token>
    <word>Newt</word><entityType>PROPER_NOUN</entityType>
  </token>
  <token>
    <word>Gingrich</word><entityType>PROPER_NOUN</entityType>
  </token>
  <token>
    <word>admires</word><entityType>VERB</entityType>
  </token>
  <token>
    <word>Garry</word><entityType>PROPER_NOUN</entityType>
  </token>
  <token>
    <word>Trudeau</word><entityType>PROPER_NOUN</entityType>
  </token>
</tokens>

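Ignoring the semantic improbability of the document, I want to extract [["Newt", "Gingrich"], ["Garry", "Trudeau"]]. That is: whenever two consecutive tokens both have an entityType of PROPER_NOUN, I want to extract the words from those two tokens.

I've gotten as far as: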

"//token[entityType='PROPER_NOUN']/following-sibling::token[1][entityType='PROPER_NOUN']"

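... which finds the second of two consecutive PROPER_NOUN tokens, but I'm not sure how to get it to emit the first token along with it.

Some notes:

If there are three or more consecutive PROPER_NOUN tokens (call them A, B, C), ideally it would emit [A, B], [B, C].
I don't mind doing higher-level processing of the NodeSets (e.g. in Ruby / Nokogiri) if that simplifies the problem.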

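Update

Here is my solution, using higher-level Ruby functions. But I'm tired of all those XPath bullies kicking sand in my face, and I want to know how real XPath coders do it!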

def extract(doc)
  names = []
  sentences = doc.xpath("//tokens")
  sentences.each do |sentence| 
    tokens = sentence.xpath("token")
    prev = nil
    tokens.each do |token|
      name = token.xpath("word").text if token.xpath("entityType").text == "PROPER_NOUN"
      names << [prev, name] if (name && prev)
      prev = name
    end
  end
  names
end

- fearless_fool

4 Answers

1

This XPath 1.0 expression:

   /*/token
      [entityType='PROPER_NOUN'
     and
       following-sibling::token[1]/entityType = 'PROPER_NOUN'
      ]
       /word

selects all "first-in-pair noun words".

This XPath expression:

/*/token
  [entityType='PROPER_NOUN'
 and
   preceding-sibling::token[1]/entityType = 'PROPER_NOUN'
  ]
   /word

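selects all "second-in-pair noun words".

To produce the actual pairs, take the k-th node from each of the two resulting node-sets.

XSLT-based verification: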

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>

 <xsl:template match="/">
  <xsl:copy-of select=
  "/*/token
      [entityType='PROPER_NOUN'
     and
       following-sibling::token[1]/entityType = 'PROPER_NOUN'
      ]
       /word
  "/>
==============
  <xsl:copy-of select=
   "/*/token
      [entityType='PROPER_NOUN'
     and
       preceding-sibling::token[1]/entityType = 'PROPER_NOUN'
      ]
       /word
  "/>
 </xsl:template>
</xsl:stylesheet>

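This transformation simply evaluates the two XPath expressions and outputs their results (with a suitable delimiter to show where the first result ends and the second begins).

When applied to the provided XML document: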

<tokens>
  <token>
    <word>Newt</word><entityType>PROPER_NOUN</entityType>
  </token>
  <token>
    <word>Gingrich</word><entityType>PROPER_NOUN</entityType>
  </token>
  <token>
    <word>admires</word><entityType>VERB</entityType>
  </token>
  <token>
    <word>Garry</word><entityType>PROPER_NOUN</entityType>
  </token>
  <token>
    <word>Trudeau</word><entityType>PROPER_NOUN</entityType>
  </token>
</tokens>

the output is:

<word>Newt</word>
<word>Garry</word>
==============
  <word>Gingrich</word>
<word>Trudeau</word>

Combining (zipping) the two results, which you can do in your favorite programming language, gives:

["Newt", "Gingrich"]

and

["Garry", "Trudeau"]

When the same transformation is applied to this XML document (note that we now have a triple):

<tokens>
  <token>
    <word>Newt</word><entityType>PROPER_NOUN</entityType>
  </token>
  <token>
    <word>Gingrich</word><entityType>PROPER_NOUN</entityType>
  </token>
  <token>
    <word>Rep</word><entityType>PROPER_NOUN</entityType>
  </token>
  <token>
    <word>admires</word><entityType>VERB</entityType>
  </token>
  <token>
    <word>Garry</word><entityType>PROPER_NOUN</entityType>
  </token>
  <token>
    <word>Trudeau</word><entityType>PROPER_NOUN</entityType>
  </token>
</tokens>

the result is now:

<word>Newt</word>
<word>Gingrich</word>
<word>Garry</word>
==============
  <word>Gingrich</word>
<word>Rep</word>
<word>Trudeau</word>

Zipping these two results together produces the correct, wanted final result:

["Newt", "Gingrich"],

["Gingrich", "Rep"],

and

["Garry", "Trudeau"]
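The zipping step can be sketched in Ruby, the language the question already uses. This is only an illustration: it assumes the two XPath results have already been evaluated and reduced to arrays of word strings, and the names `firsts` and `seconds` are made up for the example. Zipping works here because every "first-in-pair" token has exactly one matching "second-in-pair" token, and both results come back in document order.

```ruby
# Word texts selected by the two XPath expressions, in document order.
# They are hard-coded here from the triple example; normally they would
# come from evaluating the expressions with an XPath library.
firsts  = ["Newt", "Gingrich", "Garry"]
seconds = ["Gingrich", "Rep", "Trudeau"]

# Pair the k-th "first" word with the k-th "second" word.
pairs = firsts.zip(seconds)

p pairs
# prints [["Newt", "Gingrich"], ["Gingrich", "Rep"], ["Garry", "Trudeau"]]
```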

Note:

The wanted result can be produced with a single XPath 2.0 expression. Let me know if you are interested in an XPath 2.0 solution.

- Dimitre Novatchev

0

XPath returns a node or a node-set, but not groups. So you have to identify the start of each group and then grab the rest.

first = "//token[entityType='PROPER_NOUN' and following-sibling::token[1][entityType='PROPER_NOUN']]/word"
next_word = "../following-sibling::token[1]/word"

doc.xpath(first).map { |word| [word.text, word.xpath(next_word).text] }

Output:

[["Newt", "Gingrich"], ["Garry", "Trudeau"]]

- Mark Thomas

0

XPath alone isn't powerful enough for this task. But it's very easy in XSLT:

<xsl:for-each-group select="token" group-adjacent="entityType">
  <xsl:if test="current-grouping-key() = 'PROPER_NOUN'">
     <xsl:copy-of select="current-group()"/>
     <xsl:text>====</xsl:text>
  </xsl:if>
</xsl:for-each-group>
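
For readers without an XSLT 2.0 processor, the same group-adjacent idea can be sketched in plain Ruby. This is only an analogue, not the method from this answer. It assumes the tokens have already been extracted into an array of [word, entityType] pairs (the `tokens` array below is hard-coded from the question's document, and all variable names are illustrative). `Enumerable#chunk` groups adjacent items that share an entity type, and `each_cons(2)` then turns each PROPER_NOUN run into overlapping pairs.

```ruby
# Tokens as [word, entityType] pairs, hard-coded from the question's
# document for illustration.
tokens = [
  ["Newt", "PROPER_NOUN"],
  ["Gingrich", "PROPER_NOUN"],
  ["admires", "VERB"],
  ["Garry", "PROPER_NOUN"],
  ["Trudeau", "PROPER_NOUN"]
]

# chunk groups adjacent elements that share the same key (the entity
# type), much like group-adjacent in xsl:for-each-group.
runs = tokens.chunk { |_word, type| type }
             .select { |type, _group| type == "PROPER_NOUN" }
             .map { |_type, group| group.map(&:first) }

# Each run of N consecutive proper nouns yields N-1 overlapping pairs.
pairs = runs.flat_map { |run| run.each_cons(2).to_a }

p runs
p pairs
```

With a run of three proper nouns A, B, C, the `each_cons(2)` step would yield [A, B] and [B, C], which matches what the question asks for.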

- Michael Kay

- evil otto · Accepted Answer

I would do this in two steps. The first step is to select a set of nodes:

//token[entityType='PROPER_NOUN' and following-sibling::token[1][entityType='PROPER_NOUN']]

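This gives you all the tokens that start a 2-word pair. Then, to get the actual pairs, iterate over the node list and extract ./word and following-sibling::token[1]/word.

Using XmlStarlet (http://xmlstar.sourceforge.net/ - a great tool for quick XML manipulation), the command line is: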

xml sel -t -m "//token[entityType='PROPER_NOUN' and following-sibling::token[1][entityType='PROPER_NOUN']]" -v word -o "," -v "following-sibling::token[1]/word" -n /tmp/tok.xml

which gives

Newt,Gingrich
Garry,Trudeau

XmlStarlet will also compile that command line to XSLT; the relevant part is:

  <xsl:for-each select="//token[entityType='PROPER_NOUN' and following-sibling::token[1][entityType='PROPER_NOUN']]">
    <xsl:value-of select="word"/>
    <xsl:value-of select="','"/>
    <xsl:value-of select="following-sibling::token[1]/word"/>
    <xsl:value-of select="'&#10;'"/>
  </xsl:for-each>

With Nokogiri, it might look something like this:

require 'nokogiri'

# parse the document
doc = Nokogiri::XML(the_document_string)

# select all tokens that start a 2-word pair
pair_starts = doc.xpath '//token[entityType = "PROPER_NOUN" and following-sibling::token[1][entityType = "PROPER_NOUN"]]'

# extract each word and the following one
result = pair_starts.map do |node|
  [node.at_xpath('word').text, node.at_xpath('following-sibling::token[1]/word').text]
end