Getting the content between two HTML tags with Html Agility Pack

Question



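We have a very large help document that was created in Word, and it was used to generate an even larger and more unwieldy HTML document. Using C# and this library, I want to fetch and display only one section of that file in my application. The sections look like this: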

<!--logical section starts here -->
<div>
<h1><span style='mso-spacerun:yes'></span><a name="_Toc325456104">Section A</a></h1>
</div>
 <div> Lots of unnecessary markup for simple formatting... </div>
 .....
<!--logical section ends here -->

<div>
<h1><span style='mso-spacerun:yes'></span><a name="_Toc325456104">Section B</a></h1>
</div>

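Logically, there is an h1 tag with the section name inside an a tag. I want to select everything starting from the outer containing div up until I encounter another h1, excluding the div that wraps that next h1.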

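Each section name sits in an <a> tag under an h1, and that heading has several child elements (about 6 each)
The logical section is marked with comments
These comments do not exist in the actual document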

My attempt:

var startNode = helpDocument.DocumentNode.SelectSingleNode("//h1/a[contains(., '"+sectionName+"')]");
//go up one level from the a node to the h1 element
startNode=startNode.ParentNode;

//get the start index as the index of the div containing the h1 element
int startNodeIndex = startNode.ParentNode.ChildNodes.IndexOf(startNode);

//here I am not sure how to get the endNode location. 
var endNode =?;

int endNodeIndex = endNode.ParentNode.ChildNodes.IndexOf(endNode);

//select everything from the start index to the end index
var nodes = startNode.ParentNode.ChildNodes.Where((n, index) => index >= startNodeIndex && index <= endNodeIndex).Select(n => n);

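Since I couldn't find any documentation on this, I don't know how to get from the start node to the next h1 element. Any suggestions would be appreciated. Thanks.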

- Rondel

2 Answers


So what you really want as a result is the div around the h1 tag? If so, this should work.

helpDocument.DocumentNode.SelectSingleNode("//h1/a[contains(@name, '"+sectionName+"')]/ancestor::div");

Depending on your HTML, this can also be used with SelectNodes, like so:

helpDocument.DocumentNode.SelectNodes("//h1/a[starts-with(@name,'_Toc')]/ancestor::div");

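Oh, while testing this I noticed that what didn't work for me was the dot in the contains function; once I changed it to the name attribute, everything worked.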
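For illustration, here is a minimal sketch of the SelectNodes variant run against a cut-down version of the question's markup. The sample HTML, the anchor names `_Toc1` and `_Toc2`, and the class name are assumptions made up for this example. The `[1]` after `ancestor::div` is also an addition: it should restrict the match to the nearest enclosing div in case the real document nests divs.

```csharp
using System;
using HtmlAgilityPack;

class HeadingDivDemo
{
    static void Main()
    {
        // Hypothetical sample modelled on the markup in the question.
        const string html =
            "<div><h1><a name=\"_Toc1\">Section A</a></h1></div>" +
            "<div>Body of A</div>" +
            "<div><h1><a name=\"_Toc2\">Section B</a></h1></div>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Select the nearest div around every heading anchor whose name starts with "_Toc".
        HtmlNodeCollection headingDivs = doc.DocumentNode.SelectNodes(
            "//h1/a[starts-with(@name,'_Toc')]/ancestor::div[1]");

        // SelectNodes returns null (not an empty collection) when nothing matches.
        if (headingDivs == null)
        {
            Console.WriteLine("No section headings found.");
            return;
        }

        foreach (HtmlNode div in headingDivs)
            Console.WriteLine(div.InnerText.Trim());
    }
}
```

This only locates the heading divs; it does not collect the content that follows each heading.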

- shriek

Not quite. I want the div around the h1 tag, but I also want all of the following divs/spans up to the div containing the next h1 tag. Thanks. - Rondel


- Jacob Proffitt · Accepted Answer

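I think this will do it, though it assumes that H1 tags appear only in section headings. If that's not the case, you can add a Where on the descendants to apply further filters to any H1 nodes it finds. Note that this will include all siblings of the div it finds, up until it comes across the next one with a section name.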

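using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

private List<HtmlNode> GetSection(HtmlDocument helpDocument, string sectionName)
{
    // Find the div whose text is exactly the section name, i.e. the div wrapping the h1.
    // Trim() guards against whitespace around the heading inside the div.
    HtmlNode startNode = helpDocument.DocumentNode.Descendants("div")
        .FirstOrDefault(d => d.InnerText.Trim().Equals(sectionName, StringComparison.InvariantCultureIgnoreCase));
    if (startNode == null)
        return null; // section not found

    // Collect the siblings after the heading div, stopping at the next node that contains an h1.
    // The heading div itself is not added to the list.
    List<HtmlNode> section = new List<HtmlNode>();
    HtmlNode sibling = startNode.NextSibling;
    while (sibling != null && !sibling.Descendants("h1").Any())
    {
        section.Add(sibling);
        sibling = sibling.NextSibling;
    }

    return section;
}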
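Since the question also asks for the div around the heading itself, here is a self-contained sketch of a variant that keeps that div and returns the section as one HTML string. The helper name `GetSectionHtml`, the sample markup, and the choice to match on the h1 text rather than on the whole div are assumptions for illustration, not part of the answer above. Like the answer, it assumes h1 appears only in section headings.

```csharp
using System;
using System.Linq;
using System.Text;
using HtmlAgilityPack;

static class SectionDemo
{
    // Hypothetical helper: returns the heading div plus every following sibling,
    // up to (but not including) the next sibling that contains an h1.
    static string GetSectionHtml(HtmlDocument doc, string sectionName)
    {
        // Find the div whose direct h1 child has the requested text.
        HtmlNode start = doc.DocumentNode.Descendants("div").FirstOrDefault(d =>
            d.Elements("h1").Any(h => HtmlEntity.DeEntitize(h.InnerText).Trim()
                .Equals(sectionName, StringComparison.OrdinalIgnoreCase)));
        if (start == null)
            return null; // section not found

        // Start with the heading div itself, then append siblings until the next heading.
        var html = new StringBuilder(start.OuterHtml);
        for (HtmlNode n = start.NextSibling;
             n != null && !n.Descendants("h1").Any();
             n = n.NextSibling)
        {
            html.Append(n.OuterHtml);
        }
        return html.ToString();
    }

    static void Main()
    {
        // Hypothetical sample modelled on the markup in the question.
        const string page =
            "<div><h1><a name=\"_Toc1\">Section A</a></h1></div>" +
            "<div>Body of A</div><p>More of A</p>" +
            "<div><h1><a name=\"_Toc2\">Section B</a></h1></div>" +
            "<div>Body of B</div>";

        var doc = new HtmlDocument();
        doc.LoadHtml(page);

        Console.WriteLine(GetSectionHtml(doc, "Section A") ?? "section not found");
    }
}
```

Returning a string rather than a node list is a convenience for display, for example when feeding the result to a browser control. If the nodes are needed for further processing, collecting them in a list as in the answer is the better fit.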