How do I parse this HTML document to get the content I want?

Question

4

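I'm currently trying to parse an HTML document to retrieve all of the footnotes in it; the document contains a great many of them. I can't work out which expressions to use to pull out everything I want. The catch is that the class names (e.g. "calibre34") are randomized in every document. The only way to locate a footnote is to search for "hide": the footnote text always follows it and is closed by a </td> tag. Below is an example of one footnote from the HTML document; I only want the text. Any ideas? Thanks, everyone!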

<td class="calibre33">1.<span><a class="x-xref" href="javascript:void(0);">
[hide]</a></span></td>
<td class="calibre34">
Among the other factors on which the premium would be based are the
average size of the losses experienced, a margin for contingencies,
a loading to cover the insurer's expenses, a margin for profit or
addition to the insurer's surplus, and perhaps the investment
earnings the insurer could realize from the time the premiums are
collected until the losses must be paid.</td>

- JMarsh

3

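Parse it with what? I hope you don't mean regular expressions... Please tag your post with the language you are using to parse the HTML, otherwise nobody will be able to help you. - qJake

Could you look for the a tags with the x-xref class and get the nearest td parent? - Peter Olson

Use XDocument (LINQ to XML) or XmlDocument (POCO) to parse your HTML. Both XML libraries ship with .NET/C# and are very powerful. - qJake

What about the number of td elements? Is it the same each time? I mean, is the footnote always in the same td element? If you want to parse in Java, you could go with the Jericho HTML Parser. - Pradeep

All of the footnotes are in td tags, but so is a lot of other content. These HTML documents are huge, full of content and tags, and they are very badly written. My job is to extract the footnotes, but I don't want to sit there copying and pasting for 30 years. Also, thanks SpikeX, I'll take a look. - JMarsh

One more trick: if all of the content comes after [hide], you could look for content longer than some threshold (a footnote will be longer than, say, 50 characters), and when measuring the length, make sure it contains no '<' or '>' or any other HTML markup. - Pradeep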

2 Answers

3

How about using MSHTML to parse the HTML source? Here is some demo code, enjoy.

using System;
using System.Collections.Generic;
using System.Net;
using mshtml; // COM reference: Microsoft HTML Object Library

public class CHtmlPraseDemo
{
    private string strHtmlSource;
    public IHTMLDocument2 oHtmlDoc;

    public CHtmlPraseDemo(string url)
    {
        // Download the page, then feed its markup to an in-memory MSHTML document.
        GetWebContent(url);
        oHtmlDoc = (IHTMLDocument2)new HTMLDocument();
        oHtmlDoc.write(strHtmlSource);
        oHtmlDoc.close(); // signal that writing is finished
    }

    // Returns the inner HTML of every td whose class equals TdClassName.
    public List<string> GetTdNodes(string TdClassName)
    {
        List<string> listOut = new List<string>();
        IHTMLElement2 ie = (IHTMLElement2)oHtmlDoc.body;
        IHTMLElementCollection iec = ie.getElementsByTagName("td");
        foreach (IHTMLElement item in iec)
        {
            if (item.className == TdClassName)
            {
                listOut.Add(item.innerHTML);
            }
        }
        return listOut;
    }

    // Fetches the raw HTML of the page at strUrl.
    void GetWebContent(string strUrl)
    {
        using (WebClient wc = new WebClient())
        {
            strHtmlSource = wc.DownloadString(strUrl);
        }
    }
}

class Program
{
    static void Main(string[] args)
    {
        CHtmlPraseDemo oH = new CHtmlPraseDemo("http://stackoverflow.com/faq");

        Console.Write(oH.oHtmlDoc.title);
        // "x" is a placeholder class name; substitute the class you are after.
        List<string> l = oH.GetTdNodes("x");
        foreach (string n in l)
        {
            Console.WriteLine("new td");
            Console.WriteLine(n);
        }

        Console.Read();
    }
}

- Chachi

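I've found mshtml to be terrible. Any self-closing tag, such as <br />, completely wrecks your parsing attempt. I'm currently looking for a new way to parse. - JSON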

- Marcel N. · Accepted Answer

Load the HTML document with HtmlAgilityPack, then use the following XPath to extract the footnotes:

//td[text()='[hide]']/following-sibling::td

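Basically, it first selects all td nodes containing [hide], then moves to and selects their following sibling, that is, the next td. Once you have the collection of these nodes, you can extract their inner text (in C#, using the support HtmlAgilityPack provides).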
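A minimal C# sketch of this approach, assuming the HtmlAgilityPack NuGet package is referenced. The names `FootnoteExtractor` and `ExtractFootnotes` are hypothetical and not part of the original answer. Note also that in the question's sample the [hide] text sits inside a nested a element and is preceded by a line break, so an exact `text()='[hide]'` test on the td may not match it. The sketch therefore uses a looser variant of the XPath (an assumption on my part, not the answer's expression): it looks for a td containing a link whose text includes [hide], and takes only the first td that follows it.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // NuGet package "HtmlAgilityPack"

// Hypothetical helper class, for illustration only.
public static class FootnoteExtractor
{
    public static List<string> ExtractFootnotes(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Find every td that contains a link whose text includes "[hide]",
        // then step to the first td sibling that follows it.
        const string xpath =
            "//td[.//a[contains(., '[hide]')]]/following-sibling::td[1]";

        var results = new List<string>();

        // SelectNodes returns null, not an empty collection, when nothing matches.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xpath);
        if (nodes == null)
        {
            return results;
        }

        foreach (HtmlNode td in nodes)
        {
            // InnerText drops the markup; DeEntitize decodes entities such as &amp;.
            results.Add(HtmlEntity.DeEntitize(td.InnerText).Trim());
        }
        return results;
    }
}

public static class Program
{
    public static void Main()
    {
        // Shortened stand-in for the footnote markup shown in the question.
        string html = @"<table><tr>
<td class=""calibre33"">1.<span><a class=""x-xref"" href=""javascript:void(0);"">
[hide]</a></span></td>
<td class=""calibre34"">
Footnote text goes here.</td>
</tr></table>";

        foreach (string note in FootnoteExtractor.ExtractFootnotes(html))
        {
            Console.WriteLine(note);
        }
    }
}
```

Because the expression never mentions a class name, the randomized "calibre" classes do not matter. `Trim()` only strips leading and trailing whitespace, so line breaks inside a footnote are kept and may need collapsing separately.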