Parsing HTML with Agility Pack

Question

3

I have some HTML that I need to parse (shown below):

<div id="mailbox" class="div-w div-m-0">
    <h2 class="h-line">InBox</h2>
    <div id="mailbox-table">
        <table id="maillist">
            <tr>
                <th>From</th>
                <th>Subject</th>
                <th>Date</th>
            </tr>
            <tr onclick="location='readmail.html?mid=welcome'" style="font-weight: bold;">
                <td>no-reply@somemail.net</td>
                <td>
                    <a href="readmail.html?mid=welcome">Hi, Welcome</a>
                </td>
                <td>
                    <span title="2016-02-16 13:23:50 UTC">just now</span>
                </td>
            </tr>
            <tr onclick="location='readmail.html?mid=T0wM6P'" style="font-weight: bold;">
                <td>someone@outlook.com</td>
                <td>
                    <a href="readmail.html?mid=T0wM6P">sa</a>
                </td>
                <td>
                    <span title="2016-02-16 13:24:04">just now</span>
                </td>
            </tr>
        </table>
    </div>
</div>

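I need to parse the links in the <tr onclick= tags and the email addresses in the <td> tags.

So far I have only managed to get the first occurrence of the email/link in the HTML.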

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(responseFromServer);

Can someone show me how to do this properly? Basically, I want to extract all the email addresses and links from those tags.

foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//tr[@onclick]"))
{
    HtmlAttribute att = link.Attributes["onclick"];
    Console.WriteLine(att.Value);
}

Edit: I need to store the parsed values pairwise in a list of class instances: the link to the email and the sender's email address.

public class ClassMailBox
{
    public string From { get; set; } 
    public string LinkToMail { get; set; }    

}

- Tagyoureit

I have tried HtmlAgilityPack too, but its XPath support is not very good. - Fab

Have you tried using the CssPath feature? - Fab

1

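@Tagyoureit I tried your code and it printed both tr items: location='readmail.html?mid=welcome' and location='readmail.html?mid=T0wM6P'. I am using .NET 4.5 and HtmlAgilityPack 1.4.9. Could you please check whether the HTML in the responseFromServer variable is complete? Thanks - avenet

Yes, you are right, I was parsing stale HTML. The next question is how to get the sender's email address? - Tagyoureit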

1

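OK, I managed to get the email by creating a second XPath that selects the first td child. Would you like a single XPath query that covers both the tr and the td, or would you prefer separate XPaths for the tr and the td? I would suggest the latter. - avenet

I think the second option would be easier to maintain. Please post your solution so that I can mark the question as answered. - Tagyoureit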

1 Answer

- avenet · Accepted Answer

You can write the following code:

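HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(responseFromServer);

// One entry per clickable row: the link and the sender's address.
// Requires System.Collections.Generic for List<T>.
List<ClassMailBox> classMailBoxes = new List<ClassMailBox>();

// First pass: store the onclick value of every <tr onclick="...">.
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//tr[@onclick]"))
{
    HtmlAttribute att = link.Attributes["onclick"];
    ClassMailBox classMailbox = new ClassMailBox() { LinkToMail = att.Value };
    classMailBoxes.Add(classMailbox);
}

int currentPosition = 0;

// Second pass: the first <td> of each of those rows holds the sender's address.
// Entries are paired by position, so both passes must visit the rows in the same order.
foreach (HtmlNode tableDef in doc.DocumentNode.SelectNodes("//tr[@onclick]/td[1]"))
{
    classMailBoxes[currentPosition].From = tableDef.InnerText;
    currentPosition++;
}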

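To keep this code simple, I have assumed the following:

The email address is always in the first td of a tr that has the onclick attribute
Every tr that has the onclick attribute contains an email address

If these conditions are not met, this code will not work correctly: it may throw exceptions (an out-of-range index, for example) or pair a link with the wrong email address.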
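
If those assumptions are a concern, one possible variation is to read both values from the same row in a single pass, so that a link and a sender can never drift out of step. The following is only a sketch, not part of the original solution: the MailboxParser class, the Parse method and the ExtractUrl helper are names made up for illustration, and the regular expression assumes the onclick value always has the form location='...' as in the sample HTML. It also guards against SelectNodes returning null when nothing matches.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

public class ClassMailBox
{
    public string From { get; set; }
    public string LinkToMail { get; set; }
}

// Hypothetical helper class, introduced only for this sketch.
public static class MailboxParser
{
    // Captures the quoted URL in an onclick value such as
    // location='readmail.html?mid=welcome' (assumed format).
    private static readonly Regex LocationPattern =
        new Regex(@"location\s*=\s*['""](?<url>[^'""]+)['""]");

    // Returns the URL when the pattern matches, otherwise the raw value.
    public static string ExtractUrl(string onclick)
    {
        Match match = LocationPattern.Match(onclick);
        return match.Success ? match.Groups["url"].Value : onclick;
    }

    public static List<ClassMailBox> Parse(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var result = new List<ClassMailBox>();

        // SelectNodes returns null, not an empty collection, when nothing matches.
        HtmlNodeCollection rows = doc.DocumentNode.SelectNodes("//tr[@onclick]");
        if (rows == null)
        {
            return result;
        }

        foreach (HtmlNode row in rows)
        {
            // Relative XPath: the first td child of this row only.
            HtmlNode firstCell = row.SelectSingleNode("./td[1]");
            if (firstCell == null)
            {
                continue; // skip rows without cells instead of mispairing them
            }

            result.Add(new ClassMailBox
            {
                From = HtmlEntity.DeEntitize(firstCell.InnerText).Trim(),
                LinkToMail = ExtractUrl(row.GetAttributeValue("onclick", string.Empty))
            });
        }

        return result;
    }
}
```

Calling MailboxParser.Parse(responseFromServer) on the HTML from the question should give two entries, one per clickable row, each holding the sender address from the first cell and the readmail.html?mid=... URL taken from that same row's onclick attribute.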