How do I get img/src or a/href using Html Agility Pack?

Question

11

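I want to use the HTML Agility Pack to parse image and href links out of an HTML page, but I don't know much about XML or XPath. I have looked through help documents on many websites and still can't solve the problem. I'm using C# in Visual Studio 2005. I'm not fluent in English, so I would sincerely thank anyone who can write some useful code.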

- iShow

Also, can Html Agility Pack resolve relative paths? - iShow

6 Answers

7

This example and the accepted answer are wrong. They don't compile with the latest version. I tried something else:

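    // Requires: using System.Collections.Generic; using System.Linq; using HtmlAgilityPack;
    private List<string> ParseLinks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        // SelectNodes returns null (not an empty collection) when nothing matches.
        var nodes = doc.DocumentNode.SelectNodes("//a[@href]");
        // Collect only the href value of each link, not every attribute on the element.
        return nodes == null
            ? new List<string>()
            : nodes.Select(a => a.GetAttributeValue("href", "")).ToList();
    }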

This works for me.

- SmallChess

2

Maybe I'm late to the party, but the following worked for me:

// MainImageNode is an HtmlNode for the <img>; the result is null if it has no src attribute.
var MainImageString = MainImageNode.Attributes.FirstOrDefault(i => i.Name == "src")?.Value;

- Abhay Shiro

2

var htmlDoc = new HtmlDocument();
htmlDoc.LoadHtml(html);

string name = htmlDoc.DocumentNode
    .SelectNodes("//td/input")
    .First()
    .Attributes["value"].Value;

Source: https://html-agility-pack.net/select-nodes

- DIGITALCRIMINAL

0

A bit late, but here is a 2021 update of the accepted answer (it accounts for the refactoring HtmlAgilityPack has done since).

    var htmlDoc = new HtmlDocument();
    htmlDoc.LoadHtml(html);

    // The XPath below gets images.
    // It is specific to a site. Yours will vary ...
    string command = "//a[contains(concat(' ', @class, ' '), 'product-card')]//img";
    List<string> listImages = new();
    // SelectNodes returns null when nothing matches, so guard before iterating.
    var nodes = htmlDoc.DocumentNode.SelectNodes(command);
    if (nodes != null)
    {
        foreach (HtmlNode node in nodes)
        {
            // Using "data-src" below, but it may be "src" for you
            listImages.Add(node.Attributes["data-src"].Value);
        }
    }

- mike g

0

You also need to take into account the document base URL element (<base>) and protocol-relative URLs (for example //www.foo.com/bar/).

For more information, see:

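The "<base>: The Document Base URL element" page on MDN
Paul Irish's article "The protocol-relative URL"
The "What are the recommendations for html <base> tag?" discussion on StackOverflow
The "Uri Constructor (Uri, Uri)" page on MSDN
The "Uri class does not handle the protocol-relative URL" discussion on StackOverflow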
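
As a rough sketch of the first point, the effective base URL could be worked out by looking for a `<base href>` before resolving anything else. This assumes a recent HtmlAgilityPack release; `EffectiveBase`, the sample markup, and the example.com URLs are hypothetical and only there for illustration.

```csharp
using System;
using HtmlAgilityPack;

static class BaseUrlSketch
{
    // Hypothetical helper: pageUri is the address the HTML was downloaded from.
    public static Uri EffectiveBase(HtmlDocument doc, Uri pageUri)
    {
        // SelectSingleNode returns null when the page has no <base href="...">.
        HtmlNode baseNode = doc.DocumentNode.SelectSingleNode("//base[@href]");
        if (baseNode == null) return pageUri;

        string href = baseNode.GetAttributeValue("href", "");
        // The <base> href may itself be relative, so resolve it against the page URL.
        Uri resolved;
        return Uri.TryCreate(pageUri, href, out resolved) ? resolved : pageUri;
    }

    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<html><head><base href='/static/'></head>"
                   + "<body><img src='logo.png'></body></html>");

        Uri baseUri = EffectiveBase(doc, new Uri("http://example.com/shop/index.html"));
        // Relative src/href values are then resolved against baseUri, not the page URL.
        Console.WriteLine(new Uri(baseUri, "logo.png"));
    }
}
```

Falling back to the page URL when no `<base>` is present keeps the helper usable on ordinary pages.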

- Leonid Vasilev

- Marc Gravell · Accepted Answer

The first example on the home page does something very similar, but consider:

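 HtmlDocument doc = new HtmlDocument();
 doc.Load("file.htm"); // would need doc.LoadHtml(htmlSource) if it is not a file
 // Current HtmlAgilityPack releases expose the root as doc.DocumentNode.
 foreach(HtmlNode link in doc.DocumentElement.SelectNodes("//a[@href]"))
 {
    // Current releases read the attribute via link.Attributes["href"].Value.
    string href = link["href"].Value;
    // store href somewhere
 }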

You can imagine that for img@src, you just replace each a with img, and href with src.
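
Spelled out, the img/src variant might look like the sketch below. It assumes a recent HtmlAgilityPack release (where the root is `DocumentNode`); the `ImageSources` helper and the sample markup are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

static class ImgSrcSketch
{
    // Hypothetical helper: returns the src value of every <img> that has one.
    public static List<string> ImageSources(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var result = new List<string>();
        // SelectNodes returns null rather than an empty collection when nothing matches.
        var nodes = doc.DocumentNode.SelectNodes("//img[@src]");
        if (nodes == null) return result;

        foreach (HtmlNode img in nodes)
        {
            result.Add(img.GetAttributeValue("src", ""));
        }
        return result;
    }

    static void Main()
    {
        // The second <img> has no src attribute, so the XPath predicate skips it.
        var html = "<p><img src='a.png'><img alt='no source'><img src='/b.jpg'></p>";
        foreach (var src in ImageSources(html))
            Console.WriteLine(src);
    }
}
```

Swapping `//img[@src]` for `//a[@href]` and `"src"` for `"href"` gives the link version.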

You might even be able to simplify to:

 foreach(HtmlNode node in doc.DocumentElement
              .SelectNodes("//a/@href | //img/@src"))
 {
    list.Add(node.Value);
 }

To deal with relative URLs, look at the Uri class.
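
A minimal sketch of that step, using only `System.Uri`: `Uri.TryCreate(baseUri, relative, out result)` resolves an href or src value against the address of the page it came from. The `Resolve` helper and the example.com URLs are illustrative assumptions, not part of the original answer.

```csharp
using System;

static class RelativeUrlSketch
{
    // Hypothetical helper: returns null when the value cannot be resolved.
    public static Uri Resolve(Uri pageUri, string hrefOrSrc)
    {
        Uri result;
        return Uri.TryCreate(pageUri, hrefOrSrc, out result) ? result : null;
    }

    static void Main()
    {
        var page = new Uri("http://example.com/dir/page.html");

        Console.WriteLine(Resolve(page, "images/a.png")); // http://example.com/dir/images/a.png
        Console.WriteLine(Resolve(page, "/root.png"));    // http://example.com/root.png
        Console.WriteLine(Resolve(page, "../up.png"));    // http://example.com/up.png
        // A value that is already absolute is returned as-is.
        Console.WriteLine(Resolve(page, "http://other.example/x.png"));
    }
}
```

Using `TryCreate` rather than the constructor avoids an exception on malformed attribute values, which are common in scraped HTML.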