Get all text from HTML with Html Agility Pack

40

Input

<html><body><p>foo <a href='http://www.example.com'>bar</a> baz</p></body></html>

Output

foo
bar
baz

I know about htmldoc.DocumentNode.InnerText, but it gives foobarbaz. I want to get each piece of text separately, not all of it at once.

8 Answers

73

XPath is your friend :)

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(@"<html><body><p>foo <a href='http://www.example.com'>bar</a> baz</p></body></html>");

foreach(HtmlNode node in doc.DocumentNode.SelectNodes("//text()"))
{
    Console.WriteLine("text=" + node.InnerText);
}

This worked very well for me. It coped with everything I threw at it, even badly formed HTML fragments generated by an old CMS. - Chris
4
Nice. Here is a small modification that handles the case where there is no text (and so avoids a runtime exception): HtmlNodeCollection textNodes = doc.DocumentNode.SelectNodes("//text()"); if (textNodes != null) foreach (HtmlNode node in textNodes) result += node.InnerText; - Josh
@Josh Exactly what I needed. - richard
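
Josh's comment can be written out as a small helper. This is only a minimal sketch: the class name TextExtractor and the method name GetTextLines are made up for illustration, and trimming and skipping whitespace-only nodes is an added assumption, not part of the original answer.

```csharp
using System.Text;
using HtmlAgilityPack;

public static class TextExtractor // hypothetical name, for illustration only
{
    // Returns the content of every text node, one per line.
    public static string GetTextLines(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches,
        // so check before iterating.
        HtmlNodeCollection textNodes = doc.DocumentNode.SelectNodes("//text()");
        if (textNodes == null)
            return string.Empty;

        var sb = new StringBuilder();
        foreach (HtmlNode node in textNodes)
        {
            // Assumption: whitespace-only text nodes are not wanted in the output.
            string text = node.InnerText.Trim();
            if (text.Length > 0)
                sb.AppendLine(text);
        }
        return sb.ToString();
    }
}
```

Called with the HTML from the question, this should yield foo, bar and baz on separate lines; called with markup that contains no text nodes, it returns an empty string instead of throwing.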

13
var root = doc.DocumentNode;
var sb = new StringBuilder();
foreach (var node in root.DescendantNodesAndSelf())
{
    if (!node.HasChildNodes)
    {
        string text = node.InnerText;
        if (!string.IsNullOrEmpty(text))
            sb.AppendLine(text.Trim());
    }
}

This does what you need, but I am not sure it is the best way. You might get better performance by iterating over something other than DescendantNodesAndSelf.


13

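I needed a solution that extracts all the text content but discards the contents of script and style tags. I could not find one anywhere, so I came up with the following, which suits my needs: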

StringBuilder sb = new StringBuilder();
IEnumerable<HtmlNode> nodes = doc.DocumentNode.Descendants().Where( n => 
    n.NodeType == HtmlNodeType.Text &&
    n.ParentNode.Name != "script" &&
    n.ParentNode.Name != "style");
foreach (HtmlNode node in nodes) {
    Console.WriteLine(node.InnerText);
}

Love this solution, it also strips out CSS and scripts :-) - Joop Stringer

11
var pageContent = "{html content goes here}";
var pageDoc = new HtmlDocument();
pageDoc.LoadHtml(pageContent);
var pageText = pageDoc.DocumentNode.InnerText;

Given this sample HTML content:

<html><body><p>foo <a href='http://www.example.com'>bar</a> baz</p></body></html>

it produces the following output:

foo bar baz

1
This makes CSS part of the page text, which is not what I wanted in my case. - sobelito

5
public string html2text(string html) {
    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(@"<html><body>" + html + "</body></html>");
    return doc.DocumentNode.SelectSingleNode("//body").InnerText;
}

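This solution is based on Html Agility Pack. You can also install it via NuGet (package name: HtmlAgilityPack).

If your HTML parameter contains <b> tags, it turns the HTML into a newline (\n) when converting it to text, which is not right. - jnoreiga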

0

I just modified and fixed some of the other answers to make them work better:

var document = new HtmlDocument();
document.LoadHtml(result);
var sb = new StringBuilder();
foreach (var node in document.DocumentNode.DescendantsAndSelf())
{
    // Keep only leaf text nodes that are not inside <script> or <style>
    if (!node.HasChildNodes && node.Name == "#text" && node.ParentNode.Name != "script" && node.ParentNode.Name != "style")
    {
        string text = node.InnerText?.Trim();
        // Skip empty text and stray markup; IsNullOrEmpty also guards against a null result from ?.Trim()
        if (!string.IsNullOrEmpty(text) && !text.StartsWith("<") && !text.EndsWith(">"))
            sb.AppendLine(System.Web.HttpUtility.HtmlDecode(text));
    }
}

0

https://github.com/jamietre/CsQuery

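Have you tried CsQuery? Although it is no longer actively maintained, it is still my favorite tool for parsing HTML to text. Here is a simple one-liner that shows how to get the text from HTML.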

var text = CQ.CreateDocument(htmlText).Text();

Here is a complete console application:

using System;
using CsQuery;

public class Program
{
    public static void Main()
    {
        var html = "<div><h1>Hello World <p> some text inside h1 tag under p tag </p> </h1></div>";
        var text = CQ.CreateDocument(html).Text();
        Console.WriteLine(text); // Output: Hello World  some text inside h1 tag under p tag

    }
}

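I know the OP asked only about HtmlAgilityPack, but I have found CsQuery to be one of the best lesser-known alternatives, and I wanted to share it with anyone who might find it useful. Cheers!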


0
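Maybe something like the following (I found a very basic version on Google and extended it to handle hyperlinks, unordered lists, ordered lists, divs and tables):
    /// <summary>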
    /// Static class that provides functions to convert HTML to plain text.
    /// </summary>
    public static class HtmlToText {

        #region Method: ConvertFromFile (public - static)
        /// <summary>
        /// Converts the HTML content from a given file path to plain text.
        /// </summary>
        /// <param name="path">The path to the HTML file.</param>
        /// <returns>The plain text version of the HTML content.</returns>
        public static string ConvertFromFile(string path) {
            var doc = new HtmlDocument();

            // Load the HTML file
            doc.Load(path);

            using (var sw = new StringWriter()) {
                // Convert the HTML document to plain text
                ConvertTo(node: doc.DocumentNode,
                          outText: sw,
                          counters: new Dictionary<HtmlNode, int>());
                sw.Flush();
                return sw.ToString();
            }
        }
        #endregion

        #region Method: ConvertFromString (public - static)
        /// <summary>
        /// Converts the given HTML string to plain text.
        /// </summary>
        /// <param name="html">The HTML content as a string.</param>
        /// <returns>The plain text version of the HTML content.</returns>
        public static string ConvertFromString(string html) {
            var doc = new HtmlDocument();

            // Load the HTML string
            doc.LoadHtml(html);

            using (var sw = new StringWriter()) {
                // Convert the HTML string to plain text
                ConvertTo(node: doc.DocumentNode,
                          outText: sw,
                          counters: new Dictionary<HtmlNode, int>());
                sw.Flush();
                return sw.ToString();
            }
        }
        #endregion

        #region Method: ConvertContentTo (private - static)
        /// <summary>
        /// Helper method to convert each child node of the given node to text.
        /// </summary>
        /// <param name="node">The HTML node to convert.</param>
        /// <param name="outText">The writer to output the text to.</param>
        /// <param name="counters">Keep track of the ol/li counters during conversion</param>
        private static void ConvertContentTo(HtmlNode node, TextWriter outText, Dictionary<HtmlNode, int> counters) {
            // Convert each child node to text
            foreach (var subnode in node.ChildNodes) {
                ConvertTo(subnode, outText, counters);
            }
        }
        #endregion

        #region Method: ConvertTo (public - static)
        /// <summary>
        /// Converts the given HTML node to plain text.
        /// </summary>
        /// <param name="node">The HTML node to convert.</param>
        /// <param name="outText">The writer to output the text to.</param>
        /// <param name="counters">Keep track of the ol/li counters during conversion</param>
        public static void ConvertTo(HtmlNode node, TextWriter outText, Dictionary<HtmlNode, int> counters) {
            string html;

            switch (node.NodeType) {
                case HtmlNodeType.Comment:
                    // Don't output comments
                    break;
                case HtmlNodeType.Document:
                    // Convert entire content of document node to text
                    ConvertContentTo(node, outText, counters);
                    break;
                case HtmlNodeType.Text:
                    // Ignore script and style nodes
                    var parentName = node.ParentNode.Name;
                    if ((parentName == "script") || (parentName == "style")) {
                        break;
                    }

                    // Get text from the text node
                    html = ((HtmlTextNode)node).Text;

                    // Ignore special closing nodes output as text
                    if (HtmlNode.IsOverlappedClosingElement(html) || string.IsNullOrWhiteSpace(html)) {
                        break;
                    }

                    // Write meaningful text (not just white-spaces) to the output
                    outText.Write(HtmlEntity.DeEntitize(html));
                    break;
                case HtmlNodeType.Element:
                    switch (node.Name.ToLowerInvariant()) {
                        case "p":
                        case "div":
                        case "br":
                        case "table":
                            // Treat paragraphs, divs, line breaks and tables as new lines
                            outText.Write("\n");
                            break;
                        case "li":
                            // Treat list items as numbered lines inside <ol>, dash-prefixed lines otherwise
                            if (node.ParentNode.Name == "ol") {
                                if (!counters.ContainsKey(node.ParentNode)) {
                                    counters[node.ParentNode] = 0;
                                }
                                counters[node.ParentNode]++;
                                outText.Write("\n" + counters[node.ParentNode] + ". ");
                            } else {
                                outText.Write("\n- ");
                            }
                            break;
                        case "a":
                            // convert hyperlinks to include the URL in parenthesis
                            if (node.HasChildNodes) {
                                ConvertContentTo(node, outText, counters);
                            }
                            if (node.Attributes["href"] != null) {
                                outText.Write($" ({node.Attributes["href"].Value})");
                            }
                            break;
                        case "th":
                        case "td":
                            outText.Write(" | ");
                            break;
                    }

                    // Convert child nodes to text if they exist (ignore a href children as they are already handled)
                    if (node.Name.ToLowerInvariant() != "a" && node.HasChildNodes) {
                        ConvertContentTo(node: node,
                                         outText: outText,
                                         counters: counters);
                    }
                    break;
            }
        }
        #endregion

    } // class: HtmlToText 
