Input
<html><body><p>foo <a href='http://www.example.com'>bar</a> baz</p></body></html>
Output
foo
bar
baz
I know about htmldoc.DocumentNode.InnerText, but it gives foobarbaz. I want to get each piece of text one at a time, not all of it at once.
XPath is your friend :)
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(@"<html><body><p>foo <a href='http://www.example.com'>bar</a> baz</p></body></html>");
foreach(HtmlNode node in doc.DocumentNode.SelectNodes("//text()"))
{
Console.WriteLine("text=" + node.InnerText);
}
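The loop above can be wrapped into a small helper that also tolerates documents without any text nodes. This is only a sketch: the class and method names (TextNodeExtractor, ExtractTextNodes) are hypothetical, and it assumes the HtmlAgilityPack behaviour where SelectNodes returns null rather than an empty collection when nothing matches. Trimming and skipping empty strings is an assumption about what the asker wants, not something the original answer does.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical helper, not part of HtmlAgilityPack.
public static class TextNodeExtractor
{
    // Returns the trimmed, entity-decoded content of every non-empty text node.
    public static List<string> ExtractTextNodes(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var result = new List<string>();

        // Assumption: SelectNodes yields null when the XPath matches nothing.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//text()");
        if (nodes == null)
        {
            return result;
        }

        foreach (HtmlNode node in nodes)
        {
            // Decode entities such as &amp; and drop surrounding whitespace.
            string text = HtmlEntity.DeEntitize(node.InnerText).Trim();
            if (text.Length > 0)
            {
                result.Add(text);
            }
        }
        return result;
    }
}
```

Calling ExtractTextNodes on the HTML from the question should give the three strings foo, bar and baz as separate list entries, which can then be printed one per line.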
var root = doc.DocumentNode;
var sb = new StringBuilder();
foreach (var node in root.DescendantNodesAndSelf())
{
if (!node.HasChildNodes)
{
string text = node.InnerText;
if (!string.IsNullOrEmpty(text))
sb.AppendLine(text.Trim());
}
}
This does what you need, but I'm not sure it is the best way. You might get better performance by iterating over something other than DescendantNodesAndSelf.
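I needed a solution that extracts all text content but discards the contents of script and style tags. I couldn't find one anywhere, so I came up with the following, which suits my needs: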
// Requires: using System.Linq; using HtmlAgilityPack;
// Select every text node whose parent is not a <script> or <style> element.
IEnumerable<HtmlNode> nodes = doc.DocumentNode.Descendants().Where( n =>
    n.NodeType == HtmlNodeType.Text &&
    n.ParentNode.Name != "script" &&
    n.ParentNode.Name != "style");
foreach (HtmlNode node in nodes) {
    Console.WriteLine(node.InnerText);
}
var pageContent = "{html content goes here}";
var pageDoc = new HtmlDocument();
pageDoc.LoadHtml(pageContent);
var pageText = pageDoc.DocumentNode.InnerText;
For example, the following HTML content:
<html><body><p>foo <a href='http://www.example.com'>bar</a> baz</p></body></html>
produces this output:
foo bar baz
public string html2text(string html) {
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(@"<html><body>" + html + "</body></html>");
return doc.DocumentNode.SelectSingleNode("//body").InnerText;
}
HtmlAgilityPack: I just modified and fixed some of the other answers so that they work better:
var document = new HtmlDocument();
document.LoadHtml(result);
var sb = new StringBuilder();
foreach (var node in document.DocumentNode.DescendantsAndSelf())
{
if (!node.HasChildNodes && node.Name == "#text" && node.ParentNode.Name != "script" && node.ParentNode.Name != "style")
{
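// Trim once; text is null when InnerText is null.
string text = node.InnerText?.Trim();
// Skip empty text and leftover markup fragments such as "<...>".
if (!string.IsNullOrEmpty(text) && !text.StartsWith('<') && !text.EndsWith('>'))
sb.AppendLine(System.Web.HttpUtility.HtmlDecode(text));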
}
}
https://github.com/jamietre/CsQuery
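Have you tried CsQuery? It is no longer actively maintained, but it is still my favorite tool for converting HTML to text. Here is a simple one-liner that gets the text out of HTML: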
var text = CQ.CreateDocument(htmlText).Text();
Here is a complete console application:
using System;
using CsQuery;
public class Program
{
public static void Main()
{
var html = "<div><h1>Hello World <p> some text inside h1 tag under p tag </p> </h1></div>";
var text = CQ.CreateDocument(html).Text();
Console.WriteLine(text); // Output: Hello World some text inside h1 tag under p tag
}
}
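I know the OP asked only about HtmlAgilityPack, but I find CsQuery to be one of the best lesser-known alternatives, and I wanted to share it with anyone who might find it useful. Cheers!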
using System.Collections.Generic;
using System.IO;
using HtmlAgilityPack;
/// <summary>
/// Static class that provides functions to convert HTML to plain text.
/// </summary>
public static class HtmlToText {
#region Method: ConvertFromFile (public - static)
/// <summary>
/// Converts the HTML content from a given file path to plain text.
/// </summary>
/// <param name="path">The path to the HTML file.</param>
/// <returns>The plain text version of the HTML content.</returns>
public static string ConvertFromFile(string path) {
var doc = new HtmlDocument();
// Load the HTML file
doc.Load(path);
using (var sw = new StringWriter()) {
// Convert the HTML document to plain text
ConvertTo(node: doc.DocumentNode,
outText: sw,
counters: new Dictionary<HtmlNode, int>());
sw.Flush();
return sw.ToString();
}
}
#endregion
#region Method: ConvertFromString (public - static)
/// <summary>
/// Converts the given HTML string to plain text.
/// </summary>
/// <param name="html">The HTML content as a string.</param>
/// <returns>The plain text version of the HTML content.</returns>
public static string ConvertFromString(string html) {
var doc = new HtmlDocument();
// Load the HTML string
doc.LoadHtml(html);
using (var sw = new StringWriter()) {
// Convert the HTML string to plain text
ConvertTo(node: doc.DocumentNode,
outText: sw,
counters: new Dictionary<HtmlNode, int>());
sw.Flush();
return sw.ToString();
}
}
#endregion
#region Method: ConvertContentTo (private - static)
/// <summary>
/// Helper method to convert each child node of the given node to text.
/// </summary>
/// <param name="node">The HTML node to convert.</param>
/// <param name="outText">The writer to output the text to.</param>
/// <param name="counters">Keep track of the ol/li counters during conversion</param>
private static void ConvertContentTo(HtmlNode node, TextWriter outText, Dictionary<HtmlNode, int> counters) {
// Convert each child node to text
foreach (var subnode in node.ChildNodes) {
ConvertTo(subnode, outText, counters);
}
}
#endregion
#region Method: ConvertTo (public - static)
/// <summary>
/// Converts the given HTML node to plain text.
/// </summary>
/// <param name="node">The HTML node to convert.</param>
/// <param name="outText">The writer to output the text to.</param>
/// <param name="counters">Keep track of the ol/li counters during conversion</param>
public static void ConvertTo(HtmlNode node, TextWriter outText, Dictionary<HtmlNode, int> counters) {
string html;
switch (node.NodeType) {
case HtmlNodeType.Comment:
// Don't output comments
break;
case HtmlNodeType.Document:
// Convert entire content of document node to text
ConvertContentTo(node, outText, counters);
break;
case HtmlNodeType.Text:
// Ignore script and style nodes
var parentName = node.ParentNode.Name;
if ((parentName == "script") || (parentName == "style")) {
break;
}
// Get text from the text node
html = ((HtmlTextNode)node).Text;
// Ignore special closing nodes output as text
if (HtmlNode.IsOverlappedClosingElement(html) || string.IsNullOrWhiteSpace(html)) {
break;
}
// Write meaningful text (not just white-spaces) to the output
outText.Write(HtmlEntity.DeEntitize(html));
break;
case HtmlNodeType.Element:
switch (node.Name.ToLowerInvariant()) {
case "p":
case "div":
case "br":
case "table":
// Start a new line for paragraphs, divs, line breaks and tables
outText.Write("\n");
break;
case "li":
// Number list items inside an <ol>; prefix all other list items with a dash
if (node.ParentNode.Name == "ol") {
if (!counters.ContainsKey(node.ParentNode)) {
counters[node.ParentNode] = 0;
}
counters[node.ParentNode]++;
outText.Write("\n" + counters[node.ParentNode] + ". ");
} else {
outText.Write("\n- ");
}
break;
case "a":
// convert hyperlinks to include the URL in parenthesis
if (node.HasChildNodes) {
ConvertContentTo(node, outText, counters);
}
if (node.Attributes["href"] != null) {
outText.Write($" ({node.Attributes["href"].Value})");
}
break;
case "th":
case "td":
outText.Write(" | ");
break;
}
// Convert child nodes to text if they exist (ignore a href children as they are already handled)
if (node.Name.ToLowerInvariant() != "a" && node.HasChildNodes) {
ConvertContentTo(node: node,
outText: outText,
counters: counters);
}
break;
}
}
#endregion
} // class: HtmlToText