How do I extract text from reasonably clean HTML?

23

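My question is somewhat like this question, but I have more constraints:

  • I know the documents are reasonably sane
  • They are very regular (they all come from the same source)
  • I want about 99% of the visible text
  • About 99% of the content is text (the documents are more or less RTF converted to HTML)
  • I don't care about formatting or paragraph breaks.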

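Is there a tool that does this, or am I better off breaking out RegexBuddy and C#?

I'm open to command-line or batch-processing tools as well as C/C#/D libraries.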


If there weren't so many constraints, I wouldn't even think of using regex :) - BCS
10 Answers

24
I wrote this code today using HTML Agility Pack. It extracts the text unformatted and trimmed.
public static string ExtractText(string html)
{
    if (html == null)
    {
        throw new ArgumentNullException("html");
    }

    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(html);

    var chunks = new List<string>(); 

    foreach (var item in doc.DocumentNode.DescendantNodesAndSelf())
    {
        if (item.NodeType == HtmlNodeType.Text)
        {
            if (item.InnerText.Trim() != "")
            {
                chunks.Add(item.InnerText.Trim());
            }
        }
    }
    return String.Join(" ", chunks);
}

If you want to keep some of the formatting, you can build on the sample that ships with the source.
public string Convert(string path)
{
    HtmlDocument doc = new HtmlDocument();
    doc.Load(path);

    StringWriter sw = new StringWriter();
    ConvertTo(doc.DocumentNode, sw);
    sw.Flush();
    return sw.ToString();
}

public string ConvertHtml(string html)
{
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(html);

    StringWriter sw = new StringWriter();
    ConvertTo(doc.DocumentNode, sw);
    sw.Flush();
    return sw.ToString();
}

public void ConvertTo(HtmlNode node, TextWriter outText)
{
    string html;
    switch (node.NodeType)
    {
        case HtmlNodeType.Comment:
            // don't output comments
            break;

        case HtmlNodeType.Document:
            ConvertContentTo(node, outText);
            break;

        case HtmlNodeType.Text:
            // script and style must not be output
            string parentName = node.ParentNode.Name;
            if ((parentName == "script") || (parentName == "style"))
                break;

            // get text
            html = ((HtmlTextNode) node).Text;

            // is it in fact a special closing node output as text?
            if (HtmlNode.IsOverlappedClosingElement(html))
                break;

            // check the text is meaningful and not a bunch of whitespaces
            if (html.Trim().Length > 0)
            {
                outText.Write(HtmlEntity.DeEntitize(html));
            }
            break;

        case HtmlNodeType.Element:
            switch (node.Name)
            {
                case "p":
                    // treat paragraphs as crlf
                    outText.Write("\r\n");
                    break;
            }

            if (node.HasChildNodes)
            {
                ConvertContentTo(node, outText);
            }
            break;
    }
}


private void ConvertContentTo(HtmlNode node, TextWriter outText)
{
    foreach (HtmlNode subnode in node.ChildNodes)
    {
        ConvertTo(subnode, outText);
    }
}

16
You can use NUglify, which supports extracting text from HTML:
var result = Uglify.HtmlToText("<div>  <p>This is <em>   a text    </em></p>   </div>");
Console.WriteLine(result.Code);   // prints: This is a text

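Because it uses a custom HTML5 parser, it should be quite robust (especially if the documents contain no errors), and it is very fast (no regex involved, just a pure recursive descent parser).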


2
This works great and is very easy to follow. Thanks! - flytzen
1
A big time saver for me. Thanks. - dotcoder

12

You need to use the HTML Agility Pack.

You will probably want to find elements using LINQ queries and the Descendants call, then read their InnerText.
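
As a rough sketch of what that suggestion could look like, the snippet below uses Descendants with a LINQ query to collect the text nodes. The class and method names (LinqTextExtractor, ExtractText) are hypothetical, and skipping script and style content is an assumption borrowed from the other answers here, not something this answer specifies.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class LinqTextExtractor
{
    // Hypothetical helper: not part of HTML Agility Pack itself.
    public static string ExtractText(string html)
    {
        if (html == null) throw new ArgumentNullException(nameof(html));

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var chunks = doc.DocumentNode
            .Descendants()                                   // every node below the root
            .Where(n => n.NodeType == HtmlNodeType.Text)     // keep text nodes only
            .Where(n => n.ParentNode.Name != "script"        // assumption: drop script
                     && n.ParentNode.Name != "style")        // and style content
            .Select(n => HtmlEntity.DeEntitize(n.InnerText).Trim())
            .Where(t => t.Length > 0);                       // drop whitespace-only runs

        return string.Join(" ", chunks);
    }
}
```

The query replaces the explicit loop and nested if statements with a chain of filters, which is the kind of shortening the comments below refer to.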


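You mean I need to learn LINQ? (Surprisingly, this is really the first thing I've run into that looks like it calls for LINQ, but then again I don't usually work in this area.) - BCS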
1
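@BCS: You don't have to learn LINQ, but it will make programming much easier. I would guess that effective use of LINQ can make your code at least 20% shorter, and easier to understand too. - SLaks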
1
The Agility Pack is far better than writing your own DOM handling. - Random Developer
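As it turns out, LINQ wasn't the simplest solution, but only because there is a sample project, html2text, that already does 99% of what I want, and the last 1% only takes a few lines of if(...) return; (the documentation isn't great, though). - BCS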

5
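Here is the code I'm using:
using System.Text.RegularExpressions;
using System.Web;

public static string ExtractText(string html)
{
    // Replace every tag with a space so adjacent text runs stay separated.
    Regex reg = new Regex("<[^>]+>", RegexOptions.IgnoreCase);
    string s = reg.Replace(html, " ");
    // Decode entities such as &amp; and &lt; into plain characters.
    s = HttpUtility.HtmlDecode(s);
    return s;
}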

1
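This may be acceptable in some cases. Note, however, that any right angle bracket inside a comment or CDATA block will break this regex, not to mention that it mishandles the contents of <script> and <style> tags. Also, although (as far as I know) the standard requires angle brackets in attribute values to be encoded, modern browsers are lenient about things like <div data:tree="parent>child">Some text</div>, which will also break your regex. - Roy Tinker
1
What is the purpose of the IgnoreCase option here? - JohnnyHK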
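
The pitfalls raised in the comment above can be reproduced with a small sketch. It applies the same pattern as the answer to two inputs: the attribute example from the comment and a script block. RegexStripDemo and Strip are hypothetical names used only for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexStripDemo
{
    // Same pattern as the answer above; IgnoreCase is omitted because the
    // pattern contains no letters, so the option has no effect.
    private static readonly Regex TagRegex = new Regex("<[^>]+>");

    // Replaces every tag-like match with a single space.
    public static string Strip(string html)
    {
        return TagRegex.Replace(html, " ");
    }

    public static void Demo()
    {
        // The '>' inside the attribute value ends the match early,
        // so part of the attribute leaks into the output.
        Console.WriteLine(Strip("<div data:tree=\"parent>child\">Some text</div>"));

        // Only the tags are removed; the script source stays in the text.
        Console.WriteLine(Strip("<script>var a = 1;</script>Hi"));
    }
}
```

For input as regular as the question describes, neither case may ever come up, which is why the approach can still be good enough there.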

3
This is the best approach:

public static string StripHTML(string HTMLText)
{
    Regex reg = new Regex("<[^>]+>", RegexOptions.IgnoreCase);
    return reg.Replace(HTMLText, "");
}

1
Picking just one result from a Google search for "Html RegEx": -> https://dev59.com/X3I-5IYBdhLWcg3wq6do - BCS

3

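If you load the HTML in C# and use mshtml.dll or the WebBrowser control in C#/WinForms, you can treat the whole HTML document as a tree and walk it, capturing the InnerText of each object. This is relatively simple.

Alternatively, you can use document.all, which flattens the tree. You can then iterate over it, again capturing InnerText.

Here is an example: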

        WebBrowser webBrowser = new WebBrowser();
        webBrowser.Url = new Uri("url_of_file"); //can be remote or local
        webBrowser.DocumentCompleted += delegate
        {
            HtmlElementCollection collection = webBrowser.Document.All;
            List<string> contents = new List<string>();

            /*
             * Adds all inner-text of a tag, including inner-text of sub-tags
             * ie. <html><body><a>test</a><b>test 2</b></body></html> would do:
             * "test test 2" when collection[i] == <html>
             * "test test 2" when collection[i] == <body>
             * "test" when collection[i] == <a>
             * "test 2" when collection[i] == <b>
             */
            for (int i = 0; i < collection.Count; i++)
            {
                if (!string.IsNullOrEmpty(collection[i].InnerText))
                {
                    contents.Add(collection[i].InnerText);
                }
            }

            /*
             * <html><body><a>test</a><b>test 2</b></body></html>
             * outputs: test test 2|test test 2|test|test 2
             */
            string contentString = string.Join("|", contents.ToArray());
            MessageBox.Show(contentString);
        };

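Hope this helps!

Googling mshtml.dll mostly turns up pages about bug reports and bug fixes. Do you have a link to documentation? - BCS
I've just edited my post to include an example using the WebBrowser control. - AlishahNovin
Unfortunately, this approach doesn't work on Server Core systems, because they don't have the WebBrowser component installed. - Dmitrii Erokhin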

3
Here is a class I developed to do the same thing. All the available HTML parsing libraries were too slow, and regex was too slow as well. The code comments explain how it works. In my benchmarks, run against the Amazon landing page, this code was more than 10 times faster than the equivalent HTML Agility Pack code (shown below).
using System;

/// <summary>
/// The fast HTML text extractor class is designed to, as quickly and as ignorantly as possible,
/// extract text data from a given HTML character array. The class searches for and deletes
/// script and style tags in a first and second pass, with an optional third pass to do the same
/// to HTML comments, and then copies remaining non-whitespace character data to an output array.
/// All whitespace encountered is replaced with a single space in order to avoid multiple
/// whitespace in the output.
///
/// Note that the returned text content still may have named character and numbered character
/// references within that, when decoded, may produce multiple whitespace.
/// </summary>
public class FastHtmlTextExtractor
{

    private readonly char[] SCRIPT_OPEN_TAG = new char[7] { '<', 's', 'c', 'r', 'i', 'p', 't' };
    private readonly char[] SCRIPT_CLOSE_TAG = new char[9] { '<', '/', 's', 'c', 'r', 'i', 'p', 't', '>' };

    private readonly char[] STYLE_OPEN_TAG = new char[6] { '<', 's', 't', 'y', 'l', 'e' };
    private readonly char[] STYLE_CLOSE_TAG = new char[8] { '<', '/', 's', 't', 'y', 'l', 'e', '>' };

    private readonly char[] COMMENT_OPEN_TAG = new char[4] { '<', '!', '-', '-' };
    private readonly char[] COMMENT_CLOSE_TAG = new char[3] { '-', '-', '>' };

    private int[] m_deletionDictionary;

    public string Extract(char[] input, bool stripComments = false)
    {
        var len = input.Length;
        int next = 0;

        m_deletionDictionary = new int[len];

        // Wipe out all text content between style and script tags.
        FindAndWipe(SCRIPT_OPEN_TAG, SCRIPT_CLOSE_TAG, input);
        FindAndWipe(STYLE_OPEN_TAG, STYLE_CLOSE_TAG, input);

        if(stripComments)
        {
            // Wipe out everything between HTML comments.
            FindAndWipe(COMMENT_OPEN_TAG, COMMENT_CLOSE_TAG, input);
        }

        // Wipe all remaining tags now.
        while(next < len)
        {
            next = SkipUntil(next, '<', input);

            if(next < len)
            {
                var closeNext = SkipUntil(next, '>', input);

                if(closeNext < len)
                {
                    m_deletionDictionary[next] = (closeNext + 1) - next;
                    WipeRange(next, closeNext + 1, input);
                }

                next = closeNext + 1;
            }
        }

        // Collect all non-whitespace and non-null chars into a new
        // char array. All whitespace characters are skipped and replaced
        // with a single space char. Multiple whitespace is ignored.
        var lastSpace = true;
        var extractedPos = 0;
        var extracted = new char[len];

        for(next = 0; next < len; ++next)
        {
            if(m_deletionDictionary[next] > 0)
            {
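                // Jump to the last wiped char; the loop's ++next then lands on the
                // first char after the deleted range instead of skipping it.
                next += m_deletionDictionary[next] - 1;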
                continue;
            }

            if(char.IsWhiteSpace(input[next]) || input[next] == '\0')
            {
                if(lastSpace)
                {
                    continue;
                }

                extracted[extractedPos++] = ' ';
                lastSpace = true;
            }
            else
            {
                lastSpace = false;
                extracted[extractedPos++] = input[next];
            }
        }

        return new string(extracted, 0, extractedPos);
    }

    /// <summary>
    /// Does a search in the input array for the characters in the supplied open and closing tag
    /// char arrays. Each match where both tag open and tag close are discovered causes the text
    /// in between the matches to be overwritten by Array.Clear().
    /// </summary>
    /// <param name="openingTag">
    /// The opening tag to search for.
    /// </param>
    /// <param name="closingTag">
    /// The closing tag to search for.
    /// </param>
    /// <param name="input">
    /// The input to search in.
    /// </param>
    private void FindAndWipe(char[] openingTag, char[] closingTag, char[] input)
    {
        int len = input.Length;
        int pos = 0;

        do
        {
            pos = FindNext(pos, openingTag, input);

            if(pos < len)
            {
                var closenext = FindNext(pos, closingTag, input);

                if(closenext < len)
                {
                    m_deletionDictionary[pos - openingTag.Length] = closenext - (pos - openingTag.Length);
                    WipeRange(pos - openingTag.Length, closenext, input);
                }

                if(closenext > pos)
                {
                    pos = closenext;
                }
                else
                {
                    ++pos;
                }
            }
        }
        while(pos < len);
    }

    /// <summary>
    /// Skips as many characters as possible within the input array until the given char is
    /// found. The position of the first instance of the char is returned, or if not found, a
    /// position beyond the end of the input array is returned.
    /// </summary>
    /// <param name="pos">
    /// The starting position to search from within the input array.
    /// </param>
    /// <param name="c">
    /// The character to find.
    /// </param>
    /// <param name="input">
    /// The input to search within.
    /// </param>
    /// <returns>
    /// The position of the found character, or an index beyond the end of the input array.
    /// </returns>
    private int SkipUntil(int pos, char c, char[] input)
    {
        if(pos >= input.Length)
        {
            return pos;
        }

        do
        {
            if(input[pos] == c)
            {
                return pos;
            }

            ++pos;
        }
        while(pos < input.Length);

        return pos;
    }

    /// <summary>
    /// Clears a given range in the input array.
    /// </summary>
    /// <param name="start">
    /// The start position from which the array will begin to be cleared.
    /// </param>
    /// <param name="end">
    /// The end position in the array, the position to clear up-until.
    /// </param>
    /// <param name="input">
    /// The source array wherein the supplied range will be cleared.
    /// </param>
    /// <remarks>
    /// Note that the second parameter is called end, not length. This parameter is meant to be
    /// a position in the array, not the amount of entries in the array to clear.
    /// </remarks>
    private void WipeRange(int start, int end, char[] input)
    {
        Array.Clear(input, start, end - start);
    }

    /// <summary>
    /// Finds the next occurrence of the supplied char array within the input array. This search
    /// ignores whitespace.
    /// </summary>
    /// <param name="pos">
    /// The position to start searching from.
    /// </param>
    /// <param name="what">
    /// The sequence of characters to find.
    /// </param>
    /// <param name="input">
    /// The input array to perform the search on.
    /// </param>
    /// <returns>
    /// The position of the end of the first matching occurrence. That is, the returned position
    /// points to the very end of the search criteria within the input array, not the start. If
    /// no match could be found, a position beyond the end of the input array will be returned.
    /// </returns>
    public int FindNext(int pos, char[] what, char[] input)
    {
        do
        {
            if(Next(ref pos, what, input))
            {
                return pos;
            }
            ++pos;
        }
        while(pos < input.Length);

        return pos;
    }

    /// <summary>
    /// Probes the input array at the given position to determine if the next N characters
    /// matches the supplied character sequence. This check ignores whitespace.
    /// </summary>
    /// <param name="pos">
    /// The position at which to check within the input array for a match to the supplied
    /// character sequence.
    /// </param>
    /// <param name="what">
    /// The character sequence to attempt to match. Note that whitespace between characters
    /// within the input array is acceptable.
    /// </param>
    /// <param name="input">
    /// The input array to check within.
    /// </param>
    /// <returns>
    /// True if the next N characters within the input array matches the supplied search
    /// character sequence. Returns false otherwise.
    /// </returns>
    public bool Next(ref int pos, char[] what, char[] input)
    {
        int z = 0;

        do
        {
            if(char.IsWhiteSpace(input[pos]) || input[pos] == '\0')
            {
                ++pos;
                continue;
            }

            if(input[pos] == what[z])
            {
                ++z;
                ++pos;
                continue;
            }

            return false;
        }
        while(pos < input.Length && z < what.Length);

        return z == what.Length;
    }
}

The equivalent in HtmlAgilityPack:

// Where m_whitespaceRegex is a Regex with [\s].
// Where sampleHtmlText is a raw HTML string.

var extractedSampleText = new StringBuilder();
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(sampleHtmlText);

if(doc != null && doc.DocumentNode != null)
{
    foreach(var script in doc.DocumentNode.Descendants("script").ToArray())
    {
        script.Remove();
    }

    foreach(var style in doc.DocumentNode.Descendants("style").ToArray())
    {
        style.Remove();
    }

    var allTextNodes = doc.DocumentNode.SelectNodes("//text()");
    if(allTextNodes != null && allTextNodes.Count > 0)
    {
        foreach(HtmlNode node in allTextNodes)
        {
            extractedSampleText.Append(node.InnerText);
        }
    }

    var finalText = m_whitespaceRegex.Replace(extractedSampleText.ToString(), " ");
}
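
The class summary above notes that character references are left undecoded, and the comparison code assumes a whitespace regex that is not shown. A minimal post-processing sketch covering both could look like the following. TextCleanup and Normalize are hypothetical names, and the pattern \s+ (a run of whitespace rather than a single [\s]) is an assumption made here so that runs collapse into one space.

```csharp
using System.Net;
using System.Text.RegularExpressions;

public static class TextCleanup
{
    // Assumption: collapse whole runs of whitespace, not single characters.
    private static readonly Regex WhitespaceRun = new Regex(@"\s+");

    // Hypothetical helper: decodes entities first, because decoding can
    // itself produce whitespace, then collapses whitespace and trims.
    public static string Normalize(string extracted)
    {
        string decoded = WebUtility.HtmlDecode(extracted);
        return WhitespaceRun.Replace(decoded, " ").Trim();
    }
}
```

Either extractor's output could be passed through Normalize as a final step. The extra pass costs time, so it would need to be included in any speed comparison.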

1

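From the command line, you can use the Lynx text browser like this:

If you want to download a web page as formatted output (that is, without HTML tags, rendered as it would appear in Lynx), enter: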

lynx -dump URL > filename

If the page contains any links, their URLs are listed at the end of the downloaded page.
You can disable the link list with -nolist. For example:
lynx -dump -nolist https://dev59.com/FXI95IYBdhLWcg3w3iHG#10469619 > filename

1
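Here you can download a tool, with its source code, that converts between HTML and XAML: XAML/HTML converter
It contains an HTML parser (which obviously has to be far more lenient than a standard XML parser), and you can walk the HTML much as you would XML.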

0

Try the following code:

string? GetBodyPreview(string? htmlBody)
{
    Regex reg = new Regex("<[^>]+>", RegexOptions.IgnoreCase);
    htmlBody = reg.Replace(Crop(htmlBody, "<body ", 1000), "");
    return Crop(HttpUtility.HtmlDecode(htmlBody), "", 255);

    string Crop(string? text, string start, int maxLength)
    {
        var s = text?.IndexOf(start);
        var r = (s >= 0 ? text?.Substring(text.IndexOf(start)) : text) ?? string.Empty;
        return r.Substring(0, Int32.Min(r.Length, maxLength)).TrimStart();
    }
}

Content provided by Stack Overflow.