Removing all HTML tags with Html Agility Pack

19

I have an HTML string like this:

<html><body><p>foo <a href='http://www.example.com'>bar</a> baz</p></body></html>

I would like to strip all the HTML tags so that the resulting string becomes:

foo bar baz

I found this function in another Stack Overflow post (it uses Html Agility Pack):

  Public Shared Function stripTags(ByVal html As String) As String
    Dim plain As String = String.Empty
    Dim htmldoc As New HtmlAgilityPack.HtmlDocument

    htmldoc.LoadHtml(html)
    Dim invalidNodes As HtmlAgilityPack.HtmlNodeCollection = htmldoc.DocumentNode.SelectNodes("//html|//body|//p|//a")

    If invalidNodes IsNot Nothing Then
      For Each node In invalidNodes
        node.ParentNode.RemoveChild(node, True)
      Next
    End If

    Return htmldoc.DocumentNode.WriteContentTo
  End Function
Unfortunately, this does not return what I expect. Instead it returns:
bazbarfoo

What am I doing wrong, and is this the best way to do it?

Regards, and happy coding!

Update: based on the answers below, I came up with this function, which may be useful to others:

  Public Shared Function stripTags(ByVal html As String) As String
    Dim htmldoc As New HtmlAgilityPack.HtmlDocument
    htmldoc.LoadHtml(html.Replace("</p>", "</p>" & Environment.NewLine & Environment.NewLine).Replace("<br/>", Environment.NewLine))
    Return htmldoc.DocumentNode.InnerText
  End Function
5 Answers

35
Why not just return htmldoc.DocumentNode.InnerText instead of removing all the non-text nodes? That should give you what you want.
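
A minimal sketch of this suggestion, assuming the HtmlAgilityPack package is referenced by the project. The module and function names (`InnerTextDemo`, `StripTags`) are made up for illustration, and the `HtmlEntity.DeEntitize` call is an optional extra that the answer does not mention; it decodes entities such as `&amp;` that may otherwise remain in the text.

```vbnet
Imports System
Imports HtmlAgilityPack

Module InnerTextDemo
    ' Hypothetical helper: returns the text content of the parsed document.
    Public Function StripTags(ByVal html As String) As String
        If String.IsNullOrEmpty(html) Then
            Return String.Empty
        End If

        Dim doc As New HtmlDocument()
        doc.LoadHtml(html)

        ' InnerText concatenates the text of the document's nodes in document order.
        ' DeEntitize is an optional extra step (an assumption, not part of the answer).
        Return HtmlEntity.DeEntitize(doc.DocumentNode.InnerText)
    End Function

    Sub Main()
        Dim html As String = "<html><body><p>foo <a href='http://www.example.com'>bar</a> baz</p></body></html>"
        Console.WriteLine(StripTags(html))
    End Sub
End Module
```

For the HTML string in the question, this should print the desired `foo bar baz`.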

2
This removes tags and attributes that are not found in the whitelist.
Imports System.Collections.Generic
Imports System.Linq
Imports HtmlAgilityPack

Public NotInheritable Class HtmlSanitizer
    Private Sub New()
    End Sub
    Private Shared ReadOnly Whitelist As IDictionary(Of String, String())
    Private Shared DeletableNodesXpath As New List(Of String)()

    Shared Sub New()
        Whitelist = New Dictionary(Of String, String())() From { _
            {"a", New String() {"href"}}, _
            {"strong", Nothing}, _
            {"em", Nothing}, _
            {"blockquote", Nothing}, _
            {"b", Nothing}, _
            {"p", Nothing}, _
            {"ul", Nothing}, _
            {"ol", Nothing}, _
            {"li", Nothing}, _
            {"div", New String() {"align"}}, _
            {"strike", Nothing}, _
            {"u", Nothing}, _
            {"sub", Nothing}, _
            {"sup", Nothing}, _
            {"table", Nothing}, _
            {"tr", Nothing}, _
            {"td", Nothing}, _
            {"th", Nothing} _
        }
    End Sub

    Public Shared Function Sanitize(input As String) As String
        If String.IsNullOrWhiteSpace(input) Then
            Return String.Empty
        End If
        Dim htmlDocument = New HtmlDocument()

        htmlDocument.LoadHtml(input)
        SanitizeNode(htmlDocument.DocumentNode)
        Dim xPath As String = HtmlSanitizer.CreateXPath()

        Return StripHtml(htmlDocument.DocumentNode.WriteTo().Trim(), xPath)
    End Function

    Private Shared Sub SanitizeChildren(parentNode As HtmlNode)
        For i As Integer = parentNode.ChildNodes.Count - 1 To 0 Step -1
            SanitizeNode(parentNode.ChildNodes(i))
        Next
    End Sub

    Private Shared Sub SanitizeNode(node As HtmlNode)
        If node.NodeType = HtmlNodeType.Element Then
            If Not Whitelist.ContainsKey(node.Name) Then
                ' Rename every non-whitelisted element so a single XPath can find them all later.
                node.Name = "removeableNode"
                If Not DeletableNodesXpath.Contains(node.Name) Then
                    DeletableNodesXpath.Add(node.Name)
                End If
                If node.HasChildNodes Then
                    SanitizeChildren(node)
                End If

                Return
            End If

            If node.HasAttributes Then
                For i As Integer = node.Attributes.Count - 1 To 0 Step -1
                    Dim currentAttribute As HtmlAttribute = node.Attributes(i)
                    Dim allowedAttributes As String() = Whitelist(node.Name)
                    If allowedAttributes IsNot Nothing Then
                        If Not allowedAttributes.Contains(currentAttribute.Name) Then
                            node.Attributes.Remove(currentAttribute)
                        End If
                    Else
                        node.Attributes.Remove(currentAttribute)
                    End If
                Next
            End If
        End If

        If node.HasChildNodes Then
            SanitizeChildren(node)
        End If
    End Sub

    Private Shared Function StripHtml(html As String, xPath As String) As String
        Dim htmlDoc As New HtmlDocument()
        htmlDoc.LoadHtml(html)
        If xPath.Length > 0 Then
            Dim invalidNodes As HtmlNodeCollection = htmlDoc.DocumentNode.SelectNodes(xPath)
            For Each node As HtmlNode In invalidNodes
                node.ParentNode.RemoveChild(node, True)
            Next
        End If
        Return htmlDoc.DocumentNode.WriteContentTo()


    End Function

    Private Shared Function CreateXPath() As String
        Dim _xPath As String = String.Empty
        For i As Integer = 0 To DeletableNodesXpath.Count - 1
            If i <> DeletableNodesXpath.Count - 1 Then
                _xPath += String.Format("//{0}|", DeletableNodesXpath(i).ToString())
            Else
                _xPath += String.Format("//{0}", DeletableNodesXpath(i).ToString())
            End If
        Next
        Return _xPath
    End Function
End Class

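In your dictionary, every entry except the first has the value Nothing. You could probably skip the map and use a list instead. - Zasz
A List might be slower, but it is unlikely to be a bottleneck. That said, on .NET 3.5+ I would suggest a HashSet rather than a List for this purpose. - Brian
As Brian points out, the choice of data structure here is "unlikely to be a bottleneck". Compared with the work done on each node, ContainsKey probably accounts for only a small fraction, doesn't it? - Oskar Austegard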
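
As a rough illustration of the HashSet idea from the comments, here is a sketch that checks tag names only. The names `WhitelistDemo`, `AllowedTags` and `IsAllowed` are hypothetical, and the case-insensitive comparer is an assumption. A set holds no per-tag attribute lists, so tags such as "a" and "div" that allow specific attributes would still need the dictionary or some separate handling.

```vbnet
Imports System
Imports System.Collections.Generic

Module WhitelistDemo
    ' Tag names only: a set cannot carry the per-tag attribute lists.
    Private ReadOnly AllowedTags As New HashSet(Of String)(StringComparer.OrdinalIgnoreCase) From {
        "strong", "em", "blockquote", "b", "p", "ul", "ol", "li"
    }

    ' Hypothetical helper: True when the tag name is in the whitelist.
    Public Function IsAllowed(ByVal tagName As String) As Boolean
        Return AllowedTags.Contains(tagName)
    End Function

    Sub Main()
        Console.WriteLine(IsAllowed("P"))       ' True (the comparison ignores case)
        Console.WriteLine(IsAllowed("script"))  ' False
    End Sub
End Module
```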

1

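You seem to assume that For Each walks the document from start to end. If you want to be sure of that, use a plain For loop. You cannot even be certain that the XPath selector picks up the nodes in the order you expect, although in this case you are probably right.

Regards, Brunis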


0

Edit the lines below and you will get the result you want.

Private Shared Function StripHtml(html As String, xPath As String) As String
    Dim htmlDoc As New HtmlAgilityPack.HtmlDocument()
    htmlDoc.LoadHtml(html)
    If xPath.Length > 0 Then
        Dim invalidNodes As HtmlNodeCollection = htmlDoc.DocumentNode.SelectNodes(xPath)

        '------- edit this line -------------------
        'For Each node As HtmlNode In invalidNodes
        'node.ParentNode.RemoveChild(node, True)
        'Next
        '
        ' result-> bazbarfoo
        '

        '------- modify line ----------------------
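        ' Walk the collection from the last node to the first.
        If invalidNodes IsNot Nothing Then
            For i As Integer = invalidNodes.Count - 1 To 0 Step -1
                Dim node As HtmlNode = invalidNodes.Item(i)
                node.ParentNode.RemoveChild(node, True)
            Next
        End If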
        '
        ' result-> foo bar baz
        '
    End If
    Return htmlDoc.DocumentNode.WriteContentTo()


End Function

-6
You can use the following code.
public string RemoveHTMLTags(string source)
{
     string expn = "<.*?>";
     return Regex.Replace(source, expn, string.Empty);
}

1
What about content inside < > that is not an HTML tag? For example, "John Smith <jsmith@email.com>": this method would strip the address. - JDwyer
3
Parsing HTML with regular expressions is usually not a good idea. See http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html. - TrueWill
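
To make the objection in these comments concrete, here is a small sketch that ports the regular expression from the answer to VB.NET and feeds it text containing angle brackets that are not tags. The module name `RegexPitfallDemo` and the second sample string are made up for illustration.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module RegexPitfallDemo
    ' Same pattern as the regex answer, ported to VB.NET.
    Public Function RemoveHtmlTags(ByVal source As String) As String
        Return Regex.Replace(source, "<.*?>", String.Empty)
    End Function

    Sub Main()
        ' The e-mail address is not a tag, but the pattern removes it anyway.
        Console.WriteLine("[" & RemoveHtmlTags("John Smith <jsmith@email.com>") & "]")  ' [John Smith ]

        ' Comparison operators are mistaken for a tag as well.
        Console.WriteLine(RemoveHtmlTags("if a<b and b>c then"))  ' if ac then
    End Sub
End Module
```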
