HtmlAgilityPack doesn't remove all HTML tags. How can I solve this efficiently?

Question



I'm using the following method to strip all HTML from a string:

public static string StripHtmlTags(string html)
{
    if (String.IsNullOrEmpty(html)) return "";
    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(html);
    return doc.DocumentNode.InnerText;
}

But it seems to ignore the following markup: […]

So the method basically returns this string:

> A hungry thief who stole a rack of pork ribs from a grocery store has
> been sentenced to spend 50 years in prison. Willie Smith Ward felt the
> full force of the law after being convicted of the crime in Waco,
> Texas, on Wednesday. The 43-year-old may feel slightly aggrieved over
> the severity of the [&#8230;]

How can I make sure this kind of markup gets stripped as well?

Any help is much appreciated.

- Obsivus

… is not an HTML tag. Tags have angle brackets. This is an encoded entity. - jessehouwing

1 Answer

- Damith · Accepted Answer

Try HttpUtility.HtmlDecode on the extracted text:

public static string StripHtmlTags(string html)
{
    if (String.IsNullOrEmpty(html)) return "";
    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(html);
    return HttpUtility.HtmlDecode(doc.DocumentNode.InnerText);
}

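HtmlDecode will convert [&#8230;] to […]. InnerText only removes the tags; character references such as &#8230; are left in the text until they are decoded.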
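
If you want to see the decode step on its own, here is a minimal sketch. It assumes you are fine with System.Net.WebUtility.HtmlDecode instead of HttpUtility.HtmlDecode (it handles this case the same way and does not need a reference to System.Web). The class name EntityDecodeDemo, the helper DecodeEntities, and the sample string are made up for illustration; the input stands in for what InnerText returns after the tags are gone.

```csharp
using System;
using System.Net;

public static class EntityDecodeDemo
{
    // Hypothetical helper: decodes HTML character references in text
    // whose tags have already been removed (e.g. the result of InnerText).
    public static string DecodeEntities(string text)
    {
        if (string.IsNullOrEmpty(text)) return "";
        // &#8230; is the numeric reference for U+2026 (horizontal ellipsis).
        return WebUtility.HtmlDecode(text);
    }

    public static void Main()
    {
        // Sample input (an assumption): tag-free text that still has an entity in it.
        string stripped = "the severity of the [&#8230;]";
        Console.WriteLine(DecodeEntities(stripped));
    }
}
```

Note that decoding turns the entity into the real character; it does not drop it. If you want the trailing "[…]" gone entirely, you would have to remove it yourself after decoding.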