Regular expression to remove HTML tags

Question

17

I have this HTML input:

<font size="5"><p>some text</p>
<p> another text</p></font>

I want to remove the HTML tags with a regular expression so that the output becomes:

some text
another text

Can anyone suggest how to do this with a regular expression?

- ADIT

19

Don't try to parse HTML with regular expressions; it only leads to disaster. - Jon Skeet

2

Please read the answer to a similar question: https://dev59.com/X3I-5IYBdhLWcg3wq6do#1732454 - Sean Patrick Floyd

Further reading: https://dev59.com/EXRA5IYBdhLWcg3wxA9N - Andreas Dolk

5 Answers

9

Use an HTML parser. Here is a Jsoup example.

String input = "<font size=\"5\"><p>some text</p>\n<p>another text</p></font>";
String stripped = Jsoup.parse(input).text();
System.out.println(stripped);

Result:

some text another text

Or, if you want to preserve the newlines:

String input = "<font size=\"5\"><p>some text</p>\n<p>another text</p></font>";
for (String line : input.split("\n")) {
    String stripped = Jsoup.parse(line).text();
    System.out.println(stripped);
}

Result:

some text
another text

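Jsoup offers more advantages. You can easily extract specific parts of an HTML document with the select() method, which accepts jQuery-like CSS selectors. It only requires the document to be semantically well-formed. The presence of the <font> tag, deprecated since 1998, is not a good sign in that respect, but if you know the HTML structure in depth beforehand, it is still doable.

See also:

Pros and cons of leading HTML parsers in Java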

- BalusC

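Note that Jsoup does not just strip the HTML tags; it also adds spaces to separate elements. The resulting text can therefore be longer than the text in the HTML, for example text written in a TinyMCE editor. Keep this in mind if you need to strip tags. - Johncl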

4

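You can use an HTML parser called Jericho HTML Parser.

You can download it from here: http://jericho.htmlparser.net/docs/index.html. Jericho HTML Parser is a Java library that allows analysis and manipulation of parts of an HTML document, including server-side tags, while reproducing verbatim any unrecognised or invalid HTML. It also provides high-level HTML form manipulation functions.

The presence of badly formatted HTML does not interfere with parsing.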

- Prabhakaran

1

Jsoup expects well-formed HTML, so it is no better than Jericho when handling arbitrary HTML. - sproketboy

3

Starting from aioobe's code, I tried something bolder:

String input = "<font size=\"5\"><p>some text</p>\n<p>another text</p></font>";
String stripped = input.replaceAll("</?(font|p){1}.*?/?>", "");
System.out.println(stripped);

Code that strips every HTML tag would look like this:

public class HtmlSanitizer {

    private static String pattern;

    private final static String [] tagsTab = {"!doctype","a","abbr","acronym","address","applet","area","article","aside","audio","b","base","basefont","bdi","bdo","bgsound","big","blink","blockquote","body","br","button","canvas","caption","center","cite","code","col","colgroup","content","data","datalist","dd","decorator","del","details","dfn","dir","div","dl","dt","element","em","embed","fieldset","figcaption","figure","font","footer","form","frame","frameset","h1","h2","h3","h4","h5","h6","head","header","hgroup","hr","html","i","iframe","img","input","ins","isindex","kbd","keygen","label","legend","li","link","listing","main","map","mark","marquee","menu","menuitem","meta","meter","nav","nobr","noframes","noscript","object","ol","optgroup","option","output","p","param","plaintext","pre","progress","q","rp","rt","ruby","s","samp","script","section","select","shadow","small","source","spacer","span","strike","strong","style","sub","summary","sup","table","tbody","td","template","textarea","tfoot","th","thead","time","title","tr","track","tt","u","ul","var","video","wbr","xmp"};

    static {
        StringBuffer tags = new StringBuffer();
        for (int i=0;i<tagsTab.length;i++) {
            tags.append(tagsTab[i].toLowerCase()).append('|').append(tagsTab[i].toUpperCase());
            if (i<tagsTab.length-1) {
                tags.append('|');
            }
        }
        pattern = "</?("+tags.toString()+"){1}.*?/?>";
    }

    public static String sanitize(String input) {
        return input.replaceAll(pattern, "");
    }

    public final static void main(String[] args) {
        System.out.println(HtmlSanitizer.pattern);

        System.out.println(HtmlSanitizer.sanitize("<font size=\"5\"><p>some text</p><br/> <p>another text</p></font>"));
    }

}

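I wrote this to be Java 1.4 compliant, for some sad reasons, so feel free to use for-each and StringBuilder...

Pros:

You can generate the list of tags to strip, which means you can keep the tags you want
It avoids stripping content that is not an HTML tag
It preserves whitespace

Cons:

You have to list every HTML tag you want to strip from the string. That can be a lot, for example if you want to strip everything.

If you find other drawbacks, please let me know.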
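
One drawback that can be checked directly: because the pattern follows the tag name with .*?, a listed name also matches as a prefix, so listing only font and p still strips <pre>. The sketch below is one possible tightening, not the code above: it adds a word boundary after the tag name and uses Pattern.CASE_INSENSITIVE instead of listing upper- and lower-case variants. The names TagListStripper, buildPattern and strip are made up for this illustration, and the code needs Java 5 or later rather than 1.4.

```java
import java.util.regex.Pattern;

// Illustrative sketch: TagListStripper, buildPattern and strip are hypothetical names.
class TagListStripper {

    // Builds a pattern for opening/closing tags whose name is one of tagNames.
    // Expects at least one tag name.
    static Pattern buildPattern(String... tagNames) {
        StringBuilder alternatives = new StringBuilder();
        for (String name : tagNames) {
            if (alternatives.length() > 0) {
                alternatives.append('|');
            }
            alternatives.append(Pattern.quote(name));
        }
        // \b stops "p" from matching the start of "pre"; [^>]* skips the attributes.
        return Pattern.compile("</?(?:" + alternatives + ")\\b[^>]*>",
                Pattern.CASE_INSENSITIVE);
    }

    static String strip(String input, String... tagNames) {
        return buildPattern(tagNames).matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<font size=\"5\"><p>some text</p> <pre>kept</pre></font>";

        // Pattern style from the answer: "p" also matches as a prefix of "pre".
        System.out.println(html.replaceAll("</?(font|p){1}.*?/?>", ""));
        // prints: some text kept

        // With the word boundary, <pre> is left alone.
        System.out.println(strip(html, "font", "p"));
        // prints: some text <pre>kept</pre>
    }
}
```

This is still a regex and shares the usual limits: a '>' inside an attribute value, or a custom element such as <p-note>, would still be handled incorrectly.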

- Alexis Dufrenoy

2

If you use Jericho, you only need something like this:

public String extractAllText(String htmlText){
    Source source = new Source(htmlText);
    return source.getTextExtractor().toString();
}

Of course, you can do the same with a single element:

for (Element link : links) {
  System.out.println(link.getTextExtractor().toString());
}

- Fabiano Francesconi

- aioobe · Accepted Answer

Since you asked, here is a quick and dirty solution:

String stripped = input.replaceAll("<[^>]*>", "");

(Ideone.com demo)

That said, using regular expressions to process HTML is not a good idea. The hack above cannot handle input such as:

<tag attribute=">">Hello</tag>
<script>if (a < b) alert('Hello>');</script>
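
To make the failure concrete, here is a small self-contained sketch that runs the one-line regex on the question's input and on the two problem inputs above. The class name RegexStripDemo and the helper strip are made up for this illustration; only the replaceAll call comes from the answer.

```java
// Illustrative sketch: RegexStripDemo and strip() are hypothetical names.
class RegexStripDemo {

    // The quick regex from above: delete everything from '<' up to the next '>'.
    static String strip(String input) {
        return input.replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        // Simple markup from the question: works as hoped.
        System.out.println(strip("<font size=\"5\"><p>some text</p>\n<p>another text</p></font>"));
        // prints: some text
        //         another text

        // A '>' inside an attribute value ends the match too early.
        System.out.println(strip("<tag attribute=\">\">Hello</tag>"));
        // prints: ">Hello

        // A '<' inside script code is mistaken for the start of a tag.
        System.out.println(strip("<script>if (a < b) alert('Hello>');</script>"));
        // prints: if (a ');
    }
}
```

In both failing cases the regex has no notion of quoting or of script content, so it simply pairs each '<' with the next '>'.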

A better approach is to use a parser such as Jsoup. To remove all tags from a string, you can call Jsoup.parse(html).text().