How do I convert HTML text to plain text?

Question

How do I convert HTML text to plain text?

39

Friends, I have to parse a description fetched from a URL, and the parsed content contains some HTML tags. How can I convert it to plain text?

- MGSenthil

What exactly do you need? Do you want to strip the HTML tags, or extract the content of specific tags? - Vivien Barousse

I am able to extract the content, but it contains HTML tags, e.g. zcc dsdfsf ddfdfsf sfdfdfdfdf. I already have the data; what I need is simple plain text without those HTML tags. - MGSenthil

A similar question with good answers is here: https://dev59.com/Q3I_5IYBdhLWcg3wHfSr#1519726. I used Jericho, and it works well. - рüффп

1

You should mark this question as answered. - ankitjaininfo

1

Duplicates: https://dev59.com/UXVC5IYBdhLWcg3wnCaA, https://dev59.com/zXI-5IYBdhLWcg3wu7Lv, https://dev59.com/Q3I_5IYBdhLWcg3wHfSr and https://dev59.com/EXRA5IYBdhLWcg3wxA9N. - koppor

10 Answers

29

Just stripping the HTML tags is simple:

// replace all occurrences of one or more HTML tags with optional
// whitespace inbetween with a single space character 
String strippedText = htmlText.replaceAll("(?s)<[^>]*>(\\s*<[^>]*>)*", " ");

But unfortunately, requirements are never that simple:

Usually, <p> and <div> elements need separate handling. There may also be CDATA blocks containing > characters (e.g. JavaScript) that break the regex, and so on.
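To make that failure mode concrete, here is a minimal sketch that runs the regex above on a made-up snippet whose inline script contains both `<` and `>`. The class name `NaiveStripDemo` and the sample input are illustrative assumptions, not part of the original answer.

```java
final class NaiveStripDemo {

    // Same pattern as in the answer: a run of tags (with optional
    // whitespace between them) is replaced by a single space.
    static String strip(String html) {
        return html.replaceAll("(?s)<[^>]*>(\\s*<[^>]*>)*", " ");
    }

    public static void main(String[] args) {
        // Hypothetical input: the script body uses '<' and '>' as operators.
        String html = "<p>Hi</p><script>if (a < b && c > d) x();</script>";
        // The script source is not removed, and "< b && c >" is treated as a tag.
        System.out.println(strip(html));
    }
}
```

The script body leaks into the result, and the span between the two comparison operators is swallowed as if it were a tag. That is the kind of case where a real HTML parser tends to be the safer choice.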

- Sean Patrick Floyd

1

For background on why this does not work in the general case and will never be fool-proof, see: RegEx match open tags except XHTML self-contained tags. - Erwin Bolwidt

Love it... so simple, yet so powerful - George

10

You can use this line of code to remove the HTML tags and display the result as plain text.

htmlString=htmlString.replaceAll("\\<.*?\\>", "");
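Two things the one-liner above does not cover: in Java regexes `.` does not match a newline by default, so a tag that spans several lines is left in place, and character references such as `&amp;` stay encoded. The following is only a rough sketch of how both could be handled; `TagStripDemo`, `stripTags` and `decodeBasicEntities` are hypothetical names, and the entity list is deliberately tiny.

```java
final class TagStripDemo {

    // (?s) enables DOTALL so that '.' also matches line breaks inside a tag.
    static String stripTags(String html) {
        return html.replaceAll("(?s)<.*?>", "");
    }

    // Decodes only a handful of common references; "&amp;" goes last so that
    // text like "&amp;lt;" ends up as the literal "&lt;" rather than "<".
    static String decodeBasicEntities(String text) {
        return text.replace("&lt;", "<")
                   .replace("&gt;", ">")
                   .replace("&quot;", "\"")
                   .replace("&nbsp;", " ") // simplification: plain space instead of U+00A0
                   .replace("&amp;", "&");
    }

    public static void main(String[] args) {
        String html = "<p>Tom &amp; Jerry <a\nhref='x'>link</a></p>";
        System.out.println(decodeBasicEntities(stripTags(html)));
    }
}
```

A full solution would need the complete HTML entity table, which is one more reason the parser-based answers in this thread are usually preferred.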

- Kandha

8

Use Jsoup.

Add the dependency.

<dependency>
  <!-- jsoup HTML parser library @ https://jsoup.org/ -->
  <groupId>org.jsoup</groupId>
  <artifactId>jsoup</artifactId>
  <version>1.13.1</version>
</dependency>

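Now in your Java code:

import org.jsoup.Jsoup;

public static String html2text(String html) {
    // wholeText() keeps the newlines and spaces present in the source
    return Jsoup.parse(html).wholeText();
}

Simply call the html2text method with the HTML text, and it will return the plain text.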

- xxx

5

Use an HTML parser such as htmlCleaner.

For a detailed answer, see: How to remove HTML tags in Java

- ankitjaininfo

1

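I recommend running the raw HTML through jTidy, which gives you output that you can write XPath expressions against. It is the most robust way of scraping HTML that I have found.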

- Jon Freedman

1

If you want to render the HTML the way a browser would, use the following code:

import net.htmlparser.jericho.*;
import java.util.*;
import java.io.*;
import java.net.*;

public class RenderToText {
    public static void main(String[] args) throws Exception {
        String sourceUrlString="data/test.html";
        if (args.length==0)
            System.err.println("Using default argument of \""+sourceUrlString+'"');
        else
            sourceUrlString=args[0];
        if (sourceUrlString.indexOf(':')==-1) sourceUrlString="file:"+sourceUrlString;
        Source source=new Source(new URL(sourceUrlString));
        String renderedText=source.getRenderer().toString();
        System.out.println("\nSimple rendering of the HTML document:\n");
        System.out.println(renderedText);
    }
}

I hope this also helps with parsing tables, including browser-style formatting.

Thanks, Ganesh

- Ganesan Palanisamy

Could those who downvoted please explain why? - koppor

0

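I needed a plain-text representation of HTML that contains FreeMarker tags. The problem was handed to me with a JSoup solution, but JSoup escapes the FreeMarker tags, which breaks their functionality. I also tried htmlCleaner (sourceforge), but it kept the HTML header and style content (with the tags removed). https://dev59.com/Q3I_5IYBdhLWcg3wHfSr#1519726

My code: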

return new net.htmlparser.jericho.Source(html).getRenderer().setMaxLineLength(Integer.MAX_VALUE).setNewLine(null).toString();

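setMaxLineLength(Integer.MAX_VALUE) ensures that lines are not artificially wrapped at 80 characters. setNewLine(null) uses the same newline character(s) as the source.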

- John Camerin

0

With Jsoup, I got all of the text on a single line.

So I used the following code block to parse the HTML and keep the line breaks:

private String parseHTMLContent(String toString) {
    String result = toString.replaceAll("\\<.*?\\>", "\n");
    String previousResult = "";
    while(!previousResult.equals(result)){
        previousResult = result;
        result = result.replaceAll("\n\n","\n");
    }
    return result;
}

Not the best solution, but it solved my problem :)
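The approach above turns every tag into a line break, so inline markup such as <b> also splits a sentence. One possible variation, sketched here under the assumption that only <br> and a few block-level closing tags should produce a newline, is shown below. `LineBreakDemo` and `toTextKeepingBreaks` are hypothetical names, and the tag list is illustrative rather than complete.

```java
final class LineBreakDemo {

    static String toTextKeepingBreaks(String html) {
        return html
            // 1. <br> and selected block-level closing tags become a newline.
            .replaceAll("(?is)<br\\s*/?>|</(p|div|li|h[1-6]|tr)>", "\n")
            // 2. Every remaining tag (inline or opening) is dropped.
            .replaceAll("(?s)<[^>]*>", "")
            // 3. Collapse runs of blank lines and the whitespace around them.
            .replaceAll("[ \\t]*\\n\\s*", "\n")
            .trim();
    }

    public static void main(String[] args) {
        String html = "<div><p>Hello <b>big</b> world</p><p>Second<br>line</p></div>";
        System.out.println(toTextKeepingBreaks(html));
    }
}
```

Like the other regex answers, this is still subject to the caveats about scripts, CDATA and entities raised earlier in the thread.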

- Akshay More

0

I use HTMLUtil.textFromHTML(value).

<dependency>
    <groupId>org.clapper</groupId>
    <artifactId>javautil</artifactId>
    <version>3.2.0</version>
</dependency>

- Ruslanas

- Ranjit · Accepted Answer

Yes, Jsoup would be the better choice. Just do the following to convert the whole HTML text to plain text.

String plainText = Jsoup.parse(your_html_text).text();