Reading an entire HTML file into a String?

Question

38

Is there a better way to read an entire HTML file into a single String variable than the following:

    String content = "";
    try {
        BufferedReader in = new BufferedReader(new FileReader("mypage.html"));
        String str;
        while ((str = in.readLine()) != null) {
            content +=str;
        }
        in.close();
    } catch (IOException e) {
    }

- membersound

8 Answers

28

If you are using Apache Commons IO, there is a utility called IOUtils.toString(..).

If you are using Guava, there are also Files.readLines(..) and Files.toString(..).

- Johan Sjöberg

2

The first link is dead. - SpringLearner

1

Both links are dead now. - Muhammad Ramzan

7

You can use JSoup. It is a very powerful HTML parser for Java.

- SAbbasizadeh

5

As Jean said, using a StringBuilder instead of += is better. But if you are looking for something simpler, Guava, IOUtils, and Jsoup are all good options.

Example using Guava:

String content = Files.asCharSource(new File("/path/to/mypage.html"), StandardCharsets.UTF_8).read();

Example using IOUtils:

// new URL(...) requires a scheme; "file://" is used here for a local file
InputStream in = new URL("file:///path/to/mypage.html").openStream();
String content;

try {
   content = IOUtils.toString(in, StandardCharsets.UTF_8);
 } finally {
   IOUtils.closeQuietly(in);
 }

Example using Jsoup:

String content = Jsoup.parse(new File("/path/to/mypage.html"), "UTF-8").toString();

or

String content = Jsoup.parse(new File("/path/to/mypage.html"), "UTF-8").outerHtml();

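Notes:

Files.readLines() and Files.toString()

These have been deprecated since Guava 22.0 (released May 22, 2017). Use Files.asCharSource() instead, as in the example above (see the version 22.0 release notes).

IOUtils.toString(InputStream) and Charsets.UTF_8

Deprecated since Apache Commons IO 2.5 (released May 6, 2016). Pass both the InputStream and the Charset to IOUtils.toString, as in the example above. Use Java 7's StandardCharsets rather than Charsets, as in the example above. (Deprecated: Charsets.UTF_8)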

- Kat

4

I prefer using Guava:

import com.google.common.base.Charsets;
import com.google.common.io.Files;
import java.io.File;

File file = new File("/path/to/file");
// The charset is an argument of Files.toString, not of the File constructor
String content = Files.toString(file, Charsets.UTF_8);

- jknair

Note: a ) is missing after the file path. - Zoltán Umlauf

3

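For string manipulation, use the StringBuilder or StringBuffer class to accumulate chunks of string data. Do not use the += operator on String objects. The String class is immutable, so generating a large number of String objects at runtime hurts performance.

Use the .append() method of a StringBuilder/StringBuffer instance instead.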
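As a rough illustration of this advice, the sketch below joins several chunks with StringBuilder.append(). The class name `AppendSketch` and the method `join` are hypothetical names chosen for this example only. With +=, each loop iteration would create a new String and copy the accumulated text into it, whereas the builder appends into one growing buffer.

```java
// Hypothetical helper, named here only for illustration.
public class AppendSketch {

    // Concatenates all chunks using a single StringBuilder.
    public static String join(String[] chunks) {
        StringBuilder builder = new StringBuilder();
        for (String chunk : chunks) {
            // append() adds to the builder's internal buffer;
            // no intermediate String is created per iteration
            builder.append(chunk);
        }
        return builder.toString();
    }
}
```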

- user784540

0

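import org.apache.commons.io.IOUtils;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

try {
    // Load /index.html from the classpath; assumes the file is UTF-8 encoded
    var content = new String(
            IOUtils.toByteArray(this.getClass().getResource("/index.html")),
            StandardCharsets.UTF_8);
} catch (IOException e) {
    e.printStackTrace();
}

// Java 10 example, assuming index.html is in the resources folder.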

- ThmHarsh

0

Here is a solution for retrieving a web page's HTML using only the standard Java library:

import java.io.*;
import java.net.*;

String urlToRead = "https://google.com";
URL url; // The URL to read
HttpURLConnection conn; // The actual connection to the web page
BufferedReader rd; // Used to read results from the web page
String line; // An individual line of the web page HTML
String result = ""; // A long string containing all the HTML
try {
 url = new URL(urlToRead);
 conn = (HttpURLConnection) url.openConnection();
 conn.setRequestMethod("GET");
 rd = new BufferedReader(new InputStreamReader(conn.getInputStream()));
 while ((line = rd.readLine()) != null) {
  result += line;
 }
 rd.close();
} catch (Exception e) {
 e.printStackTrace();
}

System.out.println(result);
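The same idea can be written with try-with-resources, a StringBuilder, and an explicit charset. This is only a sketch: the class name `PageFetcher` and the method `fetch` are hypothetical, and it assumes the response is UTF-8 encoded, which a real page may not be. It uses the general URLConnection type, so it also accepts file: URLs.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, named here only for illustration.
public class PageFetcher {

    // Reads everything the URL returns into one String.
    // Assumes the content is UTF-8 encoded.
    public static String fetch(URL url) throws IOException {
        URLConnection conn = url.openConnection();
        StringBuilder result = new StringBuilder();
        // try-with-resources closes the reader even if reading fails
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // readLine() strips the line terminator, so add one back
                result.append(line).append('\n');
            }
        }
        return result.toString();
    }
}
```

A call would look like `String html = PageFetcher.fetch(new URL("https://google.com"));`.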

- Pedro Lobito

- Jean Logeart · Accepted Answer

You should use a StringBuilder:

StringBuilder contentBuilder = new StringBuilder();
try {
    BufferedReader in = new BufferedReader(new FileReader("mypage.html"));
    String str;
    while ((str = in.readLine()) != null) {
        contentBuilder.append(str);
    }
    in.close();
} catch (IOException e) {
}
String content = contentBuilder.toString();
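Note that readLine() strips line terminators, so the loop above joins all lines without any separator. If the original line breaks matter, one standard-library alternative is to read the file's bytes in a single call. The following is a minimal sketch, assuming Java 7 or later and a UTF-8 encoded file; the class name `WholeFileReader` and the method `readWholeFile` are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, named here only for illustration.
public class WholeFileReader {

    // Reads the whole file at once and decodes it as UTF-8 (an assumption).
    // Line terminators are preserved exactly as stored in the file.
    public static String readWholeFile(Path path) throws IOException {
        byte[] bytes = Files.readAllBytes(path);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        String content = readWholeFile(Paths.get("mypage.html"));
        System.out.println(content.length() + " characters read");
    }
}
```

On Java 11 or later, Files.readString(path) does the same in one call and decodes as UTF-8 by default.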