How to programmatically download a webpage in Java

Question

java http compression

122

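I would like to fetch a web page's HTML and save it to a String so that I can do some processing on it. Also, how can I handle the various types of compression?

How would I go about doing that in Java?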

- jjnguy

This is basically a special case of https://dev59.com/9XNA5IYBdhLWcg3wh-cC. - Robin Green

11 Answers

117

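Here is some tested code that uses Java's URL class. I'd recommend doing a better job than I do here of handling the exceptions or passing them up the call stack.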

public static void main(String[] args) {
    URL url;
    InputStream is = null;
    BufferedReader br;
    String line;

    try {
        url = new URL("http://stackoverflow.com/");
        is = url.openStream();  // throws an IOException
        br = new BufferedReader(new InputStreamReader(is));

        while ((line = br.readLine()) != null) {
            System.out.println(line);
        }
    } catch (MalformedURLException mue) {
         mue.printStackTrace();
    } catch (IOException ioe) {
         ioe.printStackTrace();
    } finally {
        try {
            if (is != null) is.close();
        } catch (IOException ioe) {
            // nothing to see here
        }
    }
}

- Bill the Lizard

16

DataInputStream.readLine() is deprecated, but other than that this is a very good example. I used an InputStreamReader() wrapped in a BufferedReader() to get the readLine() function. - mjh2007

2

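This does not take character encoding into account, so although it appears to work for ASCII text, it will eventually produce "strange characters" when there is a mismatch. - artbristol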

1

@akapelko Thank you. I updated my answer to remove the calls to deprecated methods. - Bill the Lizard

2

What about closing the InputStreamReader? - Alexander

If you need to collect all the lines and join them together, use StringBuilder's append(line) method instead of System.out.println(line); that is the most efficient way to combine them. - Kirill Karmazin
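
The comments above raise three points: the character encoding, closing the reader, and collecting the lines with a StringBuilder. A minimal sketch that combines them could look like the following. The class name PageReader and its method names are hypothetical, and fetch() simply assumes the page is UTF-8, which will not hold for every site.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; the name PageReader does not come from the thread.
class PageReader {

    // Reads the whole stream into a String, decoding it with an explicit charset.
    static String readAll(InputStream in, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        // try-with-resources closes the reader and the stream it wraps
        try (BufferedReader br = new BufferedReader(new InputStreamReader(in, charset))) {
            String line;
            while ((line = br.readLine()) != null) {
                // readLine() strips the line terminator, so add one back
                sb.append(line).append('\n');
            }
        }
        return sb.toString();
    }

    // Assumption: the page is encoded as UTF-8.
    static String fetch(String address) throws IOException {
        return readAll(new URL(address).openStream(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: PageReader <url>");
            return;
        }
        System.out.print(fetch(args[0]));
    }
}
```

Because readAll() takes any InputStream, it can also be fed a decompressing stream or an in-memory stream.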

27

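Bill's answer is very good, but you may want to do some things with the request, such as compression or user agents. The following code shows how you can add various types of compression to your requests.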

URL url = new URL(urlStr);
HttpURLConnection conn = (HttpURLConnection) url.openConnection(); // Cast shouldn't fail
HttpURLConnection.setFollowRedirects(true);
// allow both GZip and Deflate (ZLib) encodings
conn.setRequestProperty("Accept-Encoding", "gzip, deflate");
String encoding = conn.getContentEncoding();
InputStream inStr = null;

// create the appropriate stream wrapper based on
// the encoding type
if (encoding != null && encoding.equalsIgnoreCase("gzip")) {
    inStr = new GZIPInputStream(conn.getInputStream());
} else if (encoding != null && encoding.equalsIgnoreCase("deflate")) {
    inStr = new InflaterInputStream(conn.getInputStream(),
      new Inflater(true));
} else {
    inStr = conn.getInputStream();
}

To also set the user agent, add the following code:

conn.setRequestProperty ( "User-agent", "my agent name");

- jjnguy

For anyone who wants to convert the InputStream to a String, see this answer. - SE Does Not Like Dissent

setFollowRedirects helps. In my case I use setInstanceFollowRedirects; before using it, I was getting empty pages in many cases. I assume you are trying to use compression to download files faster. - gouessej
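
The stream-selection logic in the answer above can be pulled into a small helper so it can be tried without a network connection. This is only a sketch: ContentDecoder and decode() are hypothetical names. Note one difference from the answer: it treats "deflate" as zlib-wrapped data (the default InflaterInputStream behaviour), whereas the answer uses new Inflater(true) for raw deflate. Which variant a given server sends is an assumption you would need to check.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;
import java.util.zip.InflaterInputStream;

// Hypothetical helper; the name ContentDecoder does not come from the thread.
class ContentDecoder {

    // Wraps the raw response stream according to the Content-Encoding value.
    static InputStream decode(InputStream raw, String contentEncoding) throws IOException {
        if (contentEncoding == null) {
            return raw; // no Content-Encoding header: body is not compressed
        }
        switch (contentEncoding.trim().toLowerCase(Locale.ROOT)) {
            case "gzip":
                return new GZIPInputStream(raw);
            case "deflate":
                // Assumption: zlib-wrapped data; use new Inflater(true) for raw deflate
                return new InflaterInputStream(raw);
            default:
                return raw; // unknown encoding: hand the bytes back untouched
        }
    }

    // Offline demo: gzip a string in memory, then decode it again.
    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
            gzip.write("<html>compressed page</html>".getBytes(StandardCharsets.UTF_8));
        }
        try (InputStream in = decode(new ByteArrayInputStream(buffer.toByteArray()), "gzip")) {
            System.out.println(new String(in.readAllBytes(), StandardCharsets.UTF_8));
        }
    }
}
```

With a real connection the call would be along the lines of ContentDecoder.decode(conn.getInputStream(), conn.getContentEncoding()).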

13

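You can use the built-in libraries such as URL and URLConnection, but they don't give you very much control.

~~Personally I'd recommend the Apache HTTPClient library.~~
Edit: Apache has declared HTTPClient end of life. The replacement is HTTP Components.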

- Jon Skeet

Is there no Java version of System.Net.WebRequest? - FlySwat

1

That thing would be called URL. :-) For example: new URL("http://www.google.com").openStream() // => InputStream - Daniel Spiewak

1

@Jonathan: Mostly what Daniel said, although WebRequest gives you more control than URL. In my view HTTPClient is closer in functionality. - Jon Skeet

9

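None of the approaches mentioned above downloads the web page text as it appears in the browser. These days a lot of data is loaded into the browser by scripts in the HTML page. None of the techniques above supports scripts; they only download the HTML text. HtmlUnit supports JavaScript. So if you want to download the web page text as it looks in the browser, you should use HtmlUnit.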

- user3690910

2

You will most likely need to extract code from a secure web page (https protocol). In the following example, the HTML file is saved to c:\temp\filename.html. Enjoy!

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;

import javax.net.ssl.HttpsURLConnection;

/**
 * <b>Get the Html source from the secure url </b>
 */
public class HttpsClientUtil {
    public static void main(String[] args) throws Exception {
        String httpsURL = "https://stackoverflow.com";
        String FILENAME = "c:\\temp\\filename.html";
        BufferedWriter bw = new BufferedWriter(new FileWriter(FILENAME));
        URL myurl = new URL(httpsURL);
        HttpsURLConnection con = (HttpsURLConnection) myurl.openConnection();
        con.setRequestProperty ( "User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:63.0) Gecko/20100101 Firefox/63.0" );
        InputStream ins = con.getInputStream();
        InputStreamReader isr = new InputStreamReader(ins, "Windows-1252");
        BufferedReader in = new BufferedReader(isr);
        String inputLine;

        // Write each line into the file
        while ((inputLine = in.readLine()) != null) {
            System.out.println(inputLine);
            bw.write(inputLine);
            bw.newLine(); // readLine() strips the line terminator, so restore it
        }
        in.close(); 
        bw.close();
    }
}

- Supercoder
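
The example above hard-codes "Windows-1252" as the charset. One possible alternative, sketched here under the assumption that the server declares the encoding in its Content-Type header, is to parse the charset parameter from the value returned by con.getContentType() and fall back to a default otherwise. CharsetSniffer and fromContentType() are hypothetical names, and the parsing is deliberately simplistic.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; the name CharsetSniffer does not come from the thread.
class CharsetSniffer {

    // Extracts the charset parameter from a Content-Type header value,
    // e.g. "text/html; charset=ISO-8859-1". Returns the fallback when the
    // header is missing, has no charset parameter, or names an unknown charset.
    static Charset fromContentType(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            // case-insensitive check for the "charset=" prefix
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                String name = p.substring(8).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // illegal or unsupported charset name
                    return fallback;
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        // Sample header value chosen for illustration only
        String header = "text/html; charset=ISO-8859-1";
        System.out.println(fromContentType(header, StandardCharsets.UTF_8));
    }
}
```

The result could then be passed to the InputStreamReader constructor in place of the fixed charset name. Pages that declare their encoding only in a meta tag are not covered by this sketch.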

1

This can be done with the powerful NIO.2 method Files.copy(InputStream in, Path target):

URL url = new URL( "http://download.me/" );
Files.copy( url.openStream(), Paths.get("downloaded.html" ) );

- Jan Tibar
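
Two details of the snippet above are worth knowing: Files.copy(InputStream, Path) throws FileAlreadyExistsException if the target already exists, and it does not close the input stream. A small sketch that handles both could look like this. PageSaver and save() are hypothetical names, and overwriting an existing file is an assumption about what you want.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

// Hypothetical helper; the name PageSaver does not come from the thread.
class PageSaver {

    // Copies the stream to the target file, overwriting it if it exists,
    // and returns the number of bytes written.
    static long save(InputStream source, Path target) throws IOException {
        // try-with-resources closes the stream; Files.copy does not do so itself
        try (InputStream in = source) {
            return Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: PageSaver <url> <file>");
            return;
        }
        long bytes = save(new URL(args[0]).openStream(), Paths.get(args[1]));
        System.out.println("Saved " + bytes + " bytes to " + args[1]);
    }
}
```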

0

Get help from this class; it fetches the code and filters some of the information.

public class MainActivity extends AppCompatActivity {

    EditText url;
    @Override
    protected void onCreate(Bundle savedInstanceState) {
        super.onCreate( savedInstanceState );
        setContentView( R.layout.activity_main );

        url = ((EditText)findViewById( R.id.editText));
        DownloadCode obj = new DownloadCode();

        try {
            String des=" ";

            String tag1= "<div class=\"description\">";
            String l = obj.execute( "http://www.nu.edu.pk/Campus/Chiniot-Faisalabad/Faculty" ).get();

            url.setText( l );
            url.setText( " " );

            String[] t1 = l.split(tag1);
            String[] t2 = t1[0].split( "</div>" );
            url.setText( t2[0] );

        }
        catch (Exception e)
        {
            Toast.makeText( this,e.toString(),Toast.LENGTH_SHORT ).show();
        }

    }
                                        // input, extrafunctionrunparallel, output
    class DownloadCode extends AsyncTask<String,Void,String>
    {
        @Override
        protected String doInBackground(String... WebAddress) // string of webAddress separate by ','
        {
            String htmlcontent = " ";
            try {
                URL url = new URL( WebAddress[0] );
                HttpURLConnection c = (HttpURLConnection) url.openConnection();
                c.connect();
                InputStream input = c.getInputStream();
                int data;
                InputStreamReader reader = new InputStreamReader( input );

                data = reader.read();

                while (data != -1)
                {
                    char content = (char) data;
                    htmlcontent+=content;
                    data = reader.read();
                }
            }
            catch (Exception e)
            {
                Log.i("Status : ",e.toString());
            }
            return htmlcontent;
        }
    }
}

- Sohaib Aslam

0

Jetty has an HTTP client that can be used to download a web page.

package com.zetcode;

import org.eclipse.jetty.client.HttpClient;
import org.eclipse.jetty.client.api.ContentResponse;

public class ReadWebPageEx5 {

    public static void main(String[] args) throws Exception {

        HttpClient client = null;

        try {

            client = new HttpClient();
            client.start();
            
            String url = "http://example.com";

            ContentResponse res = client.GET(url);

            System.out.println(res.getContentAsString());

        } finally {

            if (client != null) {

                client.stop();
            }
        }
    }
}

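This example prints the contents of a simple web page.

In my Reading a web page in Java tutorial, I have written six examples of downloading a web page programmatically in Java using URL, JSoup, HtmlCleaner, Apache HttpClient, Jetty HttpClient, and HtmlUnit.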

- Jan Bodnar

0

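On a Unix/Linux box you could just run 'wget', but that is not really an option if you are writing a cross-platform client. Of course, this assumes that you don't want to do much with the data between downloading it and writing it to disk.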

- Timo Geusch

I would also start with this approach and refactor it later if it turns out to be insufficient. - Dustin Getz

- BalusC · Accepted Answer

I'd use a decent HTML parser like Jsoup. It's then as easy as:

String html = Jsoup.connect("http://stackoverflow.com").get().html();

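It handles GZIP and chunked responses as well as character encoding fully transparently. It offers more advantages too, such as HTML traversal and manipulation with CSS selectors, as jQuery does. You only have to grab it as a Document rather than a String.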

Document document = Jsoup.connect("http://google.com").get();

You really don't want to run basic String methods or even regex on HTML to process it.

See also:

What are the pros and cons of the leading HTML parsers in Java?