Jsoup does not download the complete page

Question

The page is http://www.hkex.com.hk/eng/market/sec_tradinfo/stockcode/eisdeqty_pf.htm. I want to extract every <tr class="tr_normal"> element with Jsoup.

The code I am using is:

Document doc = Jsoup.connect(url).get();
Elements es = doc.getElementsByClass("tr_normal");
System.out.println(es.size());

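However, the reported size (1350) is smaller than the actual number of rows (1452). I saved the page to my computer and deleted some of the <tr> elements; running the same code against that copy gave the correct count. It looks as if there are too many elements for jsoup to read them all?

So what is going on? Thanks!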

- samsara

1 Answer

- hutingung · Accepted Answer

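The problem lies in Jsoup's internal HTTP connection handling; the selector engine is fine. My first guess, before looking any deeper, was that a home-grown way of handling HTTP connections is always a source of trouble, so I suggested replacing it with HttpClient (http://hc.apache.org/) or, if you cannot add HttpClient as a dependency, checking how the Jsoup source handles the HTTP connection.

Update: the actual cause is the default maxBodySize of Jsoup's Connection. The response body is cut off once it reaches that limit (1 MB by default, per the comment in the test program), so the rows past the cut-off never reach the parser. Raising the limit with maxBodySize(...), or passing 0 for no limit, returns all 1452 rows. I am still keeping the HttpClient code as an example.

Output of the test program listed at the end of this answer:

Loaded from file = 1452
Loaded via HttpClient = 1452
Loaded via Jsoup connection = 1350
Loaded via Jsoup connection with maxBodySize = 1452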
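
The gap between the two Jsoup counts is what a byte cap on the response body would produce: rows after the cut-off are simply never seen. The following is a rough simulation of that effect using only the JDK. It is not Jsoup's implementation; the class BodyCapDemo, the helpers capBody, countRows and buildPage, the synthetic page, and the 90% cap are all assumptions made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class BodyCapDemo {

    // Hypothetical helper: keeps at most maxBytes of the body; 0 means "no limit".
    public static byte[] capBody(byte[] body, int maxBytes) {
        if (maxBytes == 0 || body.length <= maxBytes) {
            return body;
        }
        return Arrays.copyOf(body, maxBytes);
    }

    // Counts complete opening tags of the rows the question is looking for.
    public static int countRows(String html) {
        String marker = "<tr class=\"tr_normal\">";
        int count = 0;
        int from = 0;
        while ((from = html.indexOf(marker, from)) != -1) {
            count++;
            from += marker.length();
        }
        return count;
    }

    // Builds a synthetic ASCII-only page; the row content is invented for the demo.
    public static String buildPage(int rows) {
        StringBuilder sb = new StringBuilder("<html><body><table>");
        for (int i = 0; i < rows; i++) {
            sb.append("<tr class=\"tr_normal\"><td>").append(i).append("</td></tr>");
        }
        return sb.append("</table></body></html>").toString();
    }

    public static void main(String[] args) {
        byte[] body = buildPage(1452).getBytes(StandardCharsets.UTF_8);
        int cap = body.length * 9 / 10; // pretend the limit is 90% of the page size

        String capped = new String(capBody(body, cap), StandardCharsets.UTF_8);
        String full = new String(capBody(body, 0), StandardCharsets.UTF_8);

        System.out.println("capped rows = " + countRows(capped));
        System.out.println("full rows   = " + countRows(full));
    }
}
```

With the cap in place the count falls short of 1452, while a cap of 0 (no limit) recovers every row. This also fits the observation in the question that deleting some rows from a saved copy made the count correct: a smaller page fits under the limit.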

package test;

import java.io.IOException;
import java.io.InputStream;

import org.apache.http.HttpResponse;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.HttpClientBuilder;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class TestJsoup {

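    /**
     * Counts the tr_normal rows of the same page loaded four different ways.
     *
     * @param args unused
     * @throws IOException if the page cannot be read or fetched
     */
    public static void main(String[] args) throws IOException {
        // 1. Saved copy of the page on the classpath.
        Document doc = Jsoup.parse(loadContentFromClasspath(), "UTF-8", "");
        Elements es = doc.getElementsByClass("tr_normal");
        System.out.println("Loaded from file = " + es.size());

        // 2. Fetched with Apache HttpClient, parsed by Jsoup.
        doc = Jsoup.parse(loadContentByHttpClient(), "UTF-8", "");
        es = doc.getElementsByClass("tr_normal");
        System.out.println("Loaded via HttpClient = " + es.size());

        // 3. Fetched with Jsoup's own connection and its default body limit.
        String url = "http://www.hkex.com.hk/eng/market/sec_tradinfo"
                + "/stockcode/eisdeqty_pf.htm";
        doc = Jsoup.connect(url).get();
        es = doc.getElementsByClass("tr_normal");
        System.out.println("Loaded via Jsoup connection = " + es.size());

        // 4. Same connection with a larger limit:
        // about 2 MB (the default is 1 MB); 0 means no size limit.
        int maxBodySize = 2048000;
        doc = Jsoup.connect(url).maxBodySize(maxBodySize).get();
        es = doc.getElementsByClass("tr_normal");
        System.out.println("Loaded via Jsoup connection with maxBodySize = " + es.size());
    }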

    // Fetches the page with Apache HttpClient; the caller owns the returned stream.
    public static InputStream loadContentByHttpClient()
            throws ClientProtocolException, IOException {
        String url = "http://www.hkex.com.hk/eng/market/sec_tradinfo"
                + "/stockcode/eisdeqty_pf.htm";
        HttpClient client = HttpClientBuilder.create().build();
        HttpGet request = new HttpGet(url);
        HttpResponse response = client.execute(request);
        return response.getEntity().getContent();
    }

    // Expects a saved copy of the page named eisdeqty_pf.htm on the classpath;
    // getResourceAsStream returns null if the resource is missing.
    public static InputStream loadContentFromClasspath() {
        return TestJsoup.class.getClassLoader().getResourceAsStream(
                "eisdeqty_pf.htm");
    }

}
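
The two load helpers in the listing return open InputStreams, and nothing in the listing closes them explicitly. One possible way to tidy that up, sketched with the JDK only, is to drain the stream into a String inside try-with-resources before parsing. StreamDrain and readAll are hypothetical names, and the sketch assumes the content is UTF-8.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class StreamDrain {

    // Hypothetical helper: reads the whole stream as UTF-8 text, then closes it.
    public static String readAll(InputStream in) {
        try (InputStream source = in;
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            byte[] buffer = new byte[8192];
            int n;
            while ((n = source.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            // Rethrown unchecked so callers do not need a throws clause.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for a real response stream.
        InputStream in = new ByteArrayInputStream(
                "<tr class=\"tr_normal\"></tr>".getBytes(StandardCharsets.UTF_8));
        String html = readAll(in);
        System.out.println(html.length() + " characters read");
    }
}
```

The resulting string could then be passed to Jsoup.parse(String) in place of the stream-based overload used in the listing.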