HtmlUnit cannot create an HtmlPage object

4
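I'm a beginner learning HtmlUnit, and I'm trying to scrape a website that uses JavaScript to modify its markup. I've heard that HtmlUnit is the best way to do this because it acts as a headless browser and returns the final markup.
However, as you can see, even creating an HtmlPage object throws a huge and incomprehensible exception (at least to me, with almost no HtmlUnit experience).
Here is my code: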
import java.io.IOException;

import com.gargoylesoftware.htmlunit.*;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class Main {

    public static void main(String[] args) {
        Main scraper = new Main();
        scraper.testingGargoyle();


    }

    private void testingGargoyle() {
        String myUrl = "https://www.wearvr.com/#game_id=game_4";
        WebClient webClient = new WebClient();
        try {
            HtmlPage myPage = ((HtmlPage) webClient.getPage(myUrl));
        } catch (FailingHttpStatusCodeException | IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }
}

Here is the exception that is thrown:

Apr 30, 2015 5:43:50 PM com.gargoylesoftware.htmlunit.IncorrectnessListenerImpl notify
WARNING: Obsolete content type encountered: 'application/x-javascript'.
Apr 30, 2015 5:43:50 PM com.gargoylesoftware.htmlunit.javascript.StrictErrorReporter runtimeError
SEVERE: runtimeError: message=[The data necessary to complete this operation is not yet available.] sourceName=[https://load.sumome.com/] line=[1] lineSource=[null] lineOffset=[0]
Exception in thread "main" ======= EXCEPTION START ========
EcmaError: lineNumber=[19] column=[0] lineSource=[<no source>] name=[TypeError] sourceName=[https://www.wearvr.com/assets/scripts/bundle.b4038a088bb1abfcf55c.js] message=[TypeError: Cannot find function bind in object function (e, n, r) {...}. (https://www.wearvr.com/assets/scripts/bundle.b4038a088bb1abfcf55c.js#19)]
com.gargoylesoftware.htmlunit.ScriptException: TypeError: Cannot find function bind in object function (e, n, r) {...}. (https://www.wearvr.com/assets/scripts/bundle.b4038a088bb1abfcf55c.js#19)
    at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$HtmlUnitContextAction.run(JavaScriptEngine.java:847)
    at net.sourceforge.htmlunit.corejs.javascript.Context.call(Context.java:620)
    at net.sourceforge.htmlunit.corejs.javascript.ContextFactory.call(ContextFactory.java:513)
    at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.execute(JavaScriptEngine.java:733)
    at com.gargoylesoftware.htmlunit.html.HtmlPage.loadExternalJavaScriptFile(HtmlPage.java:1096)
    at com.gargoylesoftware.htmlunit.html.HtmlScript.executeScriptIfNeeded(HtmlScript.java:395)
    at com.gargoylesoftware.htmlunit.html.HtmlScript$3.execute(HtmlScript.java:270)
    at com.gargoylesoftware.htmlunit.html.HtmlScript.onAllChildrenAddedToPage(HtmlScript.java:290)
    at com.gargoylesoftware.htmlunit.html.HTMLParser$HtmlUnitDOMBuilder.endElement(HTMLParser.java:793)
    at org.apache.xerces.parsers.AbstractSAXParser.endElement(Unknown Source)
    at com.gargoylesoftware.htmlunit.html.HTMLParser$HtmlUnitDOMBuilder.endElement(HTMLParser.java:751)
    at org.cyberneko.html.HTMLTagBalancer.callEndElement(HTMLTagBalancer.java:1170)
    at org.cyberneko.html.HTMLTagBalancer.endElement(HTMLTagBalancer.java:1072)
    at org.cyberneko.html.filters.DefaultFilter.endElement(DefaultFilter.java:206)
    at org.cyberneko.html.filters.NamespaceBinder.endElement(NamespaceBinder.java:330)
    at org.cyberneko.html.HTMLScanner$ContentScanner.scanEndElement(HTMLScanner.java:3126)
    at org.cyberneko.html.HTMLScanner$ContentScanner.scan(HTMLScanner.java:2093)
    at org.cyberneko.html.HTMLScanner.scanDocument(HTMLScanner.java:920)
    at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:499)
    at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:452)
    at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
    at com.gargoylesoftware.htmlunit.html.HTMLParser$HtmlUnitDOMBuilder.parse(HTMLParser.java:1017)
    at com.gargoylesoftware.htmlunit.html.HTMLParser.parse(HTMLParser.java:248)
    at com.gargoylesoftware.htmlunit.html.HTMLParser.parseHtml(HTMLParser.java:194)
    at com.gargoylesoftware.htmlunit.DefaultPageCreator.createHtmlPage(DefaultPageCreator.java:268)
    at com.gargoylesoftware.htmlunit.DefaultPageCreator.createPage(DefaultPageCreator.java:156)
    at com.gargoylesoftware.htmlunit.WebClient.loadWebResponseInto(WebClient.java:471)
    at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:345)
    at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:410)
    at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:395)
    at Main.testingGargoyle(Main.java:19)
    at Main.main(Main.java:10)
Caused by: net.sourceforge.htmlunit.corejs.javascript.EcmaError: TypeError: Cannot find function bind in object function (e, n, r) {...}. (https://www.wearvr.com/assets/scripts/bundle.b4038a088bb1abfcf55c.js#19)
    at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.constructError(ScriptRuntime.java:3629)
    at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.constructError(ScriptRuntime.java:3613)
    at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.typeError(ScriptRuntime.java:3634)
    at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.typeError2(ScriptRuntime.java:3650)
    at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.notFunctionError(ScriptRuntime.java:3714)
    at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.getPropFunctionAndThisHelper(ScriptRuntime.java:2233)
    at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.getPropFunctionAndThis(ScriptRuntime.java:2215)
    at net.sourceforge.htmlunit.corejs.javascript.Interpreter.interpretLoop(Interpreter.java:1333)
    at net.sourceforge.htmlunit.corejs.javascript.Interpreter.interpret(Interpreter.java:798)
    at net.sourceforge.htmlunit.corejs.javascript.InterpretedFunction.call(InterpretedFunction.java:105)
    at net.sourceforge.htmlunit.corejs.javascript.ContextFactory.doTopCall(ContextFactory.java:411)
    at com.gargoylesoftware.htmlunit.javascript.HtmlUnitContextFactory.doTopCall(HtmlUnitContextFactory.java:309)
    at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.doTopCall(ScriptRuntime.java:3057)
    at net.sourceforge.htmlunit.corejs.javascript.InterpretedFunction.exec(InterpretedFunction.java:115)
    at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$3.doRun(JavaScriptEngine.java:724)
    at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$HtmlUnitContextAction.run(JavaScriptEngine.java:832)
    ... 31 more
Enclosed exception: 
net.sourceforge.htmlunit.corejs.javascript.EcmaError: TypeError: Cannot find function bind in object function (e, n, r) {...}. (https://www.wearvr.com/assets/scripts/bundle.b4038a088bb1abfcf55c.js#19)
    at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.constructError(ScriptRuntime.java:3629)
    at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.constructError(ScriptRuntime.java:3613)
    at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.typeError(ScriptRuntime.java:3634)
    at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.typeError2(ScriptRuntime.java:3650)
    at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.notFunctionError(ScriptRuntime.java:3714)
    at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.getPropFunctionAndThisHelper(ScriptRuntime.java:2233)
    at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.getPropFunctionAndThis(ScriptRuntime.java:2215)
    at net.sourceforge.htmlunit.corejs.javascript.Interpreter.interpretLoop(Interpreter.java:1333)
    at script(https://www.wearvr.com/assets/scripts/bundle.b4038a088bb1abfcf55c.js:19)
    at script.r(https://www.wearvr.com/assets/scripts/bundle.b4038a088bb1abfcf55c.js:16)
    at script.r(https://www.wearvr.com/assets/scripts/bundle.b4038a088bb1abfcf55c.js:384)
    at script(https://www.wearvr.com/assets/scripts/bundle.b4038a088bb1abfcf55c.js:16)
    at script(https://www.wearvr.com/assets/scripts/bundle.b4038a088bb1abfcf55c.js:16)
    at script.t(https://www.wearvr.com/assets/scripts/bundle.b4038a088bb1abfcf55c.js:1)
    at script(https://www.wearvr.com/assets/scripts/bundle.b4038a088bb1abfcf55c.js:16)
    at script(https://www.wearvr.com/assets/scripts/bundle.b4038a088bb1abfcf55c.js:16)
    at script.t(https://www.wearvr.com/assets/scripts/bundle.b4038a088bb1abfcf55c.js:1)
    at script(https://www.wearvr.com/assets/scripts/bundle.b4038a088bb1abfcf55c.js:7)
    at script.t(https://www.wearvr.com/assets/scripts/bundle.b4038a088bb1abfcf55c.js:1)
    at script(https://www.wearvr.com/assets/scripts/bundle.b4038a088bb1abfcf55c.js:463)
    at script(https://www.wearvr.com/assets/scripts/bundle.b4038a088bb1abfcf55c.js:463)
    at script.t(https://www.wearvr.com/assets/scripts/bundle.b4038a088bb1abfcf55c.js:1)
    at script(https://www.wearvr.com/assets/scripts/bundle.b4038a088bb1abfcf55c.js:1)
    at script.t(https://www.wearvr.com/assets/scripts/bundle.b4038a088bb1abfcf55c.js:1)
    at script(https://www.wearvr.com/assets/scripts/bundle.b4038a088bb1abfcf55c.js:1)
    at script(https://www.wearvr.com/assets/scripts/bundle.b4038a088bb1abfcf55c.js:1)
    at net.sourceforge.htmlunit.corejs.javascript.Interpreter.interpret(Interpreter.java:798)
    at net.sourceforge.htmlunit.corejs.javascript.InterpretedFunction.call(InterpretedFunction.java:105)
    at net.sourceforge.htmlunit.corejs.javascript.ContextFactory.doTopCall(ContextFactory.java:411)
    at com.gargoylesoftware.htmlunit.javascript.HtmlUnitContextFactory.doTopCall(HtmlUnitContextFactory.java:309)
    at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.doTopCall(ScriptRuntime.java:3057)
    at net.sourceforge.htmlunit.corejs.javascript.InterpretedFunction.exec(InterpretedFunction.java:115)
    at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$3.doRun(JavaScriptEngine.java:724)
    at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$HtmlUnitContextAction.run(JavaScriptEngine.java:832)
    at net.sourceforge.htmlunit.corejs.javascript.Context.call(Context.java:620)
    at net.sourceforge.htmlunit.corejs.javascript.ContextFactory.call(ContextFactory.java:513)
    at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.execute(JavaScriptEngine.java:733)
    at com.gargoylesoftware.htmlunit.html.HtmlPage.loadExternalJavaScriptFile(HtmlPage.java:1096)
    at com.gargoylesoftware.htmlunit.html.HtmlScript.executeScriptIfNeeded(HtmlScript.java:395)
    at com.gargoylesoftware.htmlunit.html.HtmlScript$3.execute(HtmlScript.java:270)
    at com.gargoylesoftware.htmlunit.html.HtmlScript.onAllChildrenAddedToPage(HtmlScript.java:290)
    at com.gargoylesoftware.htmlunit.html.HTMLParser$HtmlUnitDOMBuilder.endElement(HTMLParser.java:793)
    at org.apache.xerces.parsers.AbstractSAXParser.endElement(Unknown Source)
    at com.gargoylesoftware.htmlunit.html.HTMLParser$HtmlUnitDOMBuilder.endElement(HTMLParser.java:751)
    at org.cyberneko.html.HTMLTagBalancer.callEndElement(HTMLTagBalancer.java:1170)
    at org.cyberneko.html.HTMLTagBalancer.endElement(HTMLTagBalancer.java:1072)
    at org.cyberneko.html.filters.DefaultFilter.endElement(DefaultFilter.java:206)
    at org.cyberneko.html.filters.NamespaceBinder.endElement(NamespaceBinder.java:330)
    at org.cyberneko.html.HTMLScanner$ContentScanner.scanEndElement(HTMLScanner.java:3126)
    at org.cyberneko.html.HTMLScanner$ContentScanner.scan(HTMLScanner.java:2093)
    at org.cyberneko.html.HTMLScanner.scanDocument(HTMLScanner.java:920)
    at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:499)
    at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:452)
    at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
    at com.gargoylesoftware.htmlunit.html.HTMLParser$HtmlUnitDOMBuilder.parse(HTMLParser.java:1017)
    at com.gargoylesoftware.htmlunit.html.HTMLParser.parse(HTMLParser.java:248)
    at com.gargoylesoftware.htmlunit.html.HTMLParser.parseHtml(HTMLParser.java:194)
    at com.gargoylesoftware.htmlunit.DefaultPageCreator.createHtmlPage(DefaultPageCreator.java:268)
    at com.gargoylesoftware.htmlunit.DefaultPageCreator.createPage(DefaultPageCreator.java:156)
    at com.gargoylesoftware.htmlunit.WebClient.loadWebResponseInto(WebClient.java:471)
    at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:345)
    at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:410)
    at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:395)
    at Main.testingGargoyle(Main.java:19)
    at Main.main(Main.java:10)
======= EXCEPTION END ========

I told you it was big. How do I get around it and obtain the page's final source so that I can scrape it?

Thanks in advance!

2 Answers

8

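Exceptions are thrown for many reasons: malformed HTML, errors in the page's scripts, or resources that cannot be found, such as CSS, script, or image files (for example <img src="bla.gif"> where bla.gif returns HTTP 404).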
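To see which of these reasons applies, it usually helps to look at the innermost cause instead of reading the whole trace. In the question's trace that is the "Caused by:" entry, a TypeError raised inside the site's script bundle. A minimal sketch using only the JDK; `RootCauseSketch` and `rootCause` are hypothetical names, and the exceptions in `main` are stand-ins rather than HtmlUnit classes:

```java
final class RootCauseSketch {

    /** Follows getCause() links and returns the innermost exception. */
    static Throwable rootCause(Throwable t) {
        Throwable current = t;
        // Stop at the last link; the identity check guards against a cause that returns itself.
        while (current.getCause() != null && current.getCause() != current) {
            current = current.getCause();
        }
        return current;
    }

    public static void main(String[] args) {
        // Stand-ins for a wrapper exception and the script error it wraps.
        Exception wrapper = new RuntimeException("page load failed",
                new IllegalStateException("TypeError: Cannot find function bind"));
        System.out.println(rootCause(wrapper).getMessage());
    }
}
```

Calling `rootCause(e)` inside the catch block and printing only its message gives a one-line summary of what actually failed.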

We use these options so that HtmlUnit keeps navigating instead of stopping at the first error or problem:

webClient.getOptions().setThrowExceptionOnScriptError(false);
webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);

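You can stop HtmlUnit from printing details about CSS/JavaScript errors to the console by supplying handlers that do nothing. Example code:
webClient.setCssErrorHandler(new SilentCssErrorHandler());
webClient.setJavaScriptErrorListener(new JavaScriptErrorListener() { /* implement each listener method with an empty body */ });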

Here is a small sample test case:

@Test
public void TestCall() throws FailingHttpStatusCodeException, MalformedURLException, IOException {
    WebClient webClient = new WebClient(BrowserVersion.CHROME);
    webClient.getOptions().setUseInsecureSSL(true); // ignore SSL certificate errors
    webClient.getOptions().setThrowExceptionOnScriptError(false);
    webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
    String url = "https://www.wearvr.com/#game_id=game_4";
    HtmlPage myPage = webClient.getPage(url);
    // give background JavaScript time to finish before reading the page
    webClient.waitForBackgroundJavaScriptStartingBefore(200);
    webClient.waitForBackgroundJavaScript(20000);
    // work with the page here, e.g. myPage.getElementById("main")
    // myPage.asXml() returns the markup (tags and elements)
    System.out.println(myPage.asText());
}
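
The two wait calls give background JavaScript (timers and asynchronous requests) a chance to finish before the page is read: waitForBackgroundJavaScript(20000) blocks until no background tasks remain or 20 seconds have passed, whichever comes first. The general idea, waiting for a condition with an upper time bound, can be sketched with the JDK alone. `WaitSketch` and `waitUntil` are hypothetical names, and this is an illustration of the pattern, not HtmlUnit's implementation:

```java
import java.util.function.BooleanSupplier;

final class WaitSketch {

    /**
     * Polls the condition until it holds or the timeout elapses.
     * Returns true if the condition was met, false on timeout or interruption.
     */
    static boolean waitUntil(BooleanSupplier done, long timeoutMillis, long pollMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (!done.getAsBoolean()) {
            if (System.nanoTime() - deadline >= 0) {
                return false; // time budget used up
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // keep the interrupt flag for the caller
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        // Stand-in condition: "work" is finished once 50 ms have passed.
        boolean finished = waitUntil(() -> System.currentTimeMillis() - start >= 50, 1000, 10);
        System.out.println("finished=" + finished);
    }
}
```

A bounded wait like this returns early when the page settles quickly, which is why a generous limit such as 20000 ms does not necessarily make every call slow.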

There is no explanation of the calls webClient.waitForBackgroundJavaScriptStartingBefore(200); webClient.waitForBackgroundJavaScript(20000); - Adrien
1
This is great! Given the length of the exception I thought it was hopeless, but thank you! - quantum285
2
webClient.setJavaScriptErrorListener(new SilentJavaScriptErrorListener()); sets a silent JavaScript error listener on the web client. - Luigi Rubino

1

Try another browser version, for example:

String myUrl = "https://www.wearvr.com/#game_id=game_4";
try (WebClient webClient = new WebClient(BrowserVersion.CHROME)) {
    HtmlPage myPage = ((HtmlPage) webClient.getPage(myUrl));
    System.out.println(myPage.asXml());
} catch (FailingHttpStatusCodeException | IOException e) {
    e.printStackTrace();
}

However, this may also be a bug in the IE8 emulation.

Yes, using the latest snapshot. - Ahmed Ashour
