Scraping JavaScript pages with PyQt5 and QWebEngineView

Question

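I am trying to render a JavaScript page into populated HTML so that I can scrape it. Researching different solutions (selenium, reverse-engineering the page, etc.) led me to this technique, but I can't get it to work. By the way, I am new to Python and basically still at the cut/paste/experiment stage. I got past the installation and indentation issues, but now I'm stuck.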

In the test code below, print(sample_html) works and returns the raw HTML of the target page, but print(render(sample_html)) always returns the word "None".

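Interestingly, if I run this against Amazon, they detect that it is not a real browser and return HTML with a warning about automated access. The other test pages, however, serve real HTML that should render, yet it does not.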

How do I fix the result always being "None"?

def render(source_html):
    """Fully render HTML, JavaScript and all."""

    import sys
    from PyQt5.QtWidgets import QApplication
    from PyQt5.QtWebEngineWidgets import QWebEngineView
    
    class Render(QWebEngineView):
        def __init__(self, html):
            self.html = None
            self.app = QApplication(sys.argv)
            QWebEngineView.__init__(self)
            self.loadFinished.connect(self._loadFinished)
            self.setHtml(html)
            self.app.exec_()

        def _loadFinished(self, result):
            # This is an async call, you need to wait for this
            # to be called before closing the app
            self.page().toHtml(self.callable)

        def callable(self, data):
            self.html = data
            # Data has been stored, it's safe to quit the app
            self.app.quit()
            
            return Render(source_html).html

import requests
#url = 'http://webscraping.com'  
#url='http://www.amazon.com'
url='https://www.ncbi.nlm.nih.gov/nuccore/CP002059.1'
sample_html = requests.get(url).text
print(sample_html)
print(render(sample_html))

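EDIT: Thanks for the feedback; I have incorporated it into the code. But now there is an error, and the script hangs until I kill the Python launcher, which then causes a segfault.

Here is the revised code: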

def render(source_url):
    """Fully render HTML, JavaScript and all."""

    import sys
    from PyQt5.QtWidgets import QApplication
    from PyQt5.QtCore import QUrl
    from PyQt5.QtWebEngineWidgets import QWebEngineView

    class Render(QWebEngineView):
        def __init__(self, url):
            self.html = None
            self.app = QApplication(sys.argv)
            QWebEngineView.__init__(self)
            self.loadFinished.connect(self._loadFinished)
            # self.setHtml(html)
            self.load(QUrl(url))
            self.app.exec_()

        def _loadFinished(self, result):
            # This is an async call, you need to wait for this
            # to be called before closing the app
            self.page().toHtml(self._callable)

        def _callable(self, data):
            self.html = data
            # Data has been stored, it's safe to quit the app
            self.app.quit()

    return Render(source_url).html

# url = 'http://webscraping.com'
# url='http://www.amazon.com'
url = "https://www.ncbi.nlm.nih.gov/nuccore/CP002059.1"
print(render(url))

This throws the following error:

$ python3 -tt fees-pkg-v2.py
Traceback (most recent call last):
  File "fees-pkg-v2.py", line 30, in _callable
    self.html = data
AttributeError: 'method' object has no attribute 'html'
None   (hangs here until force-quit python launcher)
Segmentation fault: 11
$

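I have started reading up on Python classes to better understand what I'm doing (always a good thing). I think the environment may be the problem (OSX Yosemite, Python 3.4.3, Qt5.4.1, sip-4.16.6). Any other suggestions?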

- Russ

It looks like the return statement of the render function is not indented correctly; it should be at the same level as the class above it. - PRMoureu
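
The indentation point can be reproduced without Qt. The following is only a minimal sketch: the Holder class and both function names are hypothetical stand-ins for the Render class in the question. When the return sits inside a method, the outer function defines the class and then falls off the end, so Python returns None.

```python
def render_broken(value):
    """Mirrors the first listing: the return is indented into a method."""

    class Holder:
        def __init__(self, data):
            self.data = data

        def store(self, data):
            self.data = data
            # Indented too far: this line belongs to store(), which is
            # never called, so render_broken() never reaches it.
            return Holder(value).data

    # No return statement at this level, so the function returns None.


def render_fixed(value):
    """Same class, but the return is at the function's own level."""

    class Holder:
        def __init__(self, data):
            self.data = data

        def store(self, data):
            self.data = data

    # Aligned with the class statement, as the comment suggests.
    return Holder(value).data


print(render_broken("<html>"))  # None
print(render_fixed("<html>"))   # <html>
```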

Let QWebEngineView do all the work for you; there is no need for requests. QWebEngineView has a load method that accepts a URL. - Maurice Meyer

Amazon will detect that you are a scraper on the first request unless you fake the request headers. You can use something like https://pypi.python.org/pypi/fake-useragent for that. - Lorinc Nyitrai
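
As a rough sketch of what faking the request headers can look like, the snippet below attaches a browser-like User-Agent to a request using only the standard library. The User-Agent string and the build_request helper are assumptions made up for illustration; they do not come from fake-useragent, and a site may still detect automated access by other means.

```python
from urllib.request import Request

# Hypothetical browser-like User-Agent string, chosen only for illustration.
BROWSER_USER_AGENT = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
)


def build_request(url, user_agent=BROWSER_USER_AGENT):
    """Return a Request object that carries a browser-like User-Agent."""
    return Request(url, headers={"User-Agent": user_agent})


request = build_request("http://www.amazon.com")
# urllib stores header names capitalized as "User-agent".
print(request.get_header("User-agent"))
```

No network request is made here. With requests, the same headers dictionary can be passed as requests.get(url, headers=...).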

1 Answer

- Russ · Answer 1

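The problem was the environment. I had installed Python 3.4.3, Qt5.4.1 and sip-4.16.6 manually and probably messed something up. After installing Anaconda, the script started working. Thanks again.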