Python + Selenium + PhantomJS 实现 PDF 渲染

Question

Python + Selenium + PhantomJS 实现 PDF 渲染

pythonseleniumphantomjs

22

当使用Selenium和Python结合使用PhantomJS时，是否可以利用PhantomJS的渲染到PDF的功能？ (即通过Selenium在Python中模仿page.render('file.pdf')的行为)。

我知道这需要使用GhostDriver，而GhostDriver并没有真正支持打印功能。

如果有其他不依赖于Selenium的替代方案，请告诉我。

- Rejected

你看过Pypdf2吗？http://www.blog.pythonlibrary.org/tag/python-pdf-series/ - Amit

@Amit：我经常使用它，所以用得相当广泛。即使是Phaseit公司自己也说“PyPDF2对HTML一无所知”。它不能可靠地呈现任何HTML。 - Rejected

@Rejected 你需要在测试期间截取屏幕截图吗？还是只想加载页面并将其呈现为PDF？ - Jacob Swartwood

4个回答

1

你可以使用 selenium.selenium.capture_screenshot('file.png')，但这将给你一个png格式的屏幕截图而不是pdf。似乎没有办法将屏幕截图保存为pdf。

这是capture_screenshot的文档：http://selenium.googlecode.com/git/docs/api/py/selenium/selenium.selenium.html?highlight=screenshot#selenium.selenium.selenium.capture_screenshot

- KAtkinson

1

PDF是一个关键因素。出于多种原因，例如文本搜索、表单、嵌入式媒体等，我不能将其简化为简单的图像。 - Rejected

1

试过pdfkit了吗？它可以从html页面渲染PDF文件。

- moodh

我也研究过这个。PDFKit可以将HTML转换为PDF，但没有其他功能。在PDF之前进行内容分析以确定页面是否包含所需内容，可惜不可能实现。 - Rejected

是的，我也遇到了PDFKit的同样问题，我希望能够进行更高级的渲染，使用JS框架确实很麻烦.. :( - moodh

"内容分析以确定页面是否包含所需内容" -> 好吧，难道你不能自己进行内容分析，如果匹配，则直接使用pdfkit进行渲染。这就是我会做的方式。 - Jonathan

1

@Jonathan：那我不是渲染同一页，而是使用PDFKit重新获取和重新渲染的第二个检索。如果我转到PageA并动态生成内容，再次返回它意味着内容可能会更改。如果我只保存HTML并在本地打开进行转换，我会面临很多潜在问题（相对链接、热链接保护等）。 - Rejected

pdfkit可以从字符串中渲染.. @Rejected - EralpB

0

@rejected，我知道你提到不想使用子进程，但是...

理论上来说，你实际上可以利用子进程通信比你预期的要多。你可以采用 Ariya's stdin/stdout example，并将其扩展为一个相对通用的包装脚本。它可能首先接受要加载的页面，然后侦听（并执行）您在该页面上的测试操作。最终，您可以启动 .render 或甚至制作一个通用的错误处理捕获：

try {
  // load page & execute stdin commands
} catch (e) {
  page.render(page + '-error-state.pdf');
}

- Jacob Swartwood

通过标准输入接收的代码需要通过eval来执行，而从我的尝试经验来看，这既不安全也不可靠。除非我错了？ - Rejected

虽然你希望对输入进行谨慎处理（从可靠性角度考虑），但由于我假设你拥有该进程，所以你可能不必担心安全问题。 - Jacob Swartwood

你也可以为了更快地处理意外错误，将特定的命令列入白名单等。然而，我能想象到最好的情况是，你将可能在屏幕截图之前发生的测试（或其他逻辑）提取到一个单独的.js文件中，并将其加载到页面中（http://phantomjs.org/api/phantom/method/inject-js.html）。你可以让Python最多传递一个参数来加载特定的JS文件。 - Jacob Swartwood

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- MTuner · Accepted Answer

以下是使用Selenium和GhostDriver的特殊命令的解决方案（自GhostDriver 1.1.0和PhantomJS 1.9.6以来，应该可以使用。已经在PhantomJS 1.9.8上测试通过）：

#!/usr/bin/env python
# -*- coding: utf-8 -*-

"""Download a webpage as a PDF."""


from selenium import webdriver


def download(driver, target_path):
    """Download the currently displayed page to target_path."""
    def execute(script, args):
        driver.execute('executePhantomScript',
                       {'script': script, 'args': args})

    # hack while the python interface lags
    driver.command_executor._commands['executePhantomScript'] = ('POST', '/session/$sessionId/phantom/execute')
    # set page format
    # inside the execution script, webpage is "this"
    page_format = 'this.paperSize = {format: "A4", orientation: "portrait" };'
    execute(page_format, [])

    # render current page
    render = '''this.render("{}")'''.format(target_path)
    execute(render, [])


if __name__ == '__main__':
    driver = webdriver.PhantomJS('phantomjs')
    driver.get('http://stackoverflow.com')
    download(driver, "save_me.pdf")

可以参考我对相同问题的回答这里。