Convert .doc to .pdf using Python

Question

77

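I have been tasked with converting a large number of .doc files to .pdf, and the only way my supervisor wants me to do it is through MSWord 2010. I know it should be possible to automate this with Python COM automation. The only problem is that I don't know how or where to start. I tried searching for tutorials but couldn't find anything useful (maybe I did, but I don't know what I'm looking for).

Right now I'm reading through this site. I'm not sure how useful it is going to be.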

- nik

14 Answers

51

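You can use the docx2pdf Python package to bulk convert docx to pdf. It can be used both as a CLI and as a Python library. It requires Microsoft Office to be installed, and uses COM on Windows and AppleScript (JXA) on macOS.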

from docx2pdf import convert

convert("input.docx")
convert("input.docx", "output.pdf")
convert("my_docx_folder/")

pip install docx2pdf
docx2pdf input.docx output.pdf

Disclaimer: I wrote the docx2pdf package. https://github.com/AlJohri/docx2pdf

- Al Johri

12

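Unfortunately, it requires Microsoft Office to be installed and therefore only works on Windows and macOS. - Al Johri

@AlJohri, take a look at https://michalzalecki.com/converting-docx-to-pdf-using-python/; that solution works on both Windows and Linux. Running on Linux is a must, since most deployment servers use Linux. - abdelhedi hlel

The question asks for a solution for doc, and docx2pdf does not work on the doc format... - diek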

24

I tried many solutions, but none of them worked well on a Linux distribution.

I recommend this solution:

import sys
import subprocess
import re


def convert_to(folder, source, timeout=None):
    args = [libreoffice_exec(), '--headless', '--convert-to', 'pdf', '--outdir', folder, source]

    process = subprocess.run(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE, timeout=timeout)
    filename = re.search('-> (.*?) using filter', process.stdout.decode())

    return filename.group(1)


def libreoffice_exec():
    # TODO: Provide support for more platforms
    if sys.platform == 'darwin':
        return '/Applications/LibreOffice.app/Contents/MacOS/soffice'
    return 'libreoffice'

Then call the function:

result = convert_to('TEMP Directory',  'Your File', timeout=15)
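
Note that the regular expression in `convert_to` yields `None` when LibreOffice prints nothing that matches, for example when the conversion fails, and `filename.group(1)` then raises `AttributeError`. Below is a minimal sketch of a more defensive parser. `parse_converted_path` is a hypothetical helper name, and the sample line only illustrates the shape of output the regex expects; it is not captured LibreOffice output.

```python
import re

# Same pattern as in convert_to: the text between "-> " and " using filter".
_OUTPUT_PATTERN = re.compile(r'-> (.*?) using filter')


def parse_converted_path(stdout_text):
    """Return the output path reported on stdout, or None if nothing matched."""
    match = _OUTPUT_PATTERN.search(stdout_text)
    return match.group(1) if match else None


# Illustrative line shaped like what the regex expects (an assumption, not real output).
sample = 'convert /tmp/in/report.doc -> /tmp/out/report.pdf using filter : writer_pdf_Export'
print(parse_converted_path(sample))
print(parse_converted_path(''))
```

The caller can then check for `None` and report a failed conversion instead of crashing on the regex result.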

Resources:

https://michalzalecki.com/converting-docx-to-pdf-using-python/

- abdelhedi hlel

1

This isn't using Python; it just runs the LibreOffice executable from a Python script. - not2qubit

Thanks for the solution, sir. It even works on Google Colab, so you can do it on the go. - florianc63

17

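I have spent half a day on this problem, so I think I should share some of my experience with it. Steven's answer is right, but it fails on my computer. There are two key points to fix it:

(1). When you first create the 'Word.Application' object, make it (the Word app) visible before opening any document. (I can't actually explain why this works. If I don't do this on my computer, the program crashes when I try to open a document in invisible mode, and the 'Word.Application' object is then deleted by the OS.)

(2). After doing (1), the program sometimes works well but may still fail often. The crash error "COMError: (-2147418111, 'Call was rejected by callee.', (None, None, None, 0, None))" indicates that the COM server may not be able to respond quickly enough. So I add a delay before trying to open a document.

After these two steps, the program works reliably with no more failures. The demo code is below. If you run into the same problems, try following these two steps. Hope it helps.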

    import os
    import comtypes.client
    import time


    wdFormatPDF = 17


    # absolute path is needed
    # be careful about the slash '\', use '\\' or '/' or raw string r"..."
    in_file=r'absolute path of input docx file 1'
    out_file=r'absolute path of output pdf file 1'

    in_file2=r'absolute path of input docx file 2'
    out_file2=r'absolute path of output pdf file 2'

    # print out filenames
    print(in_file)
    print(out_file)
    print(in_file2)
    print(out_file2)


    # create COM object
    word = comtypes.client.CreateObject('Word.Application')
    # key point 1: make word visible before open a new document
    word.Visible = True
    # key point 2: wait for the COM Server to prepare well.
    time.sleep(3)

    # convert docx file 1 to pdf file 1
    doc=word.Documents.Open(in_file) # open docx file 1
    doc.SaveAs(out_file, FileFormat=wdFormatPDF) # conversion
    doc.Close() # close docx file 1
    word.Visible = False
    # convert docx file 2 to pdf file 2
    doc = word.Documents.Open(in_file2) # open docx file 2
    doc.SaveAs(out_file2, FileFormat=wdFormatPDF) # conversion
    doc.Close() # close docx file 2   
    word.Quit() # close Word Application
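
As an alternative to a fixed `time.sleep(3)`, the "call was rejected by callee" failure can also be handled by retrying the call after a short pause. The following is a minimal sketch of that idea and is not part of the original answer. `retry_call` and its parameters are hypothetical names; by default it catches any `Exception`, since the exact exception class depends on the COM library in use.

```python
import time


def retry_call(func, *args, attempts=5, delay=1.0, exceptions=(Exception,), **kwargs):
    """Call func(*args, **kwargs), retrying after `delay` seconds on failure.

    Re-raises the last exception if every attempt fails.
    """
    for attempt in range(1, attempts + 1):
        try:
            return func(*args, **kwargs)
        except exceptions:
            if attempt == attempts:
                raise
            time.sleep(delay)


# Hypothetical usage with the objects from the code above:
# doc = retry_call(word.Documents.Open, in_file, attempts=5, delay=1.0)
```

Retrying only waits as long as the COM server actually needs, whereas a fixed delay is either too short on a slow machine or wasted time on a fast one.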

- Yang

8

unoconv is a tool written in Python that runs OpenOffice as a headless daemon.

https://github.com/unoconv/unoconv

http://dag.wiee.rs/home-made/unoconv/

It works very well for doc, docx, ppt, pptx, xls and xlsx files.

It is very useful if you need to convert documents, or save/convert them to particular formats, on a server.
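
unoconv is documented as a command-line tool, so one way to drive it from a Python script is through `subprocess`. Here is a minimal sketch, assuming the `unoconv` executable is on the PATH. `build_unoconv_command` and `convert_with_unoconv` are hypothetical helper names, and only the `-f` (format) and `-o` (output) options are used.

```python
import subprocess


def build_unoconv_command(source, output=None, fmt='pdf'):
    """Build the argument list for an unoconv command-line call."""
    args = ['unoconv', '-f', fmt]
    if output is not None:
        args += ['-o', output]
    args.append(source)
    return args


def convert_with_unoconv(source, output=None, timeout=None):
    """Run unoconv; raises CalledProcessError on a non-zero exit status."""
    subprocess.run(build_unoconv_command(source, output), check=True, timeout=timeout)


# Show the command that would be executed for a single file.
print(build_unoconv_command('report.doc', 'report.pdf'))
```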

- lxx

4

Could you include some sample code showing how to do it from a Python script (import unoconv unoconv.dosomething(...))? The documentation only shows how to do it from the command line. - Basj

1

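Note that unoconv has a rewrite called "Unoserver": https://github.com/unoconv/unoserver/ We have been running Unoserver successfully in production, and it is now the recommended solution. Unoserver does not have all of unoconv's features; which ones it gains depends on what people need and whether someone wants to implement them. Until Unoserver has all the major features people need, unoconv is in bug-fix mode and will see no major changes... I think I'm still going to go with unoconv. - Att Righ

A heads-up for others: I ran into problems with unoconv. The approach I ended up with (which works well on Linux and in Docker) was to call LibreOffice directly, as described in this answer. - Att Righ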

7

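As an alternative to the SaveAs function, you could also use ExportAsFixedFormat, which gives you access to the PDF options dialog you would normally see in Word. With it you can specify bookmarks and other document properties.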

doc.ExportAsFixedFormat(OutputFileName=pdf_file,
    ExportFormat=17, #17 = PDF output, 18=XPS output
    OpenAfterExport=False,
    OptimizeFor=0,  #0=Print (higher res), 1=Screen (lower res)
    CreateBookmarks=1, #0=No bookmarks, 1=Heading bookmarks only, 2=bookmarks match word bookmarks
    DocStructureTags=True
    )

The full list of function arguments is: 'OutputFileName', 'ExportFormat', 'OpenAfterExport', 'OptimizeFor', 'Range', 'From', 'To', 'Item', 'IncludeDocProps', 'KeepIRM', 'CreateBookmarks', 'DocStructureTags', 'BitmapMissingFonts', 'UseISO19005_1', 'FixedFormatExtClassPtr'

- patrick

4

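It is worth noting that Steven's answer works, but if you use a for loop to export more than one file, make sure to place the CreateObject or Dispatch statement before the loop, since the object only needs to be created once. See my problem: Python win32com.client.Dispatch looping through Word documents and export to PDF; fails when next loop occurs.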
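
The point about creating the application object once can be sketched as follows. This is a minimal illustration, not code from the linked question. `convert_many` and `create_word` are hypothetical names, and the Word object is passed in through a factory so the same loop works with either `comtypes.client.CreateObject` or `win32com.client.Dispatch`.

```python
WD_FORMAT_PDF = 17  # same value as wdFormatPDF in the other answers


def convert_many(pairs, create_word):
    """Convert (in_file, out_file) pairs using a single Word instance.

    create_word is a zero-argument callable returning the Word application
    object, e.g. lambda: comtypes.client.CreateObject('Word.Application').
    """
    word = create_word()  # created once, before the loop
    try:
        for in_file, out_file in pairs:
            doc = word.Documents.Open(in_file)
            try:
                doc.SaveAs(out_file, FileFormat=WD_FORMAT_PDF)
            finally:
                doc.Close()  # close each document even if SaveAs fails
    finally:
        word.Quit()  # quit once, after the loop
```

On Windows with pywin32 installed, a call might look like `convert_many(pairs, lambda: win32com.client.Dispatch('Word.Application'))`, where `pairs` holds absolute input and output paths.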

- James N

3

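If you don't mind using PowerShell, have a look at this Hey, Scripting Guy! article. The code presented there can be adapted to use the wdFormatPDF value of the WdSaveFormat enumeration (see here). This blog article presents a different implementation of the same idea.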

- Bas Bossink

2

I'm a Linux/Unix user and lean more towards Python. But the PS script looks pretty simple and is exactly what I was looking for. Thanks :) - nik

3

I have modified it to support ppt as well. My solution supports all of the extensions listed below.

word_extensions = [".doc", ".odt", ".rtf", ".docx", ".dotm", ".docm"]
ppt_extensions = [".ppt", ".pptx"]

My solution: GitHub link

I modified the code from Docx2PDF.

- Mobasshir Bhuiya

2

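I tried the accepted answer but didn't like the bloated PDFs Word was producing, which were usually an order of magnitude larger than expected. After looking into how to disable the dialogs when using a virtual PDF printer, I came across Bullzip PDF Printer and was impressed by its features. It has now replaced the other virtual printers I used previously. You'll find a "free community edition" on their download page.

The COM API can be found here and a list of the usable settings can be found here. The settings are written to a "runonce" file, which is used for one print job only and then removed automatically. When printing multiple PDFs, we need to make sure one print job completes before starting another, so that the settings are applied correctly to each file.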

import os, re, time, datetime, win32com.client

def print_to_Bullzip(file):
    util = win32com.client.Dispatch("Bullzip.PDFUtil")
    settings = win32com.client.Dispatch("Bullzip.PDFSettings")
    settings.PrinterName = util.DefaultPrinterName      # make sure we're controlling the right PDF printer

    outputFile = re.sub(r"\.[^.]+$", ".pdf", file)     # raw strings avoid invalid-escape warnings
    statusFile = re.sub(r"\.[^.]+$", ".status", file)

    settings.SetValue("Output", outputFile)
    settings.SetValue("ConfirmOverwrite", "no")
    settings.SetValue("ShowSaveAS", "never")
    settings.SetValue("ShowSettings", "never")
    settings.SetValue("ShowPDF", "no")
    settings.SetValue("ShowProgress", "no")
    settings.SetValue("ShowProgressFinished", "no")     # disable balloon tip
    settings.SetValue("StatusFile", statusFile)         # created after print job
    settings.WriteSettings(True)                        # write settings to the runonce.ini
    util.PrintFile(file, util.DefaultPrinterName)       # send to Bullzip virtual printer

    # wait until print job completes before continuing
    # otherwise settings for the next job may not be used
    timestamp = datetime.datetime.now()
    while( (datetime.datetime.now() - timestamp).seconds < 10):
        if os.path.exists(statusFile) and os.path.isfile(statusFile):
            error = util.ReadIniString(statusFile, "Status", "Errors", '')
            if error != "0":
                raise IOError("PDF was created with errors")
            os.remove(statusFile)
            return
        time.sleep(0.1)
    raise IOError("PDF creation timed out")

- user2921789

- Steven · Accepted Answer

Here is a simple example using comtypes that converts a single file, with the input and output filenames given as command-line arguments:

import sys
import os
import comtypes.client

wdFormatPDF = 17

in_file = os.path.abspath(sys.argv[1])
out_file = os.path.abspath(sys.argv[2])

word = comtypes.client.CreateObject('Word.Application')
doc = word.Documents.Open(in_file)
doc.SaveAs(out_file, FileFormat=wdFormatPDF)
doc.Close()
word.Quit()

You could also use pywin32, which would be the same as the above except for:

import win32com.client

Then:

word = win32com.client.Dispatch('Word.Application')