How do I convert a PDF file to a Word file using the Acrobat SDK?

7
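My .NET application needs to convert PDF documents to Word format programmatically. I evaluated several products and found that Acrobat X Pro offers a "Save As" option that can save a document in Word/Excel format. I tried using the Acrobat SDK but could not find proper documentation on where to start. I looked at their IAC samples but could not work out how to invoke the menu item and execute the Save As option.
3 Answers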

15

You can do this with Acrobat X Pro, but you need to call the JavaScript API from C#.

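 // Requires: using System; using System.Globalization; using System.Reflection; using Acrobat;
 AcroPDDoc pdfd = new AcroPDDoc();
 pdfd.Open(sourceDoc.FileFullPath);
 Object jsObj = pdfd.GetJSObject();
 // Take the type from the JavaScript object, since saveAs is invoked on jsObj
 Type jsType = jsObj.GetType();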
 //have to use acrobat javascript api because, acrobat
 object[] saveAsParam = { "newFile.doc", "com.adobe.acrobat.doc", "", false, false };
 jsType.InvokeMember("saveAs",BindingFlags.InvokeMethod | BindingFlags.Public | BindingFlags.Instance,null, jsObj, saveAsParam, CultureInfo.InvariantCulture);
Hope this helps.

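Hi, I have done the same thing... thanks for your answer. But the process seems to take quite a long time to complete. If I have to process 1000 files, it takes more than 5-6 hours... is there a faster way? - Jay Nirgudkar
I added pdfd.Close() at the end to unlock the file. - roeland
Thanks! Very useful. For anyone who wants to export to Excel, just change newFile.doc to newFile.xlsx and "com.adobe.acrobat.doc" to "com.adobe.acrobat.xlsx". - Mark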
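
The two comments above (closing the document, and switching the conversion ID per output format) can be combined. What follows is a minimal Python sketch, not part of the original answer. The names CONVERSION_IDS, conversion_id_for and convert are hypothetical, and only the three conversion IDs mentioned in this thread are listed. It assumes pd_doc is an already-opened AcroExch.PDDoc COM object.

```python
import os

# Conversion IDs mentioned in this thread; the helper names are hypothetical.
CONVERSION_IDS = {
    ".doc": "com.adobe.acrobat.doc",
    ".docx": "com.adobe.acrobat.docx",
    ".xlsx": "com.adobe.acrobat.xlsx",
}


def conversion_id_for(out_path):
    """Return the conversion ID that matches the output file extension."""
    ext = os.path.splitext(out_path)[1].lower()
    if ext not in CONVERSION_IDS:
        raise ValueError("no conversion ID known for %r" % ext)
    return CONVERSION_IDS[ext]


def convert(pd_doc, out_path):
    """Save an opened PDDoc to out_path and always close it afterwards."""
    try:
        conv_id = conversion_id_for(out_path)
        # Same JSObject call the answers in this thread use.
        pd_doc.GetJSObject().SaveAs(out_path, conv_id)
    finally:
        # Closing releases the lock on the source file even if saving fails.
        pd_doc.Close()
```

Putting Close() in a finally block means a failed conversion in the middle of a batch does not leave the source PDF locked.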

3
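I did something similar using WinPython x64 2.7.6.3 and Acrobat X Pro, using the JSObject interface to convert PDF to DOCX. It is essentially the same as jle's solution.
Here is the complete code for converting a set of PDFs to DOCX: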
# Finds all files under ROOT_INPUT_PATH matching INPUT_FILE_EXTENSION and tries to convert each one into ROOT_OUTPUT_PATH, keeping the input filename but replacing its extension with OUTPUT_FILE_EXTENSION
from win32com.client import Dispatch
from win32com.client.dynamic import ERRORS_BAD_CONTEXT

import winerror

# Try importing scandir and, if found, use its walk, which is much faster than the stock os.walk
try:
    from scandir import walk
except ImportError:
    from os import walk

import fnmatch

import sys
import os

ROOT_INPUT_PATH = None
ROOT_OUTPUT_PATH = None
INPUT_FILE_EXTENSION = "*.pdf"
OUTPUT_FILE_EXTENSION = ".docx"

def acrobat_extract_text(f_path, f_path_out, f_basename, f_ext):
    avDoc = Dispatch("AcroExch.AVDoc") # Connect to Adobe Acrobat

    # Open the input file (as a pdf)
    ret = avDoc.Open(f_path, f_path)
    assert(ret) # FIXME: Documentation says "-1 if the file was opened successfully, 0 otherwise", but this is a bool in practice?

    pdDoc = avDoc.GetPDDoc()

    dst = os.path.join(f_path_out, ''.join((f_basename, f_ext)))

    # Adobe documentation says "For that reason, you must rely on the documentation to know what functionality is available through the JSObject interface. For details, see the JavaScript for Acrobat API Reference"
    jsObject = pdDoc.GetJSObject()

    # Here you can save as many other types by using, for instance: "com.adobe.acrobat.xml"
    jsObject.SaveAs(dst, "com.adobe.acrobat.docx") # NOTE: If you want to save the file as a .doc, use "com.adobe.acrobat.doc"

    pdDoc.Close()
    avDoc.Close(True) # We want this to close Acrobat, as otherwise Acrobat will refuse to process any further files once a certain threshold of open files is reached (for example 50 PDFs)
    del pdDoc

if __name__ == "__main__":
    assert(5 == len(sys.argv)), sys.argv # <script name>, <script_file_input_path>, <script_file_input_extension>, <script_file_output_path>, <script_file_output_extension>

    #$ python get.docx.from.multiple.pdf.py 'C:\input' '*.pdf' 'C:\output' '.docx' # NOTE: If you want to save the file as a .doc, use '.doc' instead of '.docx' here and ensure you use "com.adobe.acrobat.doc" in the jsObject.SaveAs call

    ROOT_INPUT_PATH = sys.argv[1]
    INPUT_FILE_EXTENSION = sys.argv[2]
    ROOT_OUTPUT_PATH = sys.argv[3]
    OUTPUT_FILE_EXTENSION = sys.argv[4]

    # tuples are of schema (path_to_file, filename)
    matching_files = ((os.path.join(_root, filename), os.path.splitext(filename)[0]) for _root, _dirs, _files in walk(ROOT_INPUT_PATH) for filename in fnmatch.filter(_files, INPUT_FILE_EXTENSION))

    # patch ERRORS_BAD_CONTEXT as per https://mail.python.org/pipermail/python-win32/2002-March/000265.html
    # (no global statement is needed here: this code runs at module level)
    ERRORS_BAD_CONTEXT.append(winerror.E_NOTIMPL)

    for filename_with_path, filename_without_extension in matching_files:
        print("Processing '{}'".format(filename_without_extension))
        acrobat_extract_text(filename_with_path, ROOT_OUTPUT_PATH, filename_without_extension, OUTPUT_FILE_EXTENSION)
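
One detail of the script above is worth checking before a long batch run. This is an observation from reading the code, not something the answer states. Every output file is written directly into ROOT_OUTPUT_PATH, so two PDFs with the same name in different subfolders map to the same destination. The sketch below mirrors the script's file matching without touching Acrobat and reports such clashes. plan_conversions and find_collisions are hypothetical helper names.

```python
import fnmatch
import os


def plan_conversions(root_in, root_out, pattern="*.pdf", out_ext=".docx"):
    """Return the (source, destination) pairs the batch script would process."""
    pairs = []
    for folder, _dirs, files in os.walk(root_in):
        for name in sorted(fnmatch.filter(files, pattern)):
            base = os.path.splitext(name)[0]
            # Same destination rule as the script: flat output folder.
            pairs.append((os.path.join(folder, name),
                          os.path.join(root_out, base + out_ext)))
    return pairs


def find_collisions(pairs):
    """Return the destinations that more than one source maps to."""
    seen, clashes = set(), set()
    for _src, dst in pairs:
        if dst in seen:
            clashes.add(dst)
        seen.add(dst)
    return sorted(clashes)
```

Calling find_collisions(plan_conversions(ROOT_INPUT_PATH, ROOT_OUTPUT_PATH)) before the conversion loop gives a list of output names that would be overwritten; an empty list means each input has its own output file.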

What is the alternative to the Dispatch module on a Mac? - matsuo_basho
I get the error (-2147221005, 'Invalid class string', None, None) when using AvDoc = Dispatch("AcroExch.AVDoc") in Python. Any help? - Prakash
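
The number in the error from the last comment is a COM HRESULT shown as a signed 32-bit integer. The small sketch below, which is not from the thread, converts it to the usual hexadecimal form. The value 0x800401F3 is CO_E_CLASSSTRING, meaning COM could not find the requested ProgID in the registry. A likely cause, though this is an assumption about the commenter's setup, is that a full Acrobat edition exposing AcroExch.AVDoc is not installed or not registered for the Python process in use. describe_hresult and KNOWN_HRESULTS are hypothetical names.

```python
# Only the HRESULT from the comment above is listed here.
KNOWN_HRESULTS = {
    0x800401F3: "CO_E_CLASSSTRING (invalid class string): the ProgID is not registered",
}


def describe_hresult(code):
    """Turn a signed HRESULT, as reported by pywin32, into hex plus a short description."""
    unsigned = code & 0xFFFFFFFF  # reinterpret the signed value as unsigned 32-bit
    name = KNOWN_HRESULTS.get(unsigned, "unknown HRESULT")
    return "0x%08X %s" % (unsigned, name)


print(describe_hresult(-2147221005))
# prints: 0x800401F3 CO_E_CLASSSTRING (invalid class string): the ProgID is not registered
```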

-2

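Adobe does not support converting PDF to Word unless you use their Acrobat PDF client. This means you cannot do the conversion with their SDK or by calling a command line. You can only do it manually.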


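The solutions posted by jle and by me show how to do this programmatically. If you have Acrobat X Pro, you can try my script; it should work out of the box once you install the free WinPython x64 2.7.6.3. - Subhobroto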

Content provided by Stack Overflow.