Extracting text from a PDF

7

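I have a bunch of PDF files that I need to convert to TXT. Unfortunately, when I use one of the many available tools to do the conversion, all formatting is lost and the tabular data in the PDF gets jumbled. Is it possible to use Python to extract text from a PDF by specifying a position, etc.?

Thanks.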


Have you looked for a library suited to this? - John La Rooy
I haven't found any for reading them, but there are plenty for writing them. - Mridang Agarwalla
4 Answers

3

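A PDF does not contain tabular data unless it includes structured content. Some tools include heuristics that try to guess the data structure and put it back. I wrote a blog post explaining the problems with PDF text extraction at http://www.jpedal.org/PDFblog/2009/04/pdf-text/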
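
To illustrate the kind of heuristic mentioned above, here is a minimal sketch, not what any particular tool actually does. It assumes you already have text fragments with page coordinates as hypothetical (x, y, text) tuples, and it guesses table rows by grouping fragments whose y values are close together. The function name and the tolerance value are made up for illustration.

```python
def group_words_into_rows(words, y_tolerance=2.0):
    """Guess table rows from (x, y, text) fragments.

    Fragments whose y coordinate is within y_tolerance of the first
    fragment of the current row are treated as belonging to that row.
    """
    rows = []  # each entry: (row_y, [(x, text), ...])
    # PDF y coordinates grow upward, so sort descending to read top-down
    for x, y, text in sorted(words, key=lambda w: -w[1]):
        if rows and abs(rows[-1][0] - y) <= y_tolerance:
            rows[-1][1].append((x, text))
        else:
            rows.append((y, [(x, text)]))
    # Within each row, order the cells left to right
    return [
        [text for _, text in sorted(cells, key=lambda c: c[0])]
        for _, cells in rows
    ]


# Hypothetical fragments, as a layout-aware extractor might report them
fragments = [
    (50, 700.0, "Name"), (200, 700.5, "Qty"),
    (50, 680.0, "Bolt"), (200, 679.2, "12"),
]
table = group_words_into_rows(fragments)
```

A real document would need more than this (column detection, wrapped cells, rotated text), which is why the results of such guessing vary so much from file to file.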


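Is there a way to check whether a PDF is tagged with Adobe's structured content, as you wrote in your blog post? Thanks. - Mridang Agarwalla
You need to check whether the tags are present. - mark stephens
That link seems to be dead. Do you have a new URL? - Bill the Lizard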
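
On the question of checking for tags: in the PDF format, a tagged document normally has a /MarkInfo dictionary with /Marked set to true and a /StructTreeRoot entry in its document catalog. The sketch below only inspects a catalog that has already been loaded into a plain Python dict; the function name is hypothetical. With pdfminer, the catalog should be available as the `catalog` attribute of a PDFDocument, but its values may be indirect references that need resolving first (for example with `resolve1` from pdfminer.pdftypes), so treat that part as an assumption to verify.

```python
def tagging_hints(catalog):
    """Report whether a catalog-like dict carries the usual tagging markers.

    catalog is assumed to be a plain dict with string keys and already
    resolved values, e.g. {"MarkInfo": {"Marked": True}, ...}.
    """
    mark_info = catalog.get("MarkInfo") or {}
    return {
        # /MarkInfo << /Marked true >> declares Tagged PDF conventions
        "marked": bool(mark_info.get("Marked", False)),
        # /StructTreeRoot is the root of the logical structure tree
        "has_struct_tree": "StructTreeRoot" in catalog,
    }


tagged = tagging_hints({"MarkInfo": {"Marked": True}, "StructTreeRoot": {}})
untagged = tagging_hints({"Pages": {}})
```

Both markers being present is only a hint that the file is tagged; it says nothing about how good the tagging is.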

2
$ pdftotext -layout thingwithtablesinit.pdf

This will produce a text file named thingwithtablesinit.txt with the tables laid out correctly.
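
Since the question mentions a whole batch of PDFs, the same command can be driven from Python. This is a rough sketch that assumes the pdftotext executable is installed and on the PATH; the two function names are hypothetical helpers, not part of any library.

```python
import subprocess
from pathlib import Path


def pdftotext_command(pdf_path, txt_path):
    """Build the argument list for `pdftotext -layout <pdf> <txt>`."""
    return ["pdftotext", "-layout", str(pdf_path), str(txt_path)]


def convert_folder(folder):
    """Run pdftotext -layout on every *.pdf in folder.

    Returns the list of .txt paths that were written. Raises
    subprocess.CalledProcessError if pdftotext exits with an error.
    """
    written = []
    for pdf_path in sorted(Path(folder).glob("*.pdf")):
        txt_path = pdf_path.with_suffix(".txt")
        subprocess.run(pdftotext_command(pdf_path, txt_path), check=True)
        written.append(txt_path)
    return written


command = pdftotext_command("thingwithtablesinit.pdf", "thingwithtablesinit.txt")
```

Calling convert_folder on a directory of PDFs would write one .txt file next to each PDF.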

1

I had a similar problem and ended up using XPDF from http://www.foolabs.com/xpdf/. One of its tools is PDFtoText, but I guess it all depends on how the PDF was generated.


1
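I tried several approaches too. I used PyPDF and PDF Miner, and even saved as text from Acrobat, but none of them worked as well as xpdf's pdftotext with the -layout option. I'm not going to bother trying anything else. - chrisfs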

0

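As other answers have noted, extracting text from a PDF is not a straightforward task. However, there are Python libraries such as pdfminer (pdfminer3k for Python 3) that are reasonably effective.

The code snippet below shows a Python class that can be instantiated to extract text from a PDF. It works in most cases.

(Source: https://gist.github.com/vinovator/a46341c77273760aa2bb)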

# Python 2.7.6
# PdfAdapter.py

""" Reusable library to extract text from pdf file
Uses pdfminer library; For Python 3.x use pdfminer3k module
Below links have useful information on components of the program
https://euske.github.io/pdfminer/programming.html
http://denis.papathanasiou.org/posts/2010.08.04.post.html
"""


from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
# From PDFInterpreter import both PDFResourceManager and PDFPageInterpreter
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
# from pdfminer.pdfdevice import PDFDevice
# To raise exception whenever text extraction from PDF is not allowed
from pdfminer.pdfpage import PDFTextExtractionNotAllowed
from pdfminer.layout import LAParams, LTTextBox, LTTextLine
from pdfminer.converter import PDFPageAggregator
import logging

# Basic logging config: as library code, attach a NullHandler and leave
# handler setup to the calling application
log = logging.getLogger("pdfAdapter")
log.addHandler(logging.NullHandler())


class pdf_text_extractor:
    """ Modules overview:
     - PDFParser: fetches data from pdf file
     - PDFDocument: stores data parsed by PDFParser
     - PDFPageInterpreter: processes page contents from PDFDocument
     - PDFDevice: translates processed information from PDFPageInterpreter
        to whatever you need
     - PDFResourceManager: stores shared resources such as fonts or images
        used by both PDFPageInterpreter and PDFDevice
     - LAParams: parameters for the layout analyzer, which returns an LTPage
         object for each page in the PDF document
     - PDFPageAggregator: a PDFDevice that collects the LT object elements
         of each page
    """

    def __init__(self, pdf_file_path, password=""):
        """ Class initialization block.
        pdf_file_path - Full path of pdf including name
        password - If not passed, assumed to be empty
        """
        self.pdf_file_path = pdf_file_path
        self.password = password

    def getText(self):
        """ Algorithm:
        1) Transfer information from PDF file to PDF document object using
           parser
        2) Open the PDF file
        3) Parse the file using PDFParser object
        4) Assign the parsed content to PDFDocument object
        5) Now the information in this PDFDocument object has to be
           processed. For this we need PDFPageInterpreter, PDFDevice and
           PDFResourceManager
        6) Finally process the file page by page
        """

        # Open and read the pdf file in binary mode
        with open(self.pdf_file_path, "rb") as fp:

            # Create parser object to parse the pdf content
            parser = PDFParser(fp)

            # Store the parsed content in PDFDocument object
            document = PDFDocument(parser, self.password)

            # Check if document is extractable, if not abort
            if not document.is_extractable:
                raise PDFTextExtractionNotAllowed

            # Create PDFResourceManager object that stores shared resources
            # such as fonts or images
            rsrcmgr = PDFResourceManager()

            # Set parameters for layout analysis
            laparams = LAParams()

            # Create a PDFDevice object which translates interpreted
            # information into the desired format. PDFPageAggregator is the
            # device that exposes each page's LT object elements
            # device = PDFDevice(rsrcmgr)
            device = PDFPageAggregator(rsrcmgr, laparams=laparams)

            # Create interpreter object to process content from PDFDocument.
            # The interpreter needs the resource manager (for shared
            # resources) and the device
            interpreter = PDFPageInterpreter(rsrcmgr, device)

            # Collect the text pieces here and join them at the end
            extracted_parts = []

            # Now that we have everything needed to process a pdf document,
            # process it page by page
            for page in PDFPage.create_pages(document):
                # The interpreter processes the page stored in PDFDocument
                interpreter.process_page(page)
                # The device returns the layout built by the interpreter
                layout = device.get_result()
                # Out of the many LT objects within layout, we are
                # interested in LTTextBox and LTTextLine
                for lt_obj in layout:
                    if isinstance(lt_obj, (LTTextBox, LTTextLine)):
                        extracted_parts.append(lt_obj.get_text())

        # Return the text as UTF-8 encoded bytes (a plain str on Python 2)
        return "".join(extracted_parts).encode("utf-8")

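Note - there are other libraries, such as PyPDF2, that are well suited to transforming PDFs, for example merging PDF pages, or splitting or cropping specific pages out of a PDF.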
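
The class above concatenates every text box on the page. Since the question asks about extracting by position, one possible variation is to filter on the bbox attribute that pdfminer layout objects carry, a tuple (x0, y0, x1, y1) in page coordinates with the origin at the bottom-left. The following is only a sketch: text_in_region is a hypothetical helper, and the region values in the comment are made up for a US Letter page.

```python
def text_in_region(layout_objects, region):
    """Return the text of objects lying entirely inside region.

    layout_objects: iterable of objects with .bbox == (x0, y0, x1, y1)
        and, for text objects, a .get_text() method (as pdfminer's
        LTTextBox and LTTextLine have).
    region: (x0, y0, x1, y1) rectangle in the same page coordinates.
    """
    rx0, ry0, rx1, ry1 = region
    picked = []
    for obj in layout_objects:
        if not hasattr(obj, "get_text"):
            continue  # skip figures, lines, rectangles, etc.
        x0, y0, x1, y1 = obj.bbox
        if x0 >= rx0 and y0 >= ry0 and x1 <= rx1 and y1 <= ry1:
            picked.append(obj)
    # Read top to bottom (y grows upward), then left to right
    picked.sort(key=lambda o: (-o.bbox[3], o.bbox[0]))
    return "".join(o.get_text() for o in picked)


# Inside the page loop of getText, one could then write for example:
#     layout = device.get_result()
#     upper_half = text_in_region(layout, (0, 396, 612, 792))
```

Objects that straddle the region boundary are dropped here; whether to include partial overlaps is a design choice that depends on the documents being processed.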

