How to extract text from a PDF file?


I'm trying to extract the text contained in a PDF file using Python.

I'm using the PyPDF2 package (version 1.27.2), and have the following script:

import PyPDF2

with open("sample.pdf", "rb") as pdf_file:
    read_pdf = PyPDF2.PdfFileReader(pdf_file)
    number_of_pages = read_pdf.getNumPages()
    page = read_pdf.pages[0]
    page_content = page.extractText()
print(page_content)

When I run the code, the output I get is different from the text in the PDF document:

 ! " # $ % # $ % &% $ &' ( ) * % + , - % . / 0 1 ' * 2 3% 4
5
 ' % 1 $ # 2 6 % 3/ % 7 / ) ) / 8 % &) / 2 6 % 8 # 3" % 3" * % 31 3/ 9 # &)
%

How can I extract the text as it appears in the PDF document?


Please use a good PDF viewer; if possible, copy the text with Adobe's standard Acrobat Reader. Do you get the same result? The difference is not in the text but in the font: the character codes are mapped to other values. Not all PDFs contain the data needed to recover from this. - Jongware
I tried another document and it worked. Yes, it seems the problem is with the PDF itself. - Simplicity
That PDF contains a character CMap table, so the limitations and workarounds discussed in this thread are relevant - https://dev59.com/VVHTa4cB1Zd3GeqPV_Tm. - dwarring
The PDF does contain a proper CMAP, so converting the ad-hoc character mapping back to plain text is straightforward. However, additional processing is needed to recover the correct text order. Mac OS X's Quartz PDF renderer is a tricky tool! In its original rendering order I got "m T’h iuss iisn ga tosam fopllloew DalFo dnogc wumithe ntht eI tutorial". Only after sorting by x coordinate did I get the far more likely "This is a sample PDF document I’m using to follow along with the tutorial" (see the re-ordering sketch after these comments). - Jongware
PyPDF2 adds random whitespace between/inside words, which is very hard to deal with. - YuMei
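
A minimal sketch of the coordinate-based re-ordering Jongware describes (not from the original comments), assuming pdfminer.six is installed and "sample.pdf" is a placeholder file name:

# Recover reading order by sorting positioned text lines (pdfminer.six assumed).
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer, LTTextLine

lines = []
for page_index, page_layout in enumerate(extract_pages("sample.pdf")):
    for element in page_layout:
        if isinstance(element, LTTextContainer):
            for text_line in element:
                if isinstance(text_line, LTTextLine):
                    x0, y0, x1, y1 = text_line.bbox
                    # Sort by page, then top-to-bottom (PDF y grows upward), then left-to-right.
                    lines.append((page_index, -round(y1, 1), round(x0, 1), text_line.get_text()))

lines.sort()
print("".join(text for _, _, _, text in lines))
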
34 Answers


The order is not correct. - Vikas Viki

I'm adding code to accomplish this task; it works well for me:
# This works in python 3
# required python packages
# tabula-py==1.0.0
# PyPDF2==1.26.0
# Pillow==4.0.0
# pdfminer.six==20170720

import os
import shutil
import struct  # needed by tiff_header_for_CCITT below
import warnings
from io import StringIO

import requests
import tabula
from PIL import Image
from PyPDF2 import PdfFileWriter, PdfFileReader
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage

warnings.filterwarnings("ignore")


def download_file(url):
    local_filename = url.split('/')[-1]
    local_filename = local_filename.replace("%20", "_")
    r = requests.get(url, stream=True)
    print(r)
    with open(local_filename, 'wb') as f:
        shutil.copyfileobj(r.raw, f)

    return local_filename


class PDFExtractor():
    def __init__(self, url):
        self.url = url

    # Split the source PDF into one file per page
    def break_pdf(self, filename, start_page=-1, end_page=-1):
        pdf_reader = PdfFileReader(open(filename, "rb"))
        # Reading each pdf one by one
        total_pages = pdf_reader.numPages
        if start_page == -1:
            start_page = 0
        elif start_page < 1 or start_page > total_pages:
            return "Start Page Selection Is Wrong"
        else:
            start_page = start_page - 1

        if end_page == -1:
            end_page = total_pages
        elif end_page < 1 or end_page > total_pages:
            return "End Page Selection Is Wrong"
        else:
            end_page = end_page

        for i in range(start_page, end_page):
            output = PdfFileWriter()
            output.addPage(pdf_reader.getPage(i))
            with open(str(i + 1) + "_" + filename, "wb") as outputStream:
                output.write(outputStream)

    def extract_text_algo_1(self, file):
        pdf_reader = PdfFileReader(open(file, 'rb'))
        # creating a page object
        pageObj = pdf_reader.getPage(0)

        # extracting extract_text from page
        text = pageObj.extractText()
        text = text.replace("\n", "").replace("\t", "")
        return text

    def extract_text_algo_2(self, file):
        pdfResourceManager = PDFResourceManager()
        retstr = StringIO()
        la_params = LAParams()
        device = TextConverter(pdfResourceManager, retstr, codec='utf-8', laparams=la_params)
        fp = open(file, 'rb')
        interpreter = PDFPageInterpreter(pdfResourceManager, device)
        password = ""
        max_pages = 0
        caching = True
        page_num = set()

        for page in PDFPage.get_pages(fp, page_num, maxpages=max_pages, password=password, caching=caching,
                                      check_extractable=True):
            interpreter.process_page(page)

        text = retstr.getvalue()
        text = text.replace("\t", "").replace("\n", "")

        fp.close()
        device.close()
        retstr.close()
        return text

    def extract_text(self, file):
        text1 = self.extract_text_algo_1(file)
        text2 = self.extract_text_algo_2(file)

        if len(text2) > len(str(text1)):
            return text2
        else:
            return text1

    def extract_table(self, file):

        # Read pdf into DataFrame
        try:
            df = tabula.read_pdf(file, output_format="csv")
        except:
            print("Error Reading Table")
            return

        print("\nPrinting Table Content: \n", df)
        print("\nDone Printing Table Content\n")

    def tiff_header_for_CCITT(self, width, height, img_size, CCITT_group=4):
        tiff_header_struct = '<' + '2s' + 'h' + 'l' + 'h' + 'hhll' * 8 + 'h'
        return struct.pack(tiff_header_struct,
                           b'II',  # Byte order indication: little-endian
                           42,  # Version number (always 42)
                           8,  # Offset to first IFD
                           8,  # Number of tags in IFD
                           256, 4, 1, width,  # ImageWidth, LONG, 1, width
                           257, 4, 1, height,  # ImageLength, LONG, 1, height
                           258, 3, 1, 1,  # BitsPerSample, SHORT, 1, 1
                           259, 3, 1, CCITT_group,  # Compression, SHORT, 1, 4 = CCITT Group 4 fax encoding
                           262, 3, 1, 0,  # PhotometricInterpretation, SHORT, 1, 0 = WhiteIsZero
                           273, 4, 1, struct.calcsize(tiff_header_struct),  # StripOffsets, LONG, 1, length of header
                           278, 4, 1, height,  # RowsPerStrip, LONG, 1, height
                           279, 4, 1, img_size,  # StripByteCounts, LONG, 1, size of the image data
                           0  # last IFD
                           )

    def extract_image(self, filename):
        number = 1
        pdf_reader = PdfFileReader(open(filename, 'rb'))

        for i in range(0, pdf_reader.numPages):

            page = pdf_reader.getPage(i)

            try:
                xObject = page['/Resources']['/XObject'].getObject()
            except:
                print("No XObject Found")
                continue

            for obj in xObject:

                try:

                    if xObject[obj]['/Subtype'] == '/Image':
                        size = (xObject[obj]['/Width'], xObject[obj]['/Height'])
                        data = xObject[obj]._data
                        if xObject[obj]['/ColorSpace'] == '/DeviceRGB':
                            mode = "RGB"
                        else:
                            mode = "P"

                        image_name = filename.split(".")[0] + str(number)

                        print(xObject[obj]['/Filter'])

                        if xObject[obj]['/Filter'] == '/FlateDecode':
                            data = xObject[obj].getData()
                            img = Image.frombytes(mode, size, data)
                            img.save(image_name + "_Flate.png")
                            # save_to_s3(imagename + "_Flate.png")
                            print("Image_Saved")

                            number += 1
                        elif xObject[obj]['/Filter'] == '/DCTDecode':
                            img = open(image_name + "_DCT.jpg", "wb")
                            img.write(data)
                            # save_to_s3(imagename + "_DCT.jpg")
                            img.close()
                            number += 1
                        elif xObject[obj]['/Filter'] == '/JPXDecode':
                            img = open(image_name + "_JPX.jp2", "wb")
                            img.write(data)
                            # save_to_s3(imagename + "_JPX.jp2")
                            img.close()
                            number += 1
                        elif xObject[obj]['/Filter'] == '/CCITTFaxDecode':
                            if xObject[obj]['/DecodeParms']['/K'] == -1:
                                CCITT_group = 4
                            else:
                                CCITT_group = 3
                            width = xObject[obj]['/Width']
                            height = xObject[obj]['/Height']
                            data = xObject[obj]._data  # sorry, getData() does not work for CCITTFaxDecode
                            img_size = len(data)
                            tiff_header = self.tiff_header_for_CCITT(width, height, img_size, CCITT_group)
                            img_name = image_name + '_CCITT.tiff'
                            with open(img_name, 'wb') as img_file:
                                img_file.write(tiff_header + data)

                            # save_to_s3(img_name)
                            number += 1
                except:
                    continue

        return number

    def read_pages(self, start_page=-1, end_page=-1):

        # Downloading file locally
        downloaded_file = download_file(self.url)
        print(downloaded_file)

        # breaking PDF into number of pages in diff pdf files
        self.break_pdf(downloaded_file, start_page, end_page)

        # creating a pdf reader object
        pdf_reader = PdfFileReader(open(downloaded_file, 'rb'))

        # Reading each pdf one by one
        total_pages = pdf_reader.numPages

        if start_page == -1:
            start_page = 0
        elif start_page < 1 or start_page > total_pages:
            return "Start Page Selection Is Wrong"
        else:
            start_page = start_page - 1

        if end_page == -1:
            end_page = total_pages
        elif end_page < 1 or end_page > total_pages:
            return "End Page Selection Is Wrong"
        else:
            end_page = end_page

        for i in range(start_page, end_page):
            # creating a page based filename
            file = str(i + 1) + "_" + downloaded_file

            print("\nStarting to Read Page: ", i + 1, "\n -----------===-------------")

            file_text = self.extract_text(file)
            print(file_text)
            self.extract_image(file)

            self.extract_table(file)
            os.remove(file)
            print("Stopped Reading Page: ", i + 1, "\n -----------===-------------")

        os.remove(downloaded_file)


# I have tested on these 3 pdf files
# url = "http://s3.amazonaws.com/NLP_Project/Original_Documents/Healthcare-January-2017.pdf"
url = "http://s3.amazonaws.com/NLP_Project/Original_Documents/Sample_Test.pdf"
# url = "http://s3.amazonaws.com/NLP_Project/Original_Documents/Sazerac_FS_2017_06_30%20Annual.pdf"
# creating the instance of class
pdf_extractor = PDFExtractor(url)

# Getting desired data out
pdf_extractor.read_pages(15, 23)


You can download the latest version of tika-app-xxx.jar from here.

Then place this .jar file in the same folder as your Python script.

Next, insert the following code into your script:

import os
import os.path

tika_dir=os.path.join(os.path.dirname(__file__),'<tika-app-xxx>.jar')

def extract_pdf(source_pdf:str,target_txt:str):
    os.system('java -jar '+tika_dir+' -t {} > {}'.format(source_pdf,target_txt))

Advantages of this method:

Fewer dependencies. A single .jar file is easier to manage than a Python package.

Multi-format support. The source_pdf path can be a document of any type (.doc, .html, .odt, etc.).

Up to date. tika-app.jar is always released earlier than the corresponding version of the tika Python package.

Stability. It is more stable and better maintained than PyPDF (backed by Apache).

Disadvantages:

It requires a headless JRE (jre-headless).
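
A quick usage sketch of the extract_pdf helper above (the file names are placeholders):

# Convert one PDF to a plain-text file via tika-app; file names are placeholders.
extract_pdf('sample.pdf', 'sample.txt')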


Not a Python solution at all. If you are going to recommend this, you should build a Python package and let people import it. Running Java code from the command line inside Python is not recommended. - Michael Tamillow
@MichaelTamillow, if this were code to be uploaded to PyPI, I admit it would not be a good idea. But for a throwaway Python script with a shebang, it's not that bad, right? - pah8J
Well, the question title doesn't say "Python", so I suppose stating "here is how to do it in Java" would be more acceptable than this. Technically, in Python you can do whatever you want. That's why it is both great and terrible. Throwaway use is a bad habit. - Michael Tamillow


Try borb, a pure-Python PDF library:

import typing  
from borb.pdf.document import Document  
from borb.pdf.pdf import PDF  
from borb.toolkit.text.simple_text_extraction import SimpleTextExtraction  


def main():

    # variable to hold Document instance
    doc: typing.Optional[Document] = None  

    # this implementation of EventListener handles text-rendering instructions
    l: SimpleTextExtraction = SimpleTextExtraction()  

    # open the document, passing along the array of listeners
    with open("input.pdf", "rb") as in_file_handle:  
        doc = PDF.loads(in_file_handle, [l])  
  
    # were we able to read the document?
    assert doc is not None  

    # print the text on page 0
    print(l.get_text(0))  

if __name__ == "__main__":
    main()


How do you get the total number of pages of the document with borb? (Or how do you get the full text directly?) - Martin Thoma

You may run into problems using PyPDF2 in Anaconda on Windows with PDF files that have a non-standard structure or Unicode characters. If you need to open and read many PDF files, I recommend the following code: the text of all PDF files in the relative path .//pdfs// folder will be stored in the list pdf_text_list.
from tika import parser
import glob

def read_pdf(filename):
    text = parser.from_file(filename)
    return(text)


all_files = glob.glob(".\\pdfs\\*.pdf")
pdf_text_list=[]
for i,file in enumerate(all_files):
    text=read_pdf(file)
    pdf_text_list.append(text['content'])

print(pdf_text_list)

A more robust way, whether you have multiple PDF files or just one!
import os
from PyPDF2 import PdfFileReader

mydir = "path/to/your/pdfs"  # specify the path to the directory where your PDF(s) are

# Write the first-page text of every PDF in the directory to myfile.txt
with open("myfile.txt", "w") as out_file:
    for arch in os.listdir(mydir):
        if not arch.lower().endswith(".pdf"):
            continue
        archpath = os.path.join(mydir, arch)
        with open(archpath, "rb") as pdfFileObj:
            pdfReader = PdfFileReader(pdfFileObj)
            pageObj = pdfReader.getPage(0)
            ley = pageObj.extractText()
            out_file.write(ley)

All of the PyPDF derivatives stopped being updated in 2021. Please treat this answer as outdated. - harmonica141
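
For reference, a minimal sketch (not part of the answer above) using the maintained successor package pypdf, assuming pypdf 3.x or newer is installed and "sample.pdf" is a placeholder file name:

# Modern pypdf equivalent (assumption: pip install pypdf).
from pypdf import PdfReader

reader = PdfReader("sample.pdf")       # placeholder file name
print(len(reader.pages))               # number of pages
print(reader.pages[0].extract_text())  # text of the first page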


To extract text from a PDF, use the following code:

import PyPDF2
pdfFileObj = open('mypdf.pdf', 'rb')

pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

print(pdfReader.numPages)

pageObj = pdfReader.getPage(0)

a = pageObj.extractText()

print(a)

PyPDF2 / PyPDF3 / PyPDF4 are all unmaintained. Please use pymupdf. For details, see: [https://dev59.com/51sW5IYBdhLWcg3wi32j#63518022] - Martin Thoma
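
A minimal sketch of the PyMuPDF approach mentioned in this comment, assuming the pymupdf package is installed and "sample.pdf" is a placeholder file name:

import fitz  # PyMuPDF (assumption: pip install pymupdf)

doc = fitz.open("sample.pdf")                      # placeholder file name
text = "\n".join(page.get_text() for page in doc)  # concatenate the text of all pages
doc.close()
print(text)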


Camelot looks like a fairly powerful solution for extracting tables from PDFs in Python.

At first glance it seems to extract almost as accurately as the tabula-py package suggested by CreekGeek, which already far surpasses any other solution posted to date in terms of reliability, but it is reportedly much more configurable. In addition, it has its own accuracy indicator (results.parsing_report) and great debugging features.

Both Camelot and Tabula deliver the results as Pandas DataFrames, so it is easy to adjust the tables afterwards.

pip install camelot-py

(Not to be confused with the camelot package.)
import camelot

df_list = []
results = camelot.read_pdf("file.pdf", ...)
for table in results:
    print(table.parsing_report)
    df_list.append(table.df)

It can also export the results as CSV, JSON, HTML, or Excel.
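
A small sketch of that export step, assuming the results object from the snippet above (file names are placeholders):

# Export all detected tables at once (assumption: results comes from camelot.read_pdf above).
results.export("tables.csv", f="csv")   # f can also be "json", "html" or "excel"
# Or export a single table:
results[0].to_csv("first_table.csv")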

Camelot requires some dependencies to work.

NB: Since my input was very complex, with many different tables, I ended up using both Camelot and Tabula, depending on the table, to get the best results.



Goal: Extract text from a PDF

Required tools:

  1. Poppler for Windows: a wrapper around the pdftotext utility on Windows. For Anaconda: conda install -c conda-forge

  2. pdftotext: a utility that converts PDF to text.

Steps: Install Poppler. On Windows, add "xxx/bin/" to the environment PATH, then run pip install pdftotext

import pdftotext
 
# Load your PDF
with open("Target.pdf", "rb") as f:
    pdf = pdftotext.PDF(f)
 
# Save all text to a txt file.
with open('output.txt', 'w') as f:
    f.write("\n\n".join(pdf))
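
A note on the API (not part of the original answer): the pdf object behaves like a sequence of per-page strings, so individual pages can also be read directly:

# pdftotext.PDF acts as a sequence of page strings.
print(len(pdf))   # number of pages
print(pdf[0])     # text of the first page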


This dynamically creates a new worksheet for each PDF page, based on the number of pages in the document.

import PyPDF2 as p2
import xlsxwriter

pdfFileName = "sample.pdf"
pdfFile = open(pdfFileName, 'rb')
pdfread = p2.PdfFileReader(pdfFile)
number_of_pages = pdfread.getNumPages()
workbook = xlsxwriter.Workbook('pdftoexcel.xlsx')

for page_number in range(number_of_pages):
    print(f'Sheet{page_number}')
    pageinfo = pdfread.getPage(page_number)
    rawInfo = pageinfo.extractText().split('\n')

    row = 0
    column = 0
    worksheet = workbook.add_worksheet(f'Sheet{page_number}')

    for line in rawInfo:
        worksheet.write(row, column, line)
        row += 1
workbook.close()
