使用Python从格式化的PDF中提取文本

3

我需要解析一个经过格式化的PDF文件以获取某些字段。PDF文件在这里。我需要解析的内容在这个 imgur 图片中显示。我已经使用了PyPDF2获取文本,但它返回的是没有任何格式的纯文本。

import PyPDF2
pdfFileObj = open('GPO-PLUMBOOK-2000-4-1.pdf','rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
pageObj = pdfReader.getPage(0)
print(pageObj.extractText())

我得到的输出如下:
LEGISLATIVE BRANCHLocationPosition TitleName of IncumbentPayPlanType ofAppt.Level,Grade, orPayTenureExpiresARCHITECT OF THE CAPITOLAlan M. HantmanWashington, DCArchitect of the Capitol10 years02/02/07IIIEXPASLocationPosition TitleName of IncumbentPayPlanType ofAppt.Level,Grade, orPayTenureExpiresGENERAL ACCOUNTING OFFICEDavid M. WalkerWashington, DCComptroller General of the United States11/09/1315 years$141,300OTPASVacant  Do...........Deputy Comptroller General of the United States..................OTXSLocationPosition TitleName of IncumbentPayPlanType ofAppt.Level,Grade, orPayTenureExpiresGOVERNMENT PRINTING OFFICEMichael F. DiMarioWashington, DCPublic Printer............IIIEXPASRobert T. Mansker  Do...........Deputy Public Printer............IVEXXSFrancis J. Buckley, Jr.  Do...........Superintendent of Documents..................SLXSRobert G. Andary  Do...........Inspector General..................SLXSMary Beth Lawler  Do...........Staff Assistant............14OTSCLocationPosition TitleName of IncumbentPayPlanType ofAppt.Level,Grade, orPayTenureExpiresLIBRARY OF CONGRESSLIBRARIAN OF CONGRESSJames H. BillingtonWashington, DCLibrarian of Congress............IIIEXPASLIBRARY OF CONGRESS TRUST FUND BOARDJames H. Billington  Do...........Chairman (Ex-Officio)..................WCPASTed Stevens  Do...........Chairman of the Joint Committee of the Library (Ex-Officio)..................WCXSLawrence Summers  Do...........Member (Ex-Officio), Secretary of the Treasury..................WCPASDonald Hammond  Do...........Member (Designee for the Secretary of the Treasurer)..................WCXSCeil Pulitzer  Do...........Member5 years03/23/03......WCPASNajeeb Halaby  Do...........Member5 years08/31/05......WCPASJohn Kluge  Do...........Member5 years03/10/03......WCXSWayne Berman  Do...........Member5 years12/22/01......WCXSEdwin Cox  Do...........Member5 years03/31/04......WCXSJohn Henry  Do...........Member5 years12/22/03......WCXSDonald Jones  Do...........Member5 years10/08/02......WCXSJulie Finley  Do...........Member5 years06/29/01......WCXSBernard Rappaport  Do...........Member5 years12/22/01......WCXS(1)

我需要分离数据,例如在位置列下的数据等。

1个回答

0

看一下 tabula 库(这里是 github)。它会返回一个 pandas 数据帧。

df = tabula.read_pdf("/home/michael/Downloads/GPO-PLUMBOOK-2000-4-1.pdf", pages=1)
df.dropna(inplace=True)
print(df[:2])

您还可以调整使用 pdf 中的哪一部分,如果需要阅读其他表格或想节省时间。这样,您只需将所有 pdf 表格读入,然后将我的数据切片到所需的输出。


网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接