I want to use `pdfminer` (version 20140328) to extract text from a PDF file.
Here is the code I use for the extraction:

    import sys
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    from pdfminer.pdfpage import PDFPage
    from pdfminer.converter import XMLConverter, HTMLConverter, TextConverter
    from pdfminer.layout import LAParams
    from cStringIO import StringIO
    import urllib2

    def pdf_to_string(data):
        fp = StringIO(data)
        rsrcmgr = PDFResourceManager()
        retstr = StringIO()
        codec = 'utf-8'
        laparams = LAParams()
        device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
        # Create a PDF interpreter object.
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        # Process each page contained in the document.
        for page in PDFPage.get_pages(fp):
            interpreter.process_page(page)
        data = retstr.getvalue()
        return data

    pdf_url = "http://www.consilium.europa.eu/uedocs/cms_data/docs/pressdata/en/ecofin/140836.pdf"
    file_object = urllib2.urlopen(urllib2.Request(pdf_url)).read()
    string = pdf_to_string(file_object)

Here is a screenshot of the PDF:
![enter image description here](https://istack.dev59.com/4enqr.webp)
`pdfminer` does not read the text horizontally (a person, then that person's position) but column by column (all the persons first, then their respective positions):

    Belgium:
    Mr Koen GEENS
    Bulgaria:
    Mr Petar CHOBANOV
    Czech Republic:
    Mr Radek URBAN
    Minister for Finance, with responsibility for the Civil
    Service
    Minister for Finance
    Deputy Minister for Finance

How can I make `pdfminer` read the text horizontally?
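
To make the desired order concrete, here is a minimal sketch of the row-wise reading I am after. It does not call `pdfminer` at all: the `(x0, y0, text)` tuples are hypothetical positions I made up to resemble the table in the screenshot (with the wrapped "Civil Service" line simplified into one cell), and `rows_from_boxes` and `y_tolerance` are names invented for this illustration. The idea is only that pieces of text sharing roughly the same vertical position should end up on the same output line.

```python
# Hypothetical (x0, y0, text) positions, loosely modelled on the table above.
# These are NOT values produced by pdfminer; they are assumptions for illustration.
# In PDF coordinates y grows upward, so a larger y0 means higher on the page.
boxes = [
    (50, 700, "Belgium:"),
    (50, 685, "Mr Koen GEENS"),
    (300, 685, "Minister for Finance, with responsibility for the Civil Service"),
    (50, 660, "Bulgaria:"),
    (50, 645, "Mr Petar CHOBANOV"),
    (300, 645, "Minister for Finance"),
    (50, 620, "Czech Republic:"),
    (50, 605, "Mr Radek URBAN"),
    (300, 605, "Deputy Minister for Finance"),
]


def rows_from_boxes(boxes, y_tolerance=3):
    """Group boxes into rows by similar y0, ordered top to bottom, left to right."""
    rows = []  # each entry: (y0 of the first box in the row, [(x0, text), ...])
    for x0, y0, text in sorted(boxes, key=lambda b: -b[1]):
        if rows and abs(rows[-1][0] - y0) <= y_tolerance:
            # Close enough vertically to the current row: same line.
            rows[-1][1].append((x0, text))
        else:
            rows.append((y0, [(x0, text)]))
    # Within each row, order the cells from left to right and join them.
    return [" ".join(t for _, t in sorted(cells)) for _, cells in rows]


for line in rows_from_boxes(boxes):
    print(line)
```

With this made-up input, each person and their position end up on the same line, which is the order I would like to get from the real PDF.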