I am trying to extract some tabular information from a PDF document.
Consider the following input:
Title 1
some text some text some text some text some text
some text some text some text some text some text
Table Title
| Col1 | Col2 | Col3 |
|---------------|---------|---------|
| val11 | val12 | val13 |
| val21 | val22 | val23 |
| val31 | val32 | val33 |
Title 2
some more text some more text some more text some more text
some more text
some more text some more text some more text some more text
I can get the outline/titles like this:
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfparser import PDFParser

path = 'myFile.pdf'
# Open a PDF file.
fp = open(path, 'rb')
# Create a PDF parser object associated with the file object.
parser = PDFParser(fp)
# Create a PDF document object that stores the document structure.
# Supply the password for initialization.
document = PDFDocument(parser, '')
# Each outline entry is a (level, title, dest, action, structure element) tuple.
outlines = document.get_outlines()
for (level, title, dest, a, se) in outlines:
    print((level, title))
This gives me:
(1, u'Title 1')
(2, u'Table Title')
(1, u'Title 2')
That is fine, because the levels line up with the text hierarchy. Now I can extract the text as follows:
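from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTTextBox
from pdfminer.pdfdocument import PDFTextExtractionNotAllowed
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage

# Stop if the document does not allow text extraction.
if not document.is_extractable:
    raise PDFTextExtractionNotAllowed
# Create a PDF resource manager object that stores shared resources.
rsrcmgr = PDFResourceManager()
# Create a PDF device object that aggregates the layout of each page.
laparams = LAParams()
device = PDFPageAggregator(rsrcmgr, laparams=laparams)
# Create a PDF interpreter object.
interpreter = PDFPageInterpreter(rsrcmgr, device)
# Process each page contained in the document.
with open('textFromPdf.txt', 'w') as text_from_pdf:
    for page in PDFPage.create_pages(document):
        interpreter.process_page(page)
        layout = device.get_result()
        for element in layout:
            if isinstance(element, LTTextBox):
                # Replace non-ASCII characters with spaces before writing.
                text_from_pdf.write(''.join([i if ord(i) < 128 else ' '
                                             for i in element.get_text()]))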
This gives me:
Title 1
some text some text some text some text some text some text some text
some text some text some text some text some text some text some text
Table Title
Col1
val11
val12
val13
Col2
val21
val22
val23
Col3
val31
val32
val33
Title 2
some more text some more text some more text some more text
some more text
some more text some more text some more text some more text
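This is a bit odd, because the table is extracted column by column. Can I get the table row by row instead? Also, how can I determine where the table begins and ends?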
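Converting the column-ordered values into rows can be done with the built-in zip() function. As for finding the end of the table, you will need to see whether you can detect some kind of change in formatting. - martineau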
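A minimal sketch of the zip() idea, not a definitive solution. It assumes the table region has already been isolated (for example, the lines between the 'Table Title' and 'Title 2' outline titles) and that the number of lines per column is known, here 4 (one header plus three values). The helper name columns_to_rows and its lines_per_column parameter are made up for illustration; the cell labels are copied verbatim from the extracted output in the question.

```python
# Table lines in the order they appear in the extracted text (column by column).
table_lines = [
    "Col1", "val11", "val12", "val13",
    "Col2", "val21", "val22", "val23",
    "Col3", "val31", "val32", "val33",
]


def columns_to_rows(lines, lines_per_column):
    """Split a flat, column-ordered list into columns, then transpose it.

    lines_per_column is an assumption supplied by the caller; nothing in
    the extracted text states it.
    """
    if lines_per_column <= 0 or len(lines) % lines_per_column != 0:
        raise ValueError("line count must be a multiple of lines_per_column")
    # Cut the flat list into one chunk per column.
    columns = [lines[i:i + lines_per_column]
               for i in range(0, len(lines), lines_per_column)]
    # zip(*columns) pairs up the n-th entry of every column, i.e. one row.
    return list(zip(*columns))


rows = columns_to_rows(table_lines, lines_per_column=4)
for row in rows:
    print(" | ".join(row))
```

The first tuple holds the three column headers, and each following tuple holds one entry from every column. If the column height is not known in advance, it has to be inferred some other way, which is the "change in formatting" problem mentioned in the comment.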