How can I extract tables from PDF files in Python?

Question

6

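I have thousands of PDF files that contain only tables, structured like this one: pdf file. They are fairly well structured, but I cannot read the tables without losing that structure. I tried PyPDF2, but the extracted data comes out completely scrambled.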

import PyPDF2 

# PyPDF2 >= 3.0 names; older releases used PdfFileReader, getPage and extractText
with open('pdf_file.pdf', 'rb') as pdfFileObj:
    pdfReader = PyPDF2.PdfReader(pdfFileObj)
    pageObj = pdfReader.pages[0]
    text = pageObj.extract_text()

print(text)
print(text.split('\n')[0])
print(text.split('/')[0])

I also tried Tabula, but it only reads the headers (and not the table contents).

from tabula import read_pdf

pdfFile1 = read_pdf('pdf_file.pdf', output_format='json')  # Option 1: reads all the headers
pdfFile2 = read_pdf('pdf_file.pdf', multiple_tables=True)  # Option 2: reads only the first header and a few lines of content

Any ideas?

- fmarques

Try tabula-py: https://pypi.org/project/tabula-py/ - ilja

4 Answers

2

Use the tabula library:

pip install tabula

Then extract the tables.

import tabula

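# url is the path or URL of the PDF; this reads page 63
dfs = tabula.read_pdf(url, pages=63, stream=True)

# if you want to read all pages
dfs = tabula.read_pdf(url, pages="all")

# read_pdf returns a list of DataFrames; this picks the second table
dfs[1]

By the way, I tried another way of reading PDF files, and the results were better than with the tabula library. I will post it soon.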

- zzhapar

8

pip install tabula actually installs https://github.com/ronniedada/tabula , which is not what you want; try tabula-py instead. - Noxeus

OK. Thank you very much! - zzhapar
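
Following up on the comment above: it may not be obvious from the import line alone which distribution is installed, so here is a minimal sketch that asks the packaging metadata directly. The helper name installed_version is made up for this illustration, and it assumes Python 3.8 or newer (for importlib.metadata).

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent.

    `distribution` is the name used with pip (e.g. "tabula-py"),
    not the name of the module that gets imported.
    """
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None
```

Calling installed_version("tabula") and installed_version("tabula-py") should show which of the two is present; if only the first one returns a version, uninstalling it and installing tabula-py is probably the next step.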

2

Try this: pip install tabula-py.

 from tabula import read_pdf
 df = read_pdf("file_name.pdf")

- ashishmishra

1

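This is the second snippet I posted in the question. Tabula only reads the table headers, not the contents. When it does read the contents, it reads only a few lines. - fmarques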

2

@fmarques

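You could also try a new Python package, SLICEmyPDF, developed by StatCan specifically for extracting tabular data from PDFs: https://github.com/StatCan/SLICEmyPDF

In my experience, SLICEmyPDF outperforms other free Python or R packages. It does require installing some additional free software, though. Installation instructions can be found at the following link: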

https://dataworldofredhairedgirl.blogspot.com/2022/04/how-to-install-statcan-slicemypdf-on.html

- 123456

- fmarques · Accepted Answer

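After some effort, I found a way.

For each page of the file, the table area and the column boundaries need to be defined in tabula's read_pdf function.

Here is the working code: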

import pypdf
from tabula import read_pdf

# Get the number of pages in the file
pdf_reader = pypdf.PdfReader(pdf_file)
n_pages = len(pdf_reader.pages)

# For each page the table can be read with the following code
table_pdf = read_pdf(
    pdf_file,
    guess=False,
    pages=1,
    stream=True,
    encoding="utf-8",
    area=(96, 24, 558, 750),
    columns=(24, 127, 220, 274, 298, 325, 343, 364, 459, 545, 591, 748),
)
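
The snippet above reads only page 1. One possible way to apply the same settings to every page is sketched below. It assumes that all pages share the same table layout, which the answer does not state, and the function names read_pdf_options and extract_all_pages are made up for this illustration. In tabula-py, area is given as (top, left, bottom, right) and columns as the x coordinates of the column boundaries; the numbers here are copied from the answer and would need to be measured again for any other PDF layout.

```python
from typing import Any, Dict, List, Sequence

# Values copied from the answer above; they fit that one PDF layout only.
AREA = (96, 24, 558, 750)  # (top, left, bottom, right)
COLUMNS = (24, 127, 220, 274, 298, 325, 343, 364, 459, 545, 591, 748)


def read_pdf_options(
    page: int,
    area: Sequence[float] = AREA,
    columns: Sequence[float] = COLUMNS,
) -> Dict[str, Any]:
    """Build the keyword arguments passed to tabula.read_pdf for one page."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    return {
        "pages": page,
        "guess": False,   # do not let tabula guess the table area
        "stream": True,   # split cells by whitespace, not by ruling lines
        "encoding": "utf-8",
        "area": list(area),
        "columns": list(columns),
    }


def extract_all_pages(pdf_file: str) -> List[Any]:
    """Return one entry per page: whatever read_pdf returned for that page."""
    # Imported here so read_pdf_options can be used without pypdf,
    # tabula-py, or Java being installed.
    import pypdf
    from tabula import read_pdf

    n_pages = len(pypdf.PdfReader(pdf_file).pages)
    return [
        read_pdf(pdf_file, **read_pdf_options(page))
        for page in range(1, n_pages + 1)
    ]
```

If the layout differs from page to page, the same loop could pass a different area and columns to read_pdf_options for each page instead of relying on the defaults.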