How can I extract tables and their names from a PDF file using Python and Camelot?

3
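I am trying to extract tables and their names from a PDF file using camelot in Python. I know how to extract the tables with camelot (that part is straightforward), but I am struggling to find any help on how to extract the table names. The intent is to extract this information and show a visual of the tables so that a user can select the relevant table from a list.
I have tried extracting both the tables and the text from the PDF. Both work, but I have no way of connecting a table name to its table.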
import glob
import os

import camelot
import PyPDF2

def tables_from_pdfs(filespath):
    pdffiles = glob.glob(os.path.join(filespath, "*.pdf"))
    print(pdffiles)
    dictionary = {}
    keys = []
    for file in pdffiles:
        print(file)
        # PdfReader / len(reader.pages) replace the removed PdfFileReader / getNumPages
        with open(file, 'rb') as f:
            n = len(PyPDF2.PdfReader(f).pages)
        print(n)
        tables_dict = {}
        # Camelot page numbers are 1-based, so iterate over 1..n rather than 0..n-1
        for i in range(1, n + 1):
            tables = camelot.read_pdf(file, pages=str(i))
            tables_dict[i] = tables
        head, tail = os.path.split(file)
        tail = tail.replace(".pdf", "")
        keys.append(tail)
        dictionary[tail] = tables_dict
    return dictionary, keys

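The expected result is each table together with the name it is given in the PDF file. For example: the table on page x of the PDF is named: Table 1. Blah Blah blah '''table'''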
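
To make the expected output concrete, here is a minimal sketch that scans already-extracted page text for caption-like lines. It assumes captions start with the word "Table" followed by a number, as in the example above. The function name `find_caption_lines` and the pattern are illustrative assumptions and are not part of Camelot.

```python
import re

# Assumption: captions look like "Table 1. Title" or "Table 2: Title".
CAPTION_PATTERN = re.compile(r"^\s*Table\s+(\d+)\s*[.:]\s*(.*)$", re.IGNORECASE)

def find_caption_lines(page_text):
    """Return (table_number, caption_title) pairs found in one page of text."""
    captions = []
    for line in page_text.splitlines():
        match = CAPTION_PATTERN.match(line)
        if match:
            captions.append((int(match.group(1)), match.group(2).strip()))
    return captions

# Made-up page text for illustration only.
sample_page = "Intro paragraph\nTable 1. Blah Blah blah\ncol_a col_b\n1 2\n"
captions = find_caption_lines(sample_page)
```

This only finds candidate captions; deciding which caption belongs to which extracted table still needs positional or content matching.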

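The code you posted does not show any attempt to get the table name. Camelot-py cannot give you what you are after. I would suggest using pdfminer or PyPDF2 to read the PDF objects together with their positions and extract the table name from those. - ExtractTable.com
Please read this link: https://dev59.com/Ibbna4cB1Zd3GeqPeaY3 There is no general solution. - Stefano Fiorucci - anakin87
Does this answer your question? Python PDF Parsing with Camelot and Extract the Table Title - Brian Wylie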
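
The position-based idea from the comments can be sketched without committing to a particular library. The example assumes you already have, for one page, the text lines with their vertical positions (for instance from pdfminer layout objects) and each table's bounding box. The tuple layouts, the `max_gap` value, and the helper name `nearest_caption_above` are hypothetical, and coordinates follow the PDF convention where y grows upward.

```python
def nearest_caption_above(table_bbox, text_lines, max_gap=50.0):
    """Pick the closest non-empty text line lying above the table's top edge.

    table_bbox: (x0, y0, x1, y1), with y1 the top edge (y grows upward).
    text_lines: list of (y_bottom, text) for each text line on the same page.
    max_gap: assumed maximum distance, in points, between caption and table.
    """
    top = table_bbox[3]
    best = None
    for y_bottom, text in text_lines:
        gap = y_bottom - top  # positive when the line sits above the table
        if 0 <= gap <= max_gap and text.strip():
            if best is None or gap < best[0]:
                best = (gap, text.strip())
    return best[1] if best else None

# Made-up coordinates for illustration only.
lines = [(700.0, "Annual report"), (520.0, "Table 1. Blah Blah blah"), (300.0, "Footnote")]
caption = nearest_caption_above((72.0, 350.0, 540.0, 510.0), lines)
```

Some documents place captions below their tables, in which case the gap test would need to be mirrored. As the comments note, there is no general solution.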
2 Answers

0

I was able to find a relative solution; at least it works for me.

import camelot
import pytesseract
from pdf2image import convert_from_path
from difflib import SequenceMatcher

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

similarityAmt = 0.6 # find with 60% similarity
def find_table_name(dataframe, documentString):

    # Get the keys from the dataframe as a list of strings
    keys = [str(key) for key in dataframe.keys()]

    # If the first key is a digit (Camelot labels columns 0, 1, 2, ...), we assume the real headers are in the first row instead
    if keys[0].isdigit():
        keys = [str(cell) for cell in dataframe.iloc[0]]

    # Put all of the keys in a single string, separated by spaces like a line of text
    keysAll = " ".join(keys)

    # Assuming that you extracted the text from a PDF, it should be multi-lined. We split by line
    stringsSeparated = documentString.split("\n")
    for i, string in enumerate(stringsSeparated):

        # Since a row should be horizontal, we check its similarity against the text line by line.
        similarRating = similar(string, keysAll)
        if similarRating > similarityAmt: # If the similarity rating (a ratio from 0 to 1) is above the similarity amount, we accept this line as the header row
            # Iterate up to 10 lines upwards until we find a line longer than 4 characters (an arbitrary number, just to ignore blank lines).
            for j in range(1, 11):
                if i - j < 0: # Reached the top of the document
                    break
                separatedString = stringsSeparated[i-j]
                if len(separatedString) > 4:
                    # Return this line together with the one above it to hopefully have an accurate name
                    previous = stringsSeparated[i-j-1] if i-j-1 >= 0 else ""
                    return previous + separatedString
    return "Unnamed"

# Retrieve the text from the pdf
pages = convert_from_path(pdf_path, 500) # pdf_path would be the path of the PDF which you extracted the table from; 500 is the DPI
pdf_text = ""
# Add all page strings into a single string, so the entire PDF is one single string
for imgBlob in pages:
    extractedText = pytesseract.image_to_string(imgBlob, lang='eng')
    pdf_text += extractedText + "\n"

# Get the name of the table using the table itself and pdf text
tableName = find_table_name(table.df, pdf_text) # A table you extracted with your code, which you want to find the name of
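
To get a feel for how the 0.6 threshold behaves, here is a small self-contained check of `SequenceMatcher.ratio()`. The header and the two text lines are invented for illustration; real OCR output will vary.

```python
from difflib import SequenceMatcher

def similar(a, b):
    # ratio() returns a float in [0, 1]; 1.0 means the sequences are identical.
    return SequenceMatcher(None, a, b).ratio()

header = "Name Age City"                           # header cells joined into one string
ocr_line = "Narne Age Cify"                        # the same row with typical OCR errors
other_line = "This report covers fiscal results"   # unrelated prose

header_score = similar(ocr_line, header)   # noisy copy of the header row
other_score = similar(other_line, header)  # unrelated line

print(header_score > 0.6, other_score > 0.6)
```

With these strings, the noisy header line clears the threshold and the unrelated line does not, which is the behaviour the lookup relies on. Short headers or heavily garbled OCR may need a different threshold.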

1
Adding comments/an explanation to the code would be helpful. - Ganesh Jadhav
Thanks for the reminder. I hope the comments make it clearer. - Connor White

-3

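In the Camelot API, tables are listed using the TableList and Table classes. The API can be found here: https://camelot-py.readthedocs.io/en/master/api.html#camelot.core.TableList

On that page, start at the section titled:

Lower-Level Classes

Camelot has no reference to a table name, only descriptions of the cell data. It does, however, use Python's pandas API, which may have the table name in it.

Combine Camelot and pandas to get the table name.

Get the name of a pandas DataFrame

Updated answer appended

From https://camelot-py.readthedocs.io/en/master/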

>>> import camelot
>>> tables = camelot.read_pdf('foo.pdf')
>>> tables
<TableList n=1>
>>> tables.export('foo.csv', f='csv', compress=True) # json, excel, html
>>> tables[0]
<Table shape=(7, 7)>
>>> tables[0].parsing_report
{
    'accuracy': 99.02,
    'whitespace': 12.24,
    'order': 1,
    'page': 1
}
>>> tables[0].to_csv('foo.csv') # to_json, to_excel, to_html
>>> df_table = tables[0].df # get a pandas DataFrame!

>>> # added: attach a name to the DataFrame
>>> df_table.name = 'name here'


# from https://dev59.com/-VwZ5IYBdhLWcg3wM97Z
import numpy as np
import pandas as pd
df = pd.DataFrame(data=np.ones([4, 4]))
df.name = 'Ones'

print(df.name)

Note: the added 'name' attribute is not part of the df. When the df is serialized, the added name attribute is lost.


More appended answer: the 'name' attribute is actually called 'index'.
Getting values

>>> df = pd.DataFrame([[1, 2], [4, 5], [7, 8]],
...      index=['cobra', 'viper', 'sidewinder'],
...      columns=['max_speed', 'shield'])
>>> df
            max_speed  shield
cobra               1       2
viper               4       5
sidewinder          7       8

Single label. Note this returns the row as a Series.

>>> df.loc['viper']
max_speed    4
shield       5
Name: viper, dtype: int64

1
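The name we are looking for does not belong to the table, so it is not part of the DataFrame. I don't think your answer solves the problem. - Stefano Fiorucci - anakin87
Hi Joe, thanks for the reply. I have gone through the documentation but still could not find the answer. I am relatively new to text-related packages (mainly camelot). Could you guide me a bit further and show which function can be used? Thanks, Vijay - Vijay
Yes, done. Note that you have to add the 'name' attribute to the df, and in some scenarios that attribute's data will be lost. - Joe McKenna
Thanks Joe. I think the code assigns a name rather than extracting it from the PDF. Anakin87 pointed out that the name does not belong to the table, so what we extract will not contain it. I am trying to get the table name from the PDF file the way the author wrote it :) - Vijay
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html#pandas.DataFrame.loc - Joe McKenna
Is the 'loc' function what one is looking for? See the examples at https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html#pandas.DataFrame.loc. Note that 'name' is actually called 'index'. - Joe McKenna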
