Extracting table data without gridlines and borders from scanned document images

Extracting table data from digitized PDFs can already be done easily with camelot or tabula. However, those solutions cannot handle scanned document pages when the table has no borders or inner gridlines. I have been trying to use OpenCV to generate the vertical and horizontal lines. However, since a scanned image has a slight rotation angle, it is difficult to proceed with this approach.
How can OpenCV be used to generate the grid (horizontal and vertical lines) and borders for a scanned document page containing tabular data? And if that is feasible, how can the rotation angle of the scan be compensated for?

You can use Pytesseract OCR to read the data from a scanned image of any document. Depending on the case you may need some preprocessing, such as grayscale conversion, morphological operations, or connected-component analysis. Please share your image so we can help you solve the problem ;) - Usama Aleem
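(For illustration, a minimal sketch of the preprocessing-plus-OCR route suggested above; the file name table_scan.png, the Otsu binarization, and the --psm 6 page mode are my assumptions, not from the comment:)

import cv2
import pytesseract

# load the scan and convert it to grayscale
img = cv2.imread("table_scan.png")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
# Otsu's method picks the binarization threshold automatically
_, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
# --psm 6 tells tesseract to treat the page as one uniform block of text
text = pytesseract.image_to_string(binary, config="--psm 6")
print(text)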
Aleem is right; I would really suggest using pytesseract and opencv with some preprocessing, e.g.: https://fazlurnu.com/2020/06/23/text-extraction-from-a-table-image-using-pytesseract-and-opencv/. - t2solve
Have you tried Amazon Textract? I used it to extract all the charges from scanned receipts into table form, and it worked well in my case. - Majico
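(For reference, a minimal sketch of that route with boto3, assuming the scan sits in a local file; the file name and region are placeholders. analyze_document with the TABLES feature returns the detected cells as blocks:)

import boto3

# read the scanned page as raw bytes
with open("receipt_scan.png", "rb") as f:
    imageBytes = f.read()

client = boto3.client("textract", region_name="us-east-1")
# FeatureTypes=["TABLES"] asks Textract for table structure, not just plain text
response = client.analyze_document(
    Document={"Bytes": imageBytes},
    FeatureTypes=["TABLES"],
)
# CELL blocks carry the row and column index of every detected table cell
for block in response["Blocks"]:
    if block["BlockType"] == "CELL":
        print(block["RowIndex"], block["ColumnIndex"])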
Is this interview related, by any chance? - Barry the Platipus
2 Answers

I wrote some code that estimates the horizontal lines of the printed letters on a page. I guess the same could be done for vertical lines. The code below rests on a few common assumptions; here are the basic steps in pseudocode style:
  • Prepare the image for contour detection

  • Run the contour detection

  • We assume that most contours are letters

    • Calculate the mean width of all contours
    • Calculate the mean contour area
  • Filter all contours with two conditions: a) contour (letter) height < mean height * 2 b) contour area > 4/5 of the mean area

  • Calculate the centroids of all remaining contours

  • Assume we have linear regions (bins)

    • Collect all centroids that fall inside a bin
    • Run a linear regression over the bin's points
    • Save the slope and intercept
  • Calculate the mean slope and intercept

Here is the complete code:

import cv2
import numpy as np
from scipy import stats

def resizeImageByPercentage(img,scalePercent = 60):
    width = int(img.shape[1] * scalePercent / 100)
    height = int(img.shape[0] * scalePercent / 100)
    dim = (width, height)
    # resize image
    return cv2.resize(img, dim, interpolation = cv2.INTER_AREA)

def calcAverageContourWidthAndHeight(contourList):
    hs = list()
    ws = list()
    for cnt in contourList:
        (x, y, w, h) = cv2.boundingRect(cnt)
        ws.append(w)
        hs.append(h)
    return np.mean(ws),np.mean(hs)

def calcAverageContourArea(contourList):
    areaList = list()
    for cnt in contourList:
        # minAreaRect returns ((cx, cy), (w, h), angle); the rect area is w * h
        (_, (w, h), _) = cv2.minAreaRect(cnt)
        areaList.append(w * h)
    return np.mean(areaList)

def calcCentroid(contour):
    houghMoments = cv2.moments(contour)
    # calculate x,y coordinate of centroid
    if houghMoments["m00"] != 0: #case no contour could be calculated
        cX = int(houghMoments["m10"] / houghMoments["m00"])
        cY = int(houghMoments["m01"] / houghMoments["m00"])
    else:
        # set values as what you need in the situation
        cX, cY = -1, -1
    return cX,cY

def getCentroidWhenSizeInRange(contourList,letterSizeWidth,letterSizeHeight,deltaOffset,minLetterArea=10.0):
    centroidList=list()
    for cnt in contourList:
        (x, y, w, h) = cv2.boundingRect(cnt)
        # minAreaRect returns ((cx, cy), (w, h), angle); use w * h as the area
        (_, (rw, rh), _) = cv2.minAreaRect(cnt)
        area = rw * rh

        #calc diff
        diffW = abs(w-letterSizeWidth) 
        diffH = abs(h-letterSizeHeight)
        #threshold A: size must be close to the mean letter size, within +- offset
        if diffW < deltaOffset and diffH < deltaOffset:
            #threshold B: area must exceed the min area
            if area > minLetterArea:
                cX,cY = calcCentroid(cnt)
                if cX!=-1 and cY!=-1:
                    centroidList.append((cX,cY))
    return centroidList
    
DEBUGMODE = True
#read image, do git clone https://github.com/WZBSocialScienceCenter/pdftabextract.git for the example
img = cv2.imread('pdftabextract/examples/catalogue_30s/data/ALA1934_RR-excerpt.pdf-2_1.png')
#get some basic infos
imgHeight, imgWidth, imgChannelAmount = img.shape

if DEBUGMODE:
    cv2.imwrite("img00original.jpg",resizeImageByPercentage(img,30))
    cv2.imshow("original",img)

# prepare img 
imgGrey = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
# apply Gaussian filter
imgGaussianBlur = cv2.GaussianBlur(imgGrey,(5,5),0)
#make binary img, black or white
_, imgBinThres = cv2.threshold(imgGaussianBlur, 130, 255, cv2.THRESH_BINARY)

## detect contours
contours, _ = cv2.findContours(imgBinThres, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)

#we get some letter parameters
averageLetterWidth, averageLetterHeight = calcAverageContourWidthAndHeight(contours)
threshold1AllowedLetterSizeOffset = averageLetterHeight * 2  # double size
averageContourAreaSizeOfMinRect = calcAverageContourArea(contours)
threshold2MinArea = 4 * averageContourAreaSizeOfMinRect / 5 # 4/5 * mean

print("mean letter width: ", averageLetterWidth)
print("mean letter height: ", averageLetterHeight)
print("threshold 1 tolerance: ", threshold1AllowedLetterSizeOffset)
print("mean letter area: ", averageContourAreaSizeOfMinRect)
print("threshold 2 min letter area: ", threshold2MinArea)

#we get all centroids of letter-sized contours, the others we ignore
centroidList = getCentroidWhenSizeInRange(contours,averageLetterWidth,averageLetterHeight,threshold1AllowedLetterSizeOffset,threshold2MinArea)

if DEBUGMODE:
    #debug print all centers:
    imgFilteredCenter = img.copy()
    for cX,cY in centroidList:
        #draw in red color as  BGR
        cv2.circle(imgFilteredCenter, (cX, cY), 5, (0, 0, 255), -1)
    cv2.imwrite("img01letterCenters.jpg",resizeImageByPercentage(imgFilteredCenter,30))
    cv2.imshow("letterCenters",imgFilteredCenter)

#we estimate a bin width
amountPixelFreeSpace = averageLetterHeight #TODO get a better estimate out of a histogram
estimatedBinWidth = round( averageLetterHeight + amountPixelFreeSpace) #TODO round better ?
binCollection = dict() #keyed by the bin's start y coordinate

#we separate the center points into bins by y coordinate
for i in range(0,imgHeight,estimatedBinWidth):
    listCenterPointsInBin = list()
    yMin = i 
    yMax = i + estimatedBinWidth
    for cX,cY in centroidList:
        if yMin < cY < yMax:#if fits in bin
            listCenterPointsInBin.append((cX,cY))

    binCollection[i] = listCenterPointsInBin
 
#we assume all points within a bin lie on one line
#model = slope (x) + intercept
#model = m (x) + n
mList = list() #slopes in absolute img coordinates
nList = list() #intercepts in absolute img coordinates
nListRelative = list() #intercepts relative to bin start
minAmountRegressionElements = 12 #also the minimum amount of letters we expect per line
#we run a regression over the points of every bin
for startYOfBin, values in binCollection.items():
    #we split the (x, y) tuples into separate lists
    xValues = [] #TODO use a shorter transform, e.g. zip(*values)
    yValues = [] 
    for x,y in values:
        xValues.append(x)
        yValues.append(y)

    #we require a minimum number of points per bin
    if len(xValues) >= minAmountRegressionElements :
        slope, intercept, r, p, std_err = stats.linregress(xValues, yValues)
        mList.append(slope)
        nList.append(intercept)
        #we calc the relative intercept
        nRelativeToBinStart = intercept - startYOfBin  
        nListRelative.append(nRelativeToBinStart)

if DEBUGMODE:
    #we debug print all lines in one picture
    imgLines = img.copy()
    colorOfLine = (0, 255, 0) #green
    for i in range(0,len(mList)):
        slope = mList[i]
        intercept = nList[i]
        startPoint = (0, int( intercept)) #better round ? 
        endPointY = int( (slope * imgWidth + intercept) )
        if endPointY < 0:
            endPointY = 0
        endPoint = (imgWidth,endPointY) #the line ends at the right image border
        cv2.line(imgLines, startPoint, endPoint, colorOfLine, 2) 

    cv2.imwrite("img02lines.jpg",resizeImageByPercentage(imgLines,30))
    cv2.imshow("linesOfLetters ",imgLines)

#we assume the mean over all bins gives a robust estimate
meanIntercept = np.mean(nListRelative)
meanSlope = np.mean(mList)
print("meanIntercept :", meanIntercept)
print("meanSlope ", meanSlope)

#the skew angle follows from math.atan(meanSlope); see the deskew sketch below the code

if DEBUGMODE:
    cv2.waitKey(0)
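
One way to close the TODO above, as a sketch that is not part of the original answer: the skew angle follows from math.atan of the mean slope, and cv2.warpAffine can rotate the page back.

import math

# slope -> skew angle in degrees
skewAngleDegree = math.degrees(math.atan(meanSlope))
# rotate around the image center; depending on the sign convention
# the angle may need to be negated
center = (imgWidth / 2, imgHeight / 2)
rotationMatrix = cv2.getRotationMatrix2D(center, skewAngleDegree, 1.0)
imgDeskewed = cv2.warpAffine(img, rotationMatrix, (imgWidth, imgHeight),
                             flags=cv2.INTER_LINEAR,
                             borderMode=cv2.BORDER_REPLICATE)
if DEBUGMODE:
    cv2.imwrite("img03deskewed.jpg", resizeImageByPercentage(imgDeskewed, 30))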

[Result images: original image, letter centers, fitted lines]



I once ran into the same problem, and this tutorial solved it. It shows how to use pdftabextract, a Python library written by Markus Konrad that uses OpenCV's Hough transform to detect lines and can even handle slightly skewed scanned documents. The tutorial walks you through parsing a German newspaper from the 1920s.
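(As an illustration of the underlying idea in plain OpenCV rather than pdftabextract's own API; the file name and thresholds are placeholders. The Hough transform returns each line as (rho, theta), and the theta values of the near-horizontal lines reveal how far the scan is rotated:)

import cv2
import numpy as np

img = cv2.imread("scanned_page.png", cv2.IMREAD_GRAYSCALE)
# detect edges first, then straight lines over the edge map
edges = cv2.Canny(img, 50, 150)
lines = cv2.HoughLines(edges, 1, np.pi / 180, threshold=200)

skewAngles = []
if lines is not None:
    for line in lines:
        rho, theta = line[0]
        # horizontal lines have a normal angle theta near pi/2;
        # the deviation from pi/2 is the skew of that line
        if abs(theta - np.pi / 2) < np.deg2rad(10):
            skewAngles.append(theta - np.pi / 2)
if skewAngles:
    print("estimated skew (degrees):", np.degrees(np.mean(skewAngles)))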


The problem is that it still works by detecting lines. We are looking for a solution that works well on borderless tables, e.g.: https://www.researchgate.net/post/How-to-extract-the-non-gridded-table-from-the-scanned-documents - Miranda
If you check the examples in the actual GitHub repository, you can also see screenshots matching your use case: https://github.com/WZBSocialScienceCenter/pdftabextract/tree/master/examples - alfx
