How to improve Hindi text extraction?

Question

11

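I am trying to extract Hindi text from a PDF. I have tried every method I could find for extracting text from a PDF directly, but none of them worked. There are explanations of why it doesn't work, but no answers. So I decided to convert the PDF to an image and then use pytesseract to extract the text. I have downloaded the Hindi trained data, but that also gives highly inaccurate text.

This is the actual Hindi text in the PDF (download link):

This is my code so far: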

import fitz  # PyMuPDF
import pytesseract
from PIL import Image

filepath = "D:\\BADI KA BANS-Ward No-002.pdf"

# Render one page of the PDF to a PNG file
doc = fitz.open(filepath)
page = doc.load_page(3)  # 0-based page index
pix = page.get_pixmap()
output = "outfile.png"
pix.save(output)

# Point pytesseract at the Tesseract executable
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

# Open the rendered page as a PIL image
image = Image.open(output)

# Run OCR with the Hindi trained data
image_to_text = pytesseract.image_to_string(image, lang='hin')

# Print the text
print(image_to_text)

Here is some sample output:

कार बिता देवी व ०... नाम बाइुनान िक०क नाक तो
पति का नाव: रवजी लात. “50९... पिला का सामशामाव.... “पति का नाम: बादुलल
कान सब: 43 लसमनंध्या: 93९. मकान ंब्या: 3९
आप: 29 _ लिंग सी. | आइ 57 लिंग पुरुष आप: 62 लिंग सी
एजगल्णब्णस्य (बन्द जगाख्मिणण्य
नमः बायगी बसों ०४... नि बयावर्णो ०५०... निफर सनक नी
चिता का नामजबूजल वर्ष.“ ००० | पिला का नामब्राइलाल वर्षो... 0 2... | पिता कामामशुल चब्द .... “20०
|सकानसंब्या: 43९ बसवकंब्या: 43९. कान संब्या: 44
जाए: 27 लिंग सो कई: 27 नि खी मा लिंग पुरुष

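On the question "I want to scrape a Hindi (Indian language) PDF file with Python" there is an answer that seems to show how to do this, but it gives no explanation.

Is there any way to do this?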

- Abhishek Rai

Can I contact you? I have some questions. - piedpiper

@piedpiper Me? - Abhishek Rai

Yes, @AbhishekRai - piedpiper

You can join the "PDF discussion" chat room here https://chat.meta.stackexchange.com/rooms?tab=all&sort=active - Abhishek Rai

I did, @abhishekrai - piedpiper

@piedpiper I am there. - Abhishek Rai

3 Answers

5

If you want to get the text from these "cards", I have managed to do it for page 3 with the tabula-py module:

import tabula

pdf_file = "BADI KA BANS-Ward No-002.pdf"
page = 3

x = 30      # left edge of the table
y = 160     # top edge of the table
w = 173     # width of a card
h = 73      # height of a card
photo = 61  # width of a photo

rows = 8    # number of rows of the table
cols = 3    # number of columns of the table

counter = 1

def get_area(row, col):
    ''' return area of the card in given position in the table '''
    top    = y + h * row
    left   = x + w * col
    bottom = top + h
    right  = left + w - photo
    return (top, left, bottom, right)

for row in range(rows):
    for col in range(cols):
        file_name = "card_" + str(counter).zfill(3) + ".txt"
        tabula.convert_into(pdf_file, file_name,
                            pages=page,
                            output_format="csv",
                            java_options="-Dfile.encoding=UTF8",
                            lattice=False,
                            area=get_area(row, col))
        counter += 1

Output

24 txt files:

card_001.txt
card_002.txt
card_003.txt
card_004.txt
.
.
.
card_023.txt
card_024.txt

card_001.txt:

1 RBP2469583
नरम: आरतल चररलर
नपतर कर नरम:लरलर ररम चररल
मकरन सखजर: १९
आज:  21 ललग: सल

card_002.txt

2 MRQ3101367
नरम: सरज दरल
नपतर कर नरम:ररमररतरर
मकरन सखजर: रल /18
आज:  44 ललग: सल

card_024.txt

24 RBP0230979
नरम: सनमतकरर
पनत कर नरम: हररलसह
मकरन सखजर: 13
आज:  41 ललग: सल

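As far as I can see, all the "cards" have the same size. If the pages looked alike, the solution could be applied to all of them. Unfortunately, the pages differ, so the initial variables have to be changed for every page. I see no way to do that automatically, unless the card numbers can be read from the cards themselves instead of coming from a simple counter.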
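One way to keep the per-page values manageable is to store them in a small table keyed by page number and derive every card area from it. The following is only a sketch: `PAGE_LAYOUTS` and `card_areas` are hypothetical names, and only the page 3 values come from the code above. Values for other pages would have to be measured.

```python
from typing import Dict, Iterator, Tuple

Area = Tuple[int, int, int, int]  # (top, left, bottom, right), the order tabula's `area` expects

# Hypothetical per-page layout table; only page 3 holds measured values.
PAGE_LAYOUTS: Dict[int, dict] = {
    3: dict(x=30, y=160, w=173, h=73, photo=61, rows=8, cols=3),
}


def card_areas(x, y, w, h, photo, rows, cols) -> Iterator[Tuple[int, Area]]:
    """Yield (card_number, area) for every card, row by row."""
    counter = 1
    for row in range(rows):
        for col in range(cols):
            top = y + h * row
            left = x + w * col
            # The photo strip on the right of each card is excluded.
            yield counter, (top, left, top + h, left + w - photo)
            counter += 1
```

Each yielded area can then be passed as `area=` to `tabula.convert_into`, with the card number used to build the file name.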

https://pypi.org/project/tabula-py/

https://aegis4048.github.io/parse-pdf-files-while-retaining-structure-with-tabula-py

- Yuri Khristich

2

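If you look at the question, the Hindi output there is better than this one. You can see repeated characters in this output. The grammar is broken and the words are all wrong too. So far "pytesseract" has come closest to correct Hindi. I have looked everywhere and cannot find a proper way to do this. Thanks a lot for your effort. - Abhishek Rai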

1

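Alas, I don't know Hindi, so it is really hard for me to see the errors in the output. The lines all look too fancy to me. But it is even fancier when a PDF contains editable text that displays fine on screen and yet comes out wrong after copy/paste from the PDF. I wonder, could this be due to the encoding? What if there is some non-UTF8 encoding for Hindi? - Yuri Khristich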
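One way to look into the encoding question without reading Hindi is to inspect the Unicode code points. The sketch below uses only the standard library; `describe` and `count_marks` are hypothetical helper names. It compares the first line of card_001.txt with the same line from the OCR output in the accepted answer.

```python
import unicodedata


def describe(text):
    """Return (character, code point, Unicode name) for each non-space character."""
    return [(ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
            for ch in text if not ch.isspace()]


def count_marks(text):
    """Count combining marks (categories Mn and Mc), e.g. Devanagari vowel signs."""
    return sum(unicodedata.category(ch) in ("Mn", "Mc") for ch in text)


tabula_line = "नरम: आरतल चररलर"  # first text line of card_001.txt
ocr_line = "नाम: आरती चावला"      # same line in the OCR output

print(count_marks(tabula_line), count_marks(ocr_line))
for row in describe(tabula_line):
    print(row)
```

The tabula line consists of valid Devanagari letters but carries no vowel signs at all, while the OCR line does. That hints at a problem in the PDF's font-to-Unicode mapping rather than a non-UTF8 text encoding, though this check alone does not prove it.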

-2

If you want to scrape text from a PDF accurately, you should use the correct font family and encoding when converting the image to text.

- Mritunjay

How do I do that? How can I find out which font family it uses? - Anmol Deep

- HansHirse · Accepted Answer

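I will offer some ideas on how to process your image, but I will limit myself to page 3 of the given document, i.e. the page shown in the question.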

To convert the PDF page to an image, I used pdf2image.
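For anyone who prefers to stay with PyMuPDF as in the question: `get_pixmap()` renders at 72 dpi by default, which is rather coarse for OCR. A minimal sketch of rendering at a higher resolution follows. The helper names `zoom_for_dpi` and `render_page` are hypothetical, and the method names assume a PyMuPDF version that provides the snake_case API.

```python
def zoom_for_dpi(dpi: float) -> float:
    """PDF user space has 72 units per inch, so the zoom factor is dpi / 72."""
    return dpi / 72.0


def render_page(pdf_path, page_index, dpi=300, out_path="page.png"):
    """Render one page (0-based index) as a grayscale PNG at the given DPI."""
    import fitz  # PyMuPDF, imported lazily so zoom_for_dpi has no dependency

    zoom = zoom_for_dpi(dpi)
    with fitz.open(pdf_path) as doc:
        page = doc.load_page(page_index)
        pix = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom),
                              colorspace=fitz.csGRAY)
        pix.save(out_path)
    return out_path
```

Calling `render_page("BADI KA BANS-Ward No-002.pdf", 2)` would render page 3 of the document, since the index is 0-based.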

For OCR I use pytesseract, but with lang='Devanagari' instead of lang='hin', see the Tesseract GitHub. In general, make sure to work through Improving the quality of the output in the Tesseract documentation, especially the section on the page segmentation method.
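Whether the script model is available depends on the local Tesseract installation, and its name differs depending on whether the traineddata file sits in the tessdata root or in a `script` subfolder. The sketch below picks the first installed model from a preference list and builds the page segmentation option. `pick_model`, `psm_config` and `ocr_image` are hypothetical helpers, not part of pytesseract.

```python
def pick_model(available, preferred=("Devanagari", "script/Devanagari", "hin")):
    """Return the first preferred Tesseract model that is installed, else None."""
    installed = set(available)
    for name in preferred:
        if name in installed:
            return name
    return None


def psm_config(psm=6):
    """Build the Tesseract config string for a page segmentation mode."""
    if not 0 <= psm <= 13:
        raise ValueError("Tesseract page segmentation modes range from 0 to 13")
    return f"--psm {psm}"


def ocr_image(image, psm=6):
    """OCR a PIL image or NumPy array with the best available Devanagari model."""
    import pytesseract  # imported lazily; requires a local Tesseract install

    lang = pick_model(pytesseract.get_languages(config=""))
    if lang is None:
        raise RuntimeError("No Devanagari or Hindi model found in tessdata")
    return pytesseract.image_to_string(image, lang=lang, config=psm_config(psm))
```

Mode 6 treats the image as a single uniform block of text, which suits the cropped cells used in this answer. Other crops may need a different mode.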

Here is a (detailed) description of the whole process:

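1. Inverse binarize the image for contour finding: white text, shapes, etc. on a black background.
2. Find all contours and pick out the two very large ones, which are the two tables.
3. Extract the text outside of the two tables:
  1. Mask out the tables in the binarized image.
  2. Apply morphological closing to connect the remaining text lines.
  3. Find the contours and bounding rectangles of these text lines.
  4. Run pytesseract to extract the text.
4. Extract the text inside of the two tables:
  1. Extract the cells, or rather their bounding rectangles, from the current table.
  2. For the first table:
    1. Run pytesseract to extract the text as-is.
  3. For the second table:
    1. Floodfill the rectangle around the number to prevent faulty OCR output.
    2. Mask the left (Hindi) and right (English) parts.
    3. Run pytesseract with lang='Devanagari' on the left part and lang='eng' on the right part to improve the OCR quality for both.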

Here is the full code:

import cv2
import numpy as np
import pdf2image
import pytesseract

# Extract page 3 from PDF in proper quality
page_3 = np.array(pdf2image.convert_from_path('BADI KA BANS-Ward No-002.pdf',
                                              first_page=3, last_page=3,
                                              dpi=300, grayscale=True)[0])

# Inverse binarize for contour finding
thr = cv2.threshold(page_3, 128, 255, cv2.THRESH_BINARY_INV)[1]

# Find contours w.r.t. the OpenCV version
cnts = cv2.findContours(thr, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
cnts = cnts[0] if len(cnts) == 2 else cnts[1]

# STEP 1: Extract texts outside of the two tables

# Mask out the two tables
cnts_tables = [cnt for cnt in cnts if cv2.contourArea(cnt) > 10000]
no_tables = cv2.drawContours(thr.copy(), cnts_tables, -1, 0, cv2.FILLED)

# Find bounding rectangles of texts outside of the two tables
no_tables = cv2.morphologyEx(no_tables, cv2.MORPH_CLOSE, np.full((21, 51), 255))
cnts = cv2.findContours(no_tables, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
cnts = cnts[0] if len(cnts) == 2 else cnts[1]
rects = sorted([cv2.boundingRect(cnt) for cnt in cnts], key=lambda r: r[1])

# Extract texts from each bounding rectangle
print('\nExtract texts outside of the two tables\n')
for (x, y, w, h) in rects:
    text = pytesseract.image_to_string(page_3[y:y+h, x:x+w],
                                       config='--psm 6', lang='Devanagari')
    text = text.replace('\n', '').replace('\f', '')
    print('x: {}, y: {}, text: {}'.format(x, y, text))

# STEP 2: Extract texts from inside of the two tables

rects = sorted([cv2.boundingRect(cnt) for cnt in cnts_tables],
               key=lambda r: r[1])

# Iterate each table
for i_r, (x, y, w, h) in enumerate(rects, start=1):

    # Find bounding rectangles of cells inside of the current table
    cnts = cv2.findContours(page_3[y+2:y+h-2, x+2:x+w-2],
                            cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
    cnts = cnts[0] if len(cnts) == 2 else cnts[1]
    inner_rects = sorted([cv2.boundingRect(cnt) for cnt in cnts],
                         key=lambda r: (r[1], r[0]))

    # Extract texts from each cell of the current table
    print('\nExtract texts inside table {}\n'.format(i_r))
    for (xx, yy, ww, hh) in inner_rects:

        # Set current coordinates w.r.t. full image
        xx += x
        yy += y

        # Get current cell
        cell = page_3[yy+2:yy+hh-2, xx+2:xx+ww-2]

        # For table 1, simply extract texts as-is
        if i_r == 1:
            text = pytesseract.image_to_string(cell, config='--psm 6',
                                               lang='Devanagari')
            text = text.replace('\n', '').replace('\f', '')
            print('x: {}, y: {}, text: {}'.format(xx, yy, text))

        # For table 2, extract single elements
        if i_r == 2:

            # Floodfill rectangles around numbers
            ys, xs = np.min(np.argwhere(cell == 0), axis=0)
            seed = (int(xs), int(ys))  # plain ints for OpenCV's seed point
            temp = cv2.floodFill(cell.copy(), None, seed, 255)[1]
            mask = cv2.floodFill(thr[yy+2:yy+hh-2, xx+2:xx+ww-2].copy(),
                                 None, seed, 0)[1]

            # Extract left (Hindi) and right (English) parts
            mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE,
                                    np.full((2 * hh, 5), 255))
            cnts = cv2.findContours(mask,
                                    cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
            cnts = cnts[0] if len(cnts) == 2 else cnts[1]
            boxes = sorted([cv2.boundingRect(cnt) for cnt in cnts],
                           key=lambda b: b[0])

            # Extract texts from each part of the current cell
            for i_b, (x_b, y_b, w_b, h_b) in enumerate(boxes, start=1):

                # For the left (Hindi) part, extract Hindi texts
                if i_b == 1:

                    text = pytesseract.image_to_string(
                        temp[y_b:y_b+h_b, x_b:x_b+w_b],
                        config='--psm 6',
                        lang='Devanagari')
                    text = text.replace('\f', '')

                # For the right (English) part, extract English texts
                if i_b == 2:

                    text = pytesseract.image_to_string(
                        temp[y_b:y_b+h_b, x_b:x_b+w_b],
                        config='--psm 6',
                        lang='eng')
                    text = text.replace('\f', '')

                print('x: {}, y: {}, text:\n{}'.format(xx, yy, text))

Here are the first few lines of the output:

Extract texts outside of the two tables

x: 972, y: 93, text: राज्य निर्वाचन आयोग, राजस्थान
x: 971, y: 181, text: पंचायत चुनाव निर्वाचक नामावली, 2021
x: 166, y: 610, text: मिश्र का बाढ़ ,श्रीराम की नॉगल
x: 151, y: 3417, text: आयु 1 जनवरी 2021 के अनुसार
x: 778, y: 3419, text: पृष्ठ संख्या : 3 / 10

Extract texts inside table 1

x: 146, y: 240, text: जिलापरिषद का नाम : जयपुर
x: 1223, y: 240, text: जि° प° सदस्य निर्वाचन क्षेत्र : 21
x: 146, y: 327, text: पंचायत समिति का नाम : सांगानेर
x: 1223, y: 327, text: पं° स° सदस्य निर्वाचन क्षेत्र : 6
x: 146, y: 415, text: ग्रामपंचायत : बडी का बांस
x: 1223, y: 415, text: वार्ड क्रमांक : 2
x: 146, y: 502, text: विधानसभा क्षेत्र की संख्या एवं नाम:- 56-बगरु

Extract texts inside table 2

x: 142, y: 665, text:
1 RBP2469583
नाम: आरती चावला
पिता का नामःलाला राम चावला
मकान संख्याः १९
आयुः 21 लिंगः स्त्री

x: 142, y: 665, text:
Photo is
Available

x: 867, y: 665, text:
2 MRQ3101367
नामः सूरज देवी
पिता का नामःरामावतार
मकान संख्याः डी /18
आयुः 44 लिंगः स्त्री

x: 867, y: 665, text:
Photo is
Available

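I checked some of the text with a character-by-character comparison and think it looks quite good, but since I can neither understand Hindi nor read Devanagari script, I can't comment on the overall quality of the OCR. Please let me know!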

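Annoyingly, the number 9 in the corresponding "card" is falsely extracted as 2. I assume this happens because its font differs from the rest of the text, in combination with lang='Devanagari'. I couldn't find a solution for that, other than extracting the rectangle from the "card" separately.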
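A sketch of that last idea, untested on this document: crop the number rectangle and run it through the English model restricted to ASCII digits. It assumes that the box covers only the serial number and that the installed Tesseract honours `tessedit_char_whitelist`. `digits_config` and `ocr_number_box` are hypothetical helper names.

```python
def digits_config(psm=7):
    """Config for a single text line restricted to ASCII digits."""
    return f"--psm {psm} -c tessedit_char_whitelist=0123456789"


def ocr_number_box(cell, box):
    """OCR only the number rectangle `box` = (x, y, w, h) inside a card image.

    `cell` is a NumPy image array such as the `cell` crops in the code above.
    """
    import pytesseract  # imported lazily; requires a local Tesseract install

    x, y, w, h = box
    text = pytesseract.image_to_string(cell[y:y+h, x:x+w],
                                       lang="eng", config=digits_config())
    return text.strip()
```

Page segmentation mode 7 treats the crop as a single text line. The box coordinates would have to come from the contour of the rectangle that is floodfilled in the code above.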

----------------------------------------
System information
----------------------------------------
Platform:      Windows-10-10.0.19041-SP0
Python:        3.9.1
PyCharm:       2021.1.1
NumPy:         1.19.5
OpenCV:        4.5.2
pdf2image:     1.14.0
pytesseract:   5.0.0-alpha.20201127
----------------------------------------