How can I improve OCR accuracy with Pytesseract?

4
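I want to extract text from an image using Python, so I chose pytesseract. When I tried extracting text from the image, the results weren't satisfactory. I also went through this article and implemented all of the techniques it lists, but they don't seem to help much.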
Image:

[image of the text to be recognized]

Code:
import pytesseract
import cv2
import numpy as np

img = cv2.imread('D:\\wordsimg.png')

img = cv2.resize(img, None, fx=1.2, fy=1.2, interpolation=cv2.INTER_CUBIC)

img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

kernel = np.ones((1,1), np.uint8)
img = cv2.dilate(img, kernel, iterations=1)
img = cv2.erode(img, kernel, iterations=1)

img = cv2.threshold(cv2.medianBlur(img, 3), 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]

pytesseract.pytesseract.tesseract_cmd = 'C:\\Program Files\\Tesseract-OCR\\tesseract.exe'
    
txt = pytesseract.image_to_string(img ,lang = 'eng')

txt = txt[:-1]

txt = txt.replace('\n',' ')

print(txt)

Output:

t hose he large form might light another us should took mountai house n story important went own own thought girl over family look some much ask the under why miss point make mile grow do own school was 

Even a single unnecessary space could be costly for me. I want the results to be 100% accurate. Any help would be much appreciated.

1 Answer

7

I changed the resize factor from 1.2 to 2 and removed all of the preprocessing. I got good results with psm 11 and psm 12.

import pytesseract
import cv2
import numpy as np

img = cv2.imread('wavy.png')

#  img = cv2.resize(img, None, fx=1.2, fy=1.2, interpolation=cv2.INTER_CUBIC)
img = cv2.resize(img, None, fx=2, fy=2)

img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

kernel = np.ones((1,1), np.uint8)
#  img = cv2.dilate(img, kernel, iterations=1)
#  img = cv2.erode(img, kernel, iterations=1)

#  img = cv2.threshold(cv2.medianBlur(img, 3), 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]

cv2.imwrite('thresh.png', img)

pytesseract.pytesseract.tesseract_cmd = 'C:\\Program Files (x86)\\Tesseract-OCR\\tesseract.exe'
    
# Try each page segmentation mode from 6 to 13 and print what Tesseract returns.
for psm in range(6, 13 + 1):
    config = '--oem 3 --psm %d' % psm
    txt = pytesseract.image_to_string(img, config=config, lang='eng')
    print('psm ', psm, ':', txt)

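The line config = '--oem 3 --psm %d' % psm uses the string-formatting (%) operator to replace %d with the integer psm. I'm not sure what oem does, but I've gotten into the habit of using it. There is more about psm at the end of this answer.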
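To make that formatting step concrete, here is a minimal sketch that builds the same config string and checks the mode numbers first. The helper name build_config is hypothetical and not part of pytesseract. The accepted ranges (psm 0 to 13, oem 0 to 3) are assumptions based on what tesseract --help-psm and tesseract --help-oem print, and may differ between Tesseract versions.

```python
def build_config(psm, oem=3):
    """Return a Tesseract config string such as '--oem 3 --psm 11'.

    Hypothetical helper for illustration; the ranges below are assumptions
    taken from the --help-psm and --help-oem listings.
    """
    if not 0 <= psm <= 13:
        raise ValueError("psm must be between 0 and 13, got %r" % (psm,))
    if not 0 <= oem <= 3:
        raise ValueError("oem must be between 0 and 3, got %r" % (oem,))
    # Same %-formatting as in the answer, with both numbers filled in.
    return '--oem %d --psm %d' % (oem, psm)


psm = 11
# An f-string produces exactly the same text as the % operator.
assert build_config(psm) == f'--oem 3 --psm {psm}'
print(build_config(psm))
```

The resulting string can be passed as the config argument of pytesseract.image_to_string, exactly as in the loop above.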

psm  11 : those he large form might light another us should name

took mountain story important went own own thought girl

over family look some much ask the under why miss point

make mile grow do own school was

psm  12 : those he large form might light another us should name

took mountain story important went own own thought girl

over family look some much ask the under why miss point

make mile grow do own school was
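The output above has a blank line between each line of text. The question flattens the result with txt.replace('\n', ' '), which turns every blank line into two consecutive spaces. A minimal sketch of an alternative, assuming all you need is the words separated by single spaces (normalize_whitespace is a hypothetical helper name):

```python
def normalize_whitespace(text):
    """Collapse every run of spaces, tabs, and newlines into one space."""
    # str.split() with no arguments splits on any whitespace run and
    # drops leading and trailing whitespace, so no empty pieces remain.
    return ' '.join(text.split())


# Hand-typed excerpt in the same layout as the psm 11 output above.
raw = "those he large form\n\ntook mountain story\n"
print(normalize_whitespace(raw))
```

This only tidies the layout. It cannot repair a space that Tesseract placed inside a word, such as "t hose" in the question's output; that has to be fixed at the recognition stage.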

psm stands for page segmentation mode. I'm not sure what all of the modes do, but the descriptions give a sense of what each number means. You can get the list from tesseract --help-psm.

Page segmentation modes:
  0    Orientation and script detection (OSD) only.
  1    Automatic page segmentation with OSD.
  2    Automatic page segmentation, but no OSD, or OCR. (not implemented)
  3    Fully automatic page segmentation, but no OSD. (Default)
  4    Assume a single column of text of variable sizes.
  5    Assume a single uniform block of vertically aligned text.
  6    Assume a single uniform block of text.
  7    Treat the image as a single text line.
  8    Treat the image as a single word.
  9    Treat the image as a single word in a circle.
 10    Treat the image as a single character.
 11    Sparse text. Find as much text as possible in no particular order.
 12    Sparse text with OSD.
 13    Raw line. Treat the image as a single text line,
       bypassing hacks that are Tesseract-specific.
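Since the loop in this answer prints one result per mode, it can help to score each result against text you already know to be correct instead of comparing by eye. The following is only a sketch under stated assumptions: you have a hand-typed reference for a test image, the names similarity and best_mode are hypothetical, and the sample strings are placeholders loosely based on the outputs in this thread, not real measurements.

```python
import difflib


def similarity(candidate, reference):
    """Return a ratio in [0, 1] for how closely two strings match.

    Whitespace layout (newlines, repeated spaces) is ignored.
    """
    a = ' '.join(candidate.split())
    b = ' '.join(reference.split())
    return difflib.SequenceMatcher(None, a, b).ratio()


def best_mode(results, reference):
    """results maps a psm number to recognized text; return the best psm."""
    return max(results, key=lambda psm: similarity(results[psm], reference))


# Placeholder data: a known-correct reference and two candidate outputs.
reference = "those he large form might light another us should name"
results = {
    3: "t hose he large form might light another us should",
    11: "those he large form might light\n\nanother us should name",
}
print(best_mode(results, reference))
```

In practice the results dictionary would be filled inside the psm loop with the text returned by pytesseract.image_to_string.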

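Cool! One small request: could you explain what psm is, and what config = '--oem 3 --psm %d' % psm means? - Sushil
If you think my question is good and clearly stated, please consider upvoting it. Thanks! - Sushil
OEM is the engine mode, and --oem 3 runs the default. If you run the Tesseract executable with --help, you can see the full list of options. This page is also very helpful: https://github.com/tesseract-ocr/tesseract/blob/main/doc/tesseract.1.asc#config-files-and-augmenting-with-user-data - Edward Spencer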
