How can I improve OCR accuracy with pytesseract?

Question

4

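I want to extract text from an image in Python, so I chose pytesseract. When I tried extracting the text, the results were not satisfactory. I also went through this article and implemented all of the techniques it lists, but they don't seem to help much.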

Image:

Code:

import pytesseract
import cv2
import numpy as np

img = cv2.imread('D:\\wordsimg.png')

img = cv2.resize(img, None, fx=1.2, fy=1.2, interpolation=cv2.INTER_CUBIC)

img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

kernel = np.ones((1,1), np.uint8)
img = cv2.dilate(img, kernel, iterations=1)
img = cv2.erode(img, kernel, iterations=1)

img = cv2.threshold(cv2.medianBlur(img, 3), 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]

pytesseract.pytesseract.tesseract_cmd = 'C:\\Program Files\\Tesseract-OCR\\tesseract.exe'
    
txt = pytesseract.image_to_string(img ,lang = 'eng')

txt = txt[:-1]

txt = txt.replace('\n',' ')

print(txt)

Output:

t hose he large form might light another us should took mountai house n story important went own own thought girl over family look some much ask the under why miss point make mile grow do own school was

Even a single unnecessary space could be costly for me. I want the results to be 100% accurate. Any help would be much appreciated.

- Sushil

1 Answer

- bfris · Accepted Answer

I changed the resize factor from 1.2 to 2 and removed all of the preprocessing. I got good results with psm 11 and psm 12.

import pytesseract
import cv2
import numpy as np

img = cv2.imread('wavy.png')

#  img = cv2.resize(img, None, fx=1.2, fy=1.2, interpolation=cv2.INTER_CUBIC)
img = cv2.resize(img, None, fx=2, fy=2)

img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

kernel = np.ones((1,1), np.uint8)
#  img = cv2.dilate(img, kernel, iterations=1)
#  img = cv2.erode(img, kernel, iterations=1)

#  img = cv2.threshold(cv2.medianBlur(img, 3), 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]

cv2.imwrite('thresh.png', img)

pytesseract.pytesseract.tesseract_cmd = 'C:\\Program Files (x86)\\Tesseract-OCR\\tesseract.exe'
    
for psm in range(6,13+1):
    config = '--oem 3 --psm %d' % psm
    txt = pytesseract.image_to_string(img, config = config, lang='eng')
    print('psm ', psm, ':',txt)

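The line config = '--oem 3 --psm %d' % psm uses the string interpolation (%) operator to replace %d with an integer (psm). I'm not exactly sure what oem does, but I've gotten into the habit of using it. There is more about psm at the end of this answer.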

psm  11 : those he large form might light another us should name

took mountain story important went own own thought girl

over family look some much ask the under why miss point

make mile grow do own school was

psm  12 : those he large form might light another us should name

took mountain story important went own own thought girl

over family look some much ask the under why miss point

make mile grow do own school was
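
The psm 11 and psm 12 results above contain blank lines between the text lines, and the question flattens the text with replace('\n', ' '), which would leave doubled spaces in that case. Below is a minimal sketch of one way to collapse every run of whitespace into a single space. The name normalize_ocr_text is hypothetical (it is not part of pytesseract), and the sample string only imitates the shape of the output above.

```python
def normalize_ocr_text(txt: str) -> str:
    """Collapse every run of whitespace into a single space."""
    # str.split() with no arguments splits on any whitespace run (spaces,
    # newlines, form feeds) and ignores leading/trailing whitespace, so
    # joining the pieces with ' ' leaves exactly one space between words.
    return ' '.join(txt.split())


# Sample shaped like the psm 11 output: blank lines between text lines and
# a trailing form feed, which Tesseract output may end with.
raw = "those he large form\n\ntook mountain story\n\nmake mile grow\n\x0c"
print(normalize_ocr_text(raw))
```

This would replace both the txt[:-1] slice and the replace('\n', ' ') call from the question, since trailing whitespace is dropped as well. It cannot repair a space that Tesseract places inside a word.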

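psm is short for page segmentation mode. I'm not exactly sure what all the different modes do, but you can get a feel for what the codes mean from their descriptions. You can get the list from tesseract --help-psm.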

Page segmentation modes:
  0    Orientation and script detection (OSD) only.
  1    Automatic page segmentation with OSD.
  2    Automatic page segmentation, but no OSD, or OCR. (not implemented)
  3    Fully automatic page segmentation, but no OSD. (Default)
  4    Assume a single column of text of variable sizes.
  5    Assume a single uniform block of vertically aligned text.
  6    Assume a single uniform block of text.
  7    Treat the image as a single text line.
  8    Treat the image as a single word.
  9    Treat the image as a single word in a circle.
 10    Treat the image as a single character.
 11    Sparse text. Find as much text as possible in no particular order.
 12    Sparse text with OSD.
 13    Raw line. Treat the image as a single text line,
       bypassing hacks that are Tesseract-specific.
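
To choose among these modes for a particular image, one option is to run several of them and score each result against a hand-typed reference text. The sketch below assumes pytesseract and Tesseract are installed. The helper names build_config, similarity, and compare_psm_modes are hypothetical. difflib's ratio is used only as a rough similarity measure, not as a standard OCR accuracy metric.

```python
import difflib


def build_config(psm: int, oem: int = 3) -> str:
    """Build the same kind of config string the answer passes to pytesseract."""
    # Modes 0-13 are the ones shown in the `tesseract --help-psm` list above.
    if not 0 <= psm <= 13:
        raise ValueError(f"psm must be between 0 and 13, got {psm}")
    return f"--oem {oem} --psm {psm}"


def similarity(ocr_text: str, reference: str) -> float:
    """Rough similarity in [0, 1], compared after collapsing whitespace runs."""
    a = ' '.join(ocr_text.split())
    b = ' '.join(reference.split())
    return difflib.SequenceMatcher(None, a, b).ratio()


def compare_psm_modes(img, reference, modes=(6, 11, 12)):
    """Return (psm, score) pairs sorted from best to worst score."""
    import pytesseract  # imported lazily so the helpers above work without it

    scores = []
    for psm in modes:
        txt = pytesseract.image_to_string(img, config=build_config(psm), lang='eng')
        scores.append((psm, similarity(txt, reference)))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```

With img prepared as in the answer (resized by 2 and converted to grayscale), compare_psm_modes(img, reference) would rank the chosen modes for that image. The default modes here are an arbitrary starting set, so the ranking only holds for the image and reference text it was run on.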