Tesseract - 无法识别希腊字母

Question

Tesseract - 无法识别希腊字母

pythonocrtesseracttraining-datapython-tesseract

7

我正在尝试从图像中自动提取比例尺（比例尺条+数字+单位）。这是一个例子：

它用于将像素映射到现实世界的测量。

我正在使用通过Anaconda3安装的PyTesseract。

这是我的代码:

import cv2
import pytesseract
import numpy as np

img = cv2.imread('pbmk_scale.tif')
#img = cv2.imread('ocr_test_greek_and_english.png')
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
blur = cv2.GaussianBlur(gray, (3,3), 0)
thresh = cv2.threshold(blur, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]

# Morph open to remove noise and invert image
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (3,3))
opening = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, kernel, iterations=1)
invert = 255 - opening

# Line detection for the scale line
edges = cv2.Canny(gray,50,150,apertureSize = 3)
minLineLength = 100
maxLineGap = 10
lines = cv2.HoughLinesP(edges,1,np.pi/180,100,minLineLength,maxLineGap)
x1,y1,x2,y2 = lines[0][0]
print('Line (' + str(x1) + ',' + str(y1) + ') -- (' + str(x2) + ',' + str(y2) + ')')
# Calculating lenght of scale line in pixels. Since the line is always horizontal we need to just subtract the X coordinates
l = abs(x1 - x2)
print('Line is ' + str(l) + 'px long')

# Text recognition for the scale number and real unit
# FIXME Greek not detected. Is it grc or ell for the configuration? Both don't work
custom_config = r'-l grc+eng --psm 1' # Greek (for mu and nu letters) and English (for m (metre))
text = pytesseract.image_to_string(img, config=custom_config)
print('OUTPUT:', text.split())
number = [int(s) for s in text.split() if s.isdigit()]
print('Number is ' + str(number))

到目前为止，它的表现相当不错，特别是因为图像是通过氦离子显微镜生成的，标签（比例尺所在位置）是自动生成并与TIFF一起存储的。因此，文字和线条的检测非常准确。此外，比例尺始终位于图像的同一位置，并且实际比例线始终水平。上述代码存在缺陷，但我更关心的是我无法检测除英语以外的任何内容。

可悲的是，当涉及到提供的软件包的说明时（特别是如果您查看导航器），Anaconda非常加密。所以我做了一些研究，在

C：\ Users \ USER_NAME \ anaconda3 \ envs \ MachineLearning \ tessdata 下（其中 MachineLearning 是我的自定义虚拟环境），我找到了两个东西：


只有两个 .traineddata 文件-  eng.traineddata 和 osd.traineddata  
  eng.traineddata 比我在Tesseract项目的 git repo 中找到的一个要小得多，托管在 GitHub 上。


我下载了多个已训练的数据文件（eng、ell和grc）。我进行了一次测试，只使用了grc和ell（分别加上组合），以及同时在图像中使用希腊语和英语。例如，在上面的代码中删除了行检测部分后，得到以下图像：



产生以下结果：
OUTPUT: ['Here’s', 'some', 'GBeek', 'Od10', 'd1ota', 'iumEedit', 'Oy']


我尝试了各种有意义的PSM参数值，但没有任何变化。

我对OCR和Tesseract很新，所以我可能忽略了一些非常明显的东西。

- rbaleksandar

2个回答

0

尝试使用PNG图像和以下代码：

from PIL import Image
from pytesseract import *

img_path = r'your image path'
tessdata_dir_config = r'C:\Program Files\Tesseract-OCR\tessdata'
language = 'grc'

def process_image(iamge_name, lang_code, tessdata_dir_config):
    return pytesseract.image_to_string(Image.open(iamge_name), lang=lang_code, config=tessdata_dir_config)

def print_data(data):
    print(data)

def output_file(filename, data):
    file = open(filename, "w+")
    file.write(data)
    file.close()

def main():
    data_gr = process_image(img_path, language, tessdata_dir_config)
    print_data(data_gr)
    #output_file('my_ocr', data_gr)

if  __name__ == '__main__':
    main()

- Cpt_Nemos

这怎么回答 OP 提出的问题？输出是什么？ - LoneWanderer

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- Antonio Abrantes · Accepted Answer

缺少指定 tesseract 可执行文件的命令

pytesseract.pytesseract.tesseract_cmd = 'D:/Tesseract-OCR/tesseract.exe'
custom_config = r'-l grc+eng --psm 1' # Greek (for mu and nu letters) and English (for m (metre))