如何在 tesseract 中打印中文字符的识别结果

Question

如何在 tesseract 中打印中文字符的识别结果

4

我正在尝试使用Tesseract识别中文，并且它可以工作。唯一的问题是，结果不是以中文字符打印，而是以拼音（用英语键入中文单词的方式）打印出来。

# Import libraries
from PIL import Image
import pytesseract
from unidecode import unidecode

pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

image_counter = 2

filelimit = image_counter - 1

outfile = "out_text.txt"

f = open(outfile, "a")

for i in range(1, filelimit + 1):
    print("ran")
    filename = "page_" + str(i) + ".png"

    # Recognize the text as string in image using pytesserct
    text = unidecode(((pytesseract.image_to_string(Image.open(filename), lang = "chi_sim"))))

    print(text)

这是我运行的图片

这是我得到的结果

然清明时节雨纷纷，路上行人欲断魂。借问酒家何处有？牧童遥指杏花村。

结果应该与图片中显示的中文字符相同。

- Bolwic

可能是使用Python进行Pytesseract外语提取的重复问题。 - Nouman

1个回答

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- Bolwic · Accepted Answer

没关系，我解决了我的问题。

text = unidecode(((pytesseract.image_to_string(Image.open(filename), lang = "chi_sim"))))

应该是这样的。

text = pytesseract.image_to_string(Image.open(filename), lang = "chi_tra")