Image to text in Python 2.7 - removing non-ASCII characters

Question


python image-processing ocr tesseract python-tesseract

3

I am using pytesser to OCR a small image and get a string from it:

from PIL import Image
from pytesser import image_to_string

image = Image.open(ImagePath)
text = image_to_string(image)
print text

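However, pytesser sometimes recognizes and returns non-ASCII characters. The problem arises when I try to print what was just recognized: in Python 2.7, which I am using, the program crashes.

Is there a way to make pytesser not return any non-ASCII characters? Perhaps there is something that can be changed in tesseract OCR?

Or is there a way to test whether a string contains non-ASCII characters (without crashing the program) and then simply not print that line?

Some would suggest using Python 3.4, but from my research it seems pytesser does not work with it: Pytesser in Python 3.4: name 'image_to_string' is not defined?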

- Micro

2 Answers

0

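Is there a way to make pytesser not return any non-ASCII characters?

You can use the option "tessedit_char_whitelist" to restrict the characters Tesseract will recognize.

For example: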

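import string
# Allow only digits, ASCII letters, and the characters "_", "-" and "."
char_whitelist = string.digits
char_whitelist += string.ascii_lowercase
char_whitelist += string.ascii_uppercase
image = Image.open(ImagePath)
# config is passed through to the tesseract command line
# (supported by pytesseract's image_to_string)
text = image_to_string(image,
    config="-c tessedit_char_whitelist=%s_-." % char_whitelist)
print text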

See: https://github.com/tesseract-ocr/tesseract/wiki/FAQ-Old#how-do-i-recognize-only-digits
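
If the whitelist is not an option, the second part of the question (testing for non-ASCII characters without crashing) can be handled in plain Python. Below is a minimal sketch, assuming the OCR result is either a byte string or a unicode string. The helper names is_ascii and strip_non_ascii are hypothetical and are not part of pytesser or Tesseract. The code is written to run on both Python 2.7 and Python 3.

```python
def is_ascii(text):
    """Return True if every character (or byte) in text is in the ASCII range."""
    if isinstance(text, bytes):
        # bytearray yields integers on both Python 2 and Python 3
        return all(b < 128 for b in bytearray(text))
    return all(ord(ch) < 128 for ch in text)


def strip_non_ascii(text):
    """Return text with every non-ASCII character (or byte) dropped."""
    if isinstance(text, bytes):
        return bytes(bytearray(b for b in bytearray(text) if b < 128))
    return u"".join(ch for ch in text if ord(ch) < 128)


sample = u"caf\u00e9 42"
print(is_ascii(sample))         # False
print(strip_non_ascii(sample))  # caf 42
```

With these helpers you could either skip a line when is_ascii(text) is False, or print strip_non_ascii(text) instead of the raw OCR result.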

- Giovanni Cappellotto

- Fabio Menegazzo · Accepted Answer

I would go with Unidecode. This library converts non-ASCII characters to their closest ASCII representation.

from unidecode import unidecode
image = Image.open(ImagePath)
text = image_to_string(image)
# unidecode expects a unicode object; decode first if text is a UTF-8 byte string
print unidecode(text)

It should work perfectly!
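
If installing Unidecode is not possible, a rough standard-library approximation is to decompose accented characters and drop whatever is still not ASCII. This is only a sketch, assuming the text is already a unicode string; to_ascii is a hypothetical helper name. Unlike Unidecode, it does not transliterate: characters without a decomposition (for example "ß" or CJK characters) are simply dropped.

```python
import unicodedata


def to_ascii(text):
    """Approximate a unicode string with plain ASCII characters."""
    # NFKD splits accented letters into a base letter plus combining marks
    decomposed = unicodedata.normalize("NFKD", text)
    # 'ignore' drops the combining marks and anything else outside ASCII
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(to_ascii(u"Caf\u00e9 na\u00efve"))  # Cafe naive
```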