Introduction to OCR

Question

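Someone gave me a treasure trove of amazing information: 200MB of scanned .tiff images of bulletins dating back to the 1940s. I want to digitize it, but I know nothing about OCR. Some of the earlier material is hard even for humans to read, let alone machines. And it is written in Hebrew.

I'm looking for advice on how to approach this: good suggestions for books, articles, code libraries, or software (all of which should be freely available online). I'm proficient in C++ and Python, and I can learn another language if needed.

Thanks.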

- CamelCamelCamel

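Do you need it to be searchable? If not, it may be best to leave it as is, since it could be more useful that way. I have never seen _excellent_ OCR for English (though some comes close); I would expect the error rate for scanned Hebrew to be higher. - Michael Todd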

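If it is hard for a human to read, there is little chance that a machine will be able to read it. - Matt Ball

The material from the early 40s through the 60s may not be machine-readable, but at least everything from the 70s to the present should be. - CamelCamelCamel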

1 Answer

- Matt Ball · Accepted Answer

This sounds like a good task for Python and an OCR library. A quick Google search turned up pytesser.

PyTesser is an Optical Character Recognition module for Python. It takes as input an image or image file and outputs a string.

PyTesser uses the Tesseract OCR engine, converting images to an accepted format and calling the Tesseract executable as an external script. A Windows executable is provided along with the Python scripts. The scripts should work in other operating systems as well.

...

Usage Example
>>> from pytesser import *
>>> image = Image.open('fnord.tif')  # Open image object using PIL
>>> print image_to_string(image)     # Run tesseract.exe on image
fnord
>>> print image_file_to_string('fnord.tif')
fnord