Tesseract OCR ignores "-"

Question

ios objective-c tesseract

3

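In my app, I am reading text from an image that contains digits and letters separated by -.

For example: 1-TT88TY5-AD5G

However, Tesseract ignores the - and gives me 1TT88TY5AD5G.

How can I force it to read the hyphens as well?

Here is my initialization code: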

Tesseract* tesseract = [[Tesseract alloc] initWithDataPath:@"tessdata" language:@"eng"];
[tesseract setVariableValue:@"0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ" forKey:@"tessedit_char_whitelist"];

- Shradha

3

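You just told Tesseract to accept only English letters and decimal digits, and nothing else. - user529758

Yes, I understand that. But even after adding the minus sign to the variable value, it still doesn't work. - Shradha

2 Answers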

0

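Tesseract cannot reliably recognize exactly what you want. You have to test Tesseract many times and then apply some pattern-matching approach based on how it performs.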
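One possible pattern-based fix, as a rough sketch only: if every code is known to have the same layout, the hyphens can be put back after recognition. The function name `restoreHyphens` and the 1-7-4 grouping (taken from the single example 1-TT88TY5-AD5G) are assumptions; the question does not say that all codes share that shape. The code is plain C++, so it could live in an Objective-C++ (.mm) file.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Re-inserts '-' separators into OCR output that lost them.
// groupLengths is a hypothetical description of the expected layout,
// e.g. {1, 7, 4} for codes shaped like 1-TT88TY5-AD5G.
// Returns the input unchanged if its length does not match the layout.
std::string restoreHyphens(const std::string& raw,
                           const std::vector<std::size_t>& groupLengths) {
    std::size_t expected = 0;
    for (std::size_t len : groupLengths) expected += len;
    if (raw.size() != expected) return raw;  // unexpected shape: do not guess

    std::string out;
    std::size_t pos = 0;
    for (std::size_t i = 0; i < groupLengths.size(); ++i) {
        if (i > 0) out += '-';               // separator between groups
        out.append(raw, pos, groupLengths[i]);
        pos += groupLengths[i];
    }
    return out;
}
```

Called as `restoreHyphens("1TT88TY5AD5G", {1, 7, 4})`, this rebuilds the hyphenated form. It only helps when the layout really is fixed; for variable layouts the whitelist has to be made to work instead.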

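Check what it returns in place of the -. It is best to replace whatever Tesseract returns there with "-".

In your case the - is being replaced by ., which does not look right, because your whitelist string does not contain any . character.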

[tesseract setVariableValue:@"-0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ" forKey:@"tessedit_char_whitelist"];
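The replacement idea above could be sketched roughly like this. `normalizeSeparators` is a hypothetical helper, and which characters Tesseract actually emits in place of the hyphen is something you would have to find out by testing; the "." used in the usage note is only the case this answer mentions.

```cpp
#include <algorithm>
#include <string>

// Replaces every character listed in `substitutes` with '-'.
// `substitutes` holds the characters observed where a hyphen was expected.
std::string normalizeSeparators(std::string text, const std::string& substitutes) {
    std::replace_if(
        text.begin(), text.end(),
        [&substitutes](char c) { return substitutes.find(c) != std::string::npos; },
        '-');
    return text;
}
```

For example, `normalizeSeparators("1.TT88TY5.AD5G", ".")` turns the dots back into hyphens. This is only safe if the substitute characters can never appear legitimately in the codes.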

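You can use the following methods to see how confident Tesseract is in what it recognized (averaged over the whole text, or per word):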

  /** Returns the (average) confidence value between 0 and 100. */
  int MeanTextConf();
  /**
   * Returns all word confidences (between 0 and 100) in an array, terminated
   * by -1.  The calling function must delete [] after use.
   * The number of confidences should correspond to the number of space-
   * delimited words in GetUTF8Text.
   */
  int* AllWordConfidences();
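As a small illustration of reading that array, here is a sketch that walks a -1-terminated list of confidences and reports which words fall below a cutoff. `lowConfidenceWords` and the threshold are hypothetical; the function deliberately takes a plain pointer so it does not depend on how the Tesseract API object is set up.

```cpp
#include <cstddef>
#include <vector>

// Returns the indices of words whose confidence is below `threshold`.
// `confidences` must point to an array terminated by -1, the layout
// described in the AllWordConfidences() comment above.
std::vector<std::size_t> lowConfidenceWords(const int* confidences, int threshold) {
    std::vector<std::size_t> flagged;
    if (confidences == nullptr) return flagged;
    for (std::size_t i = 0; confidences[i] != -1; ++i) {
        if (confidences[i] < threshold) flagged.push_back(i);
    }
    return flagged;
}
```

With the C++ API you would pass in the pointer returned by `AllWordConfidences()` and then release it with `delete []`, as the header comment requires. Words flagged this way are candidates for the manual correction described earlier.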

- Bhumeshwer katre

- James Webster · Accepted Answer

I'm only guessing, since I haven't used Tesseract, but shouldn't - be in the whitelist?

[tesseract setVariableValue:@"-0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ" forKey:@"tessedit_char_whitelist"];
                              ^