Tokenizing unsplit words from OCR using NLTK

Question

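I'm using NLTK to process some text extracted from PDF files. I can recover the text mostly intact, but in many places the space between two words was not captured, so I get tokens like ifI instead of if I, thatposition instead of that position, or andhe's instead of and he's.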

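My question is: how can I use NLTK to find words it does not recognize (or has not learned) and check whether a "nearby" combination of words is more likely? Is there a more elegant way to implement this kind of check than walking through each unrecognized word character by character, splitting it, and seeing whether the split yields two recognizable words?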
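For reference, the character-by-character baseline described above can be sketched roughly as follows. This is only an illustration: `split_joined_word` and `VOCABULARY` are hypothetical names, and the tiny hand-written vocabulary stands in for a real word list (for example, one built from `nltk.corpus.words.words()`).

```python
def split_joined_word(word, vocabulary):
    """Return every way to split `word` into two known words.

    The lookup is case-insensitive, but the returned pieces keep
    the original casing of `word`.
    """
    lowered = word.lower()
    splits = []
    # Try every split position between the first and last character.
    for i in range(1, len(word)):
        left, right = lowered[:i], lowered[i:]
        if left in vocabulary and right in vocabulary:
            splits.append((word[:i], word[i:]))
    return splits


# Hypothetical toy vocabulary (an assumption for this sketch);
# a real word list would be far larger.
VOCABULARY = {"if", "i", "that", "position", "and", "he's"}

for token in ["ifI", "thatposition", "andhe's"]:
    print(token, "->", split_joined_word(token, VOCABULARY))
```

With a large vocabulary a token may split in several valid ways, so a baseline like this still needs some way to rank the candidates, which is part of why it feels inelegant.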

- charlesreid1

1 Answer

- Justin O Barber · Accepted Answer

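I would suggest that you consider using pyenchant instead, since it is a more robust solution for this sort of problem. Once you have downloaded and installed pyenchant, here is an example of how to get the result: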

>>> text = "IfI am inthat position, Idon't think I will."  # note the lack of spaces
>>> from enchant.checker import SpellChecker
>>> checker = SpellChecker("en_US")
>>> checker.set_text(text)
>>> for error in checker:
...     for suggestion in error.suggest():
...         # accept a suggestion only if it has exactly the same characters
...         # as the error, in the same order, ignoring spaces
...         if error.word.replace(' ', '') == suggestion.replace(' ', ''):
...             error.replace(suggestion)
...             break
...
>>> checker.get_text()  # text is now fixed
"If I am in that position, I don't think I will."