Spell Checker for Python

Question

python python-2.7 nltk spell-checking pyenchant

66

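I'm fairly new to Python and NLTK. I am working on an application that performs spell checking (replacing an incorrectly spelled word with the correct one). I'm currently using the Enchant library via PyEnchant, together with NLTK, on Python 2.7. The code below is the class that handles the correction/replacement.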

import enchant
from nltk.metrics import edit_distance

class SpellingReplacer:
    def __init__(self, dict_name='en_GB', max_dist=2):
        self.spell_dict = enchant.Dict(dict_name)
        # keep the caller's threshold instead of hard-coding 2
        self.max_dist = max_dist

    def replace(self, word):
        if self.spell_dict.check(word):
            return word
        suggestions = self.spell_dict.suggest(word)

        if suggestions and edit_distance(word, suggestions[0]) <= self.max_dist:
            return suggestions[0]
        else:
            return word

I have written a function that takes a list of words, runs replace() on each word, and returns the list of those words, spelled correctly.

def spell_check(word_list):
    # build the replacer once; loading the dictionary for every word is wasteful
    replacer = SpellingReplacer()
    checked_list = []
    for item in word_list:
        r = replacer.replace(item)
        checked_list.append(r)
    return checked_list

>>> word_list = ['car', 'colour']
>>> spell_check(word_list)
['car', 'colour']

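Now, I don't really like this because it isn't very accurate, and I'm looking for a better way to check and replace misspelled words. I also need something that can pick up spelling mistakes like "caaaar". Are there better ways to perform spell checking? If so, what are they? How does Google do it? Their spelling suggestions are very good.

Any suggestions?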

- Mike Barnes

12 Answers

39

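I recommend starting by carefully reading this post by Peter Norvig. (I had to do something similar and found it extremely useful.)

The following function, in particular, contains the ideas you now need to make your spell checker more sophisticated: splitting, deleting, transposing, replacing, and inserting characters in the irregular words to "correct" them.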

alphabet = 'abcdefghijklmnopqrstuvwxyz'

def edits1(word):
    # every way of cutting the word into a (prefix, suffix) pair
    splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes    = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b)>1]
    replaces   = [a + c + b[1:] for a, b in splits for c in alphabet if b]
    inserts    = [a + c + b     for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

Note: The above is one snippet from Norvig's spelling corrector.
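To show how a candidate generator like edits1 might be put to work, here is a minimal sketch in the spirit of Norvig's corrector. The tiny WORD_COUNTS table is a made-up stand-in for counts taken from a real corpus, and the helper names known and correction are chosen here for illustration. Treat it as a sketch under those assumptions, not a faithful copy of his code.

```python
from collections import Counter

ALPHABET = 'abcdefghijklmnopqrstuvwxyz'

# Toy frequency table (an assumption); a real corrector would count words in a large corpus.
WORD_COUNTS = Counter({'car': 50, 'care': 20, 'cat': 30, 'message': 40, 'the': 500})

def edits1(word):
    """All strings one delete, transpose, replace, or insert away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in ALPHABET]
    inserts = [a + c + b for a, b in splits for c in ALPHABET]
    return set(deletes + transposes + replaces + inserts)

def known(words):
    """Keep only the candidates that appear in the frequency table."""
    return {w for w in words if w in WORD_COUNTS}

def correction(word):
    """Prefer the word itself, then known words 1 edit away, then 2 edits away."""
    candidates = (known([word])
                  or known(edits1(word))
                  or known(e2 for e1 in edits1(word) for e2 in edits1(e1))
                  or {word})
    # Among equally distant candidates, pick the most frequent one.
    return max(candidates, key=lambda w: WORD_COUNTS[w])

print(correction('caar'))    # car
print(correction('hte'))     # the
print(correction('mesage'))  # message
```

With this two-edit limit, "caaaar" comes back unchanged, because it is three deletions away from "car". Heavily stretched words need some extra handling before this kind of lookup.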

The good news is that you can incrementally add to and keep improving your spell checker.

Hope that helps.

- Ram Narasimhan

13

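There is SymSpell, an open-source, language-independent, trainable spell checker that outperforms Norvig's approach and is available in multiple programming languages. - Renel Chesak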

30

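The best options for spell checking in Python are SymSpell, a BK-tree, or Peter Norvig's method.

Of these, SymSpell is the fastest.

Method 1: pyspellchecker (reference link)

This library is based on Peter Norvig's implementation.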

pip install pyspellchecker

from spellchecker import SpellChecker

spell = SpellChecker()

# find those words that may be misspelled
misspelled = spell.unknown(['something', 'is', 'hapenning', 'here'])

for word in misspelled:
    # Get the one `most likely` answer
    print(spell.correction(word))

    # Get a list of `likely` options
    print(spell.candidates(word))

Method 2: SymSpell Python

pip install -U symspellpy
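The answer gives no usage for this method, so here is a rough, standard-library-only sketch of the symmetric-delete idea that SymSpell is built around: index every dictionary word under its delete variants, generate the same variants for the input, and verify the matches with a real edit distance. This is not the symspellpy API; delete_variants, build_index, levenshtein, and lookup are names made up for illustration, and the vocabulary is a toy list.

```python
from itertools import combinations

def delete_variants(word, max_dist):
    """The word itself plus every string made by deleting up to max_dist characters."""
    variants = {word}
    for k in range(1, min(max_dist, len(word)) + 1):
        for positions in combinations(range(len(word)), k):
            dropped = set(positions)
            variants.add(''.join(ch for i, ch in enumerate(word) if i not in dropped))
    return variants

def build_index(vocabulary, max_dist=2):
    """Map each delete variant to the dictionary words that produce it."""
    index = {}
    for term in vocabulary:
        for key in delete_variants(term, max_dist):
            index.setdefault(key, set()).add(term)
    return index

def levenshtein(a, b):
    """Plain Levenshtein distance, computed one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def lookup(word, index, max_dist=2):
    """Return (distance, term) pairs within max_dist, closest first."""
    candidates = set()
    for key in delete_variants(word, max_dist):
        candidates |= index.get(key, set())
    # Sharing a delete variant only bounds the distance by 2 * max_dist, so verify.
    scored = ((levenshtein(word, term), term) for term in candidates)
    return sorted((d, t) for d, t in scored if d <= max_dist)

index = build_index(['car', 'care', 'cart', 'message', 'service'])
print(lookup('caar', index))    # [(1, 'car'), (2, 'care'), (2, 'cart')]
print(lookup('mesage', index))  # [(1, 'message')]
```

The point of the design is that lookups only ever generate deletes of the input, which is far fewer strings than the inserts and replaces a Norvig-style generator has to try. The cost is a larger precomputed index.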

- Shaurya Uppal

1

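At least for Python 3, the indexer has been deprecated, which causes problems with the current spell-checking module (pyspellchecker). - Justapigeon

pyspellchecker is very slow and strips punctuation (but it does run on Python 3.6). - duhaime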

9

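Maybe it is too late, but I am answering for future searches. To perform spelling correction, you first need to make sure the word is not an exaggerated or slang form with stretched-out letters, such as caaaar or amazzzing. English words usually have at most two of the same letter in a row, e.g. hello, so we first remove the extra repeated letters from the word and only then check its spelling. To remove the extra letters, you can use Python's regular expression module. Once that is done, use the pyspellchecker library to correct the spelling. For an implementation, see this link: https://rustyonrampage.github.io/text-mining/2017/11/28/spelling-correction-with-python-and-nltk.html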
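A minimal sketch of the letter-trimming step described above, using only the standard re module. The function name reduce_lengthening is chosen here for illustration; the output would still need to go through a spell checker afterwards.

```python
import re

# Matches any character that appears three or more times in a row.
_REPEATS = re.compile(r'(.)\1{2,}')

def reduce_lengthening(word):
    """Collapse runs of 3+ identical characters down to exactly two."""
    return _REPEATS.sub(r'\1\1', word)

print(reduce_lengthening('caaaar'))     # caar
print(reduce_lengthening('amazzzing'))  # amazzing
print(reduce_lengthening('hello'))      # hello
```

Note that "caaaar" becomes "caar", not "car". The trimmed word is now only one edit from the intended word, which puts it within reach of an ordinary edit-distance corrector.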

- Rishabh Sahrawat

1

Removing words that have more than 2 repeated letters is not a good idea. (Oh, I just misspelled "letters".) - Hamid Bazargani

10

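I didn't say to remove the whole word; I described removing the extra letters from the word. So "lettters" becomes "letters". Please read the answer again carefully. - Rishabh Sahrawat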

4

In the terminal:

pip install gingerit

In code:

from gingerit.gingerit import GingerIt
text = input("Enter text to be corrected")
result = GingerIt().parse(text)
corrections = result['corrections']
correctText = result['result']

print("Correct Text:",correctText)
print()
print("CORRECTIONS")
for d in corrections:
    print("________________")
    print("Previous:", d['text'])
    print("Correction:", d['correct'])
    print("Definition:", d['definition'])

- pouya barari

Link: https://pypi.org/project/gingerit/ (91 stars) - Att Righ

3

Try JamSpell - it works pretty well for automatic spelling correction:

import jamspell

corrector = jamspell.TSpellCorrector()
corrector.LoadLangModel('en.bin')

corrector.FixFragment('Some sentnec with error')
# u'Some sentence with error'

corrector.GetCandidates(['Some', 'sentnec', 'with', 'error'], 1)
# ('sentence', 'senate', 'scented', 'sentinel')

- Fippo

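Have you used it on a Windows machine? - Hunaidkhan

I ran into problems installing it on a Mac: https://github.com/bakwc/JamSpell/issues/73#issuecomment-1152979889 . It looks like swig has to be installed manually... I'd rather not do that. - Att Righ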

2

You can try the following:

pip install textblob

This installs textblob.

from textblob import TextBlob
txt="machne learnig"
b = TextBlob(txt)
print("after spell correction: "+str(b.correct()))

after spell correction: machine learning

- Mayur Patil

Link: https://pypi.org/project/textblob/ , with thousands of stars. Note that this is a general-purpose NLP library. - Att Righ

3

TextBlob("Felo world. Now are u doing today?")

This is not what I wanted.

- Att Righ

1

Install scuse using pip

from scuse import scuse

obj = scuse()

checkedspell = obj.wordf("spelling you want to check")

print(checkedspell)

- mrithul e

1

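Spell corrector ->

If your corpus is stored somewhere other than the desktop, change the path in the code. I have also added a small graphical interface using tkinter. This only handles non-word errors!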

from tkinter import Tk, Label, Entry, Button

# Word list with one word per line; change this to wherever your corpus is stored.
DICTIONARY_PATH = r"C:\Documents and Settings\Owner\Desktop\Dictionary.txt"


def min_edit_dist(word1, word2):
    len_1 = len(word1)
    len_2 = len(word2)
    # x[i][j] holds the edit distance between word1[:i] and word2[:j]
    x = [[0] * (len_2 + 1) for _ in range(len_1 + 1)]
    # base cases: converting to or from the empty string
    for i in range(len_1 + 1):
        x[i][0] = i
    for j in range(len_2 + 1):
        x[0][j] = j
    for i in range(1, len_1 + 1):
        for j in range(1, len_2 + 1):
            if word1[i - 1] == word2[j - 1]:
                x[i][j] = x[i - 1][j - 1]
            else:
                x[i][j] = min(x[i][j - 1], x[i - 1][j], x[i - 1][j - 1]) + 1
    # index explicitly so that empty inputs work too
    return x[len_1][len_2]


def retrieve_text():
    word1 = app_entry.get()
    with open(DICTIONARY_PATH, 'r') as ffile:
        # strip the newline so it is not counted as an extra character
        words = [line.strip() for line in ffile]
    print("Suggestions coming right up count till 10")
    for candidate in words:
        if min_edit_dist(word1, candidate) <= 2:
            print(candidate)


if __name__ == "__main__":
    app_win = Tk()
    app_win.title("spell")
    app_label = Label(app_win, text="Enter the incorrect word")
    app_label.pack()
    app_entry = Entry(app_win)
    app_entry.pack()
    app_button = Button(app_win, text="Get Suggestions", command=retrieve_text)
    app_button.pack()
    # Initialize GUI loop
    app_win.mainloop()

- ishaan arora

1

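Use from autocorrect import spell. You need to install this library first, preferably with Anaconda. It only checks single words, not sentences, which is a limitation you will run into.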

from autocorrect import spell
print(spell('intrerpreter'))
# output: interpreter

- Saurabh Tripathi

See the answer to this question: https://dev59.com/OmYr5IYBdhLWcg3wIG5u#48280566 - Att Righ


- Rakesh · Accepted Answer

59

You can use the autocorrect library to do spell checking in Python.
Example usage:

from autocorrect import Speller

spell = Speller(lang='en')

print(spell('caaaar'))
print(spell('mussage'))
print(spell('survice'))
print(spell('hte'))

Output:

caesar
message
service
the

- Rakesh

1

print(spell('Stanger things')) outputs Stenger things - Gagan

This doesn't seem to be Python 3 compatible? spell = Speller(lang='en') throws TypeError: the JSON object must be str, not 'bytes'. - duhaime

10

This library isn't trustworthy. Out of 100 relatively common words, 6 were autocorrected to a different word: sardine -> marine, stewardess -> stewards, snob -> snow, crutch -> clutch, pelt -> felt, toaster -> coaster. - fredcallaway
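For anyone who wants to run this kind of check themselves, here is a small sketch that counts how often a corrector changes words that are already spelled correctly. false_corrections is a hypothetical helper, and toy_corrector is a stand-in that only mimics two of the substitutions listed above so that the example runs without any library. Swap in the real library call you want to test.

```python
def false_corrections(corrector, known_good_words):
    """Return (word, output) pairs where a correctly spelled word was altered."""
    changed = []
    for word in known_good_words:
        output = corrector(word)
        if output != word:
            changed.append((word, output))
    return changed

def toy_corrector(word):
    """Stand-in for illustration only; replace with the library under test."""
    overrides = {'sardine': 'marine', 'snob': 'snow'}
    return overrides.get(word, word)

sample = ['sardine', 'snob', 'table', 'window']
mistakes = false_corrections(toy_corrector, sample)
print(mistakes)                                             # [('sardine', 'marine'), ('snob', 'snow')]
print('false-correction rate:', len(mistakes) / len(sample))  # 0.5
```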

Which is better, pyspellchecker or autocorrect? - Sunil Garg

1

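This is a rather poor result. For example, caaaar should be interpreted as car, by trimming the extra characters and rechecking. Mussage is phonetically closer to massage than to message, as other comments have suggested. - WASasquatch

Just commenting here: the corpus must be pretty small. It doesn't even recognize "afterwards", so it's not much use. - undefined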