How does the spaCy lemmatizer work?

Question

How does the spaCy lemmatizer work?

15

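For lemmatization, spaCy has lists of words (adjectives, adverbs, verbs...) and also lists of exceptions (adverbs_irreg...). For the regular words there is a set of rules.

Let's take the word "wider" as an example.

Since it is an adjective, the lemmatization rule should be taken from this list: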

ADJECTIVE_RULES = [
    ["er", ""],
    ["est", ""],
    ["er", "e"],
    ["est", "e"]
]

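As I understand it, the process goes like this:

1) Get the POS tag of the word to know whether it is a noun, a verb, etc.
2) If the word is in the list of irregular cases, it is replaced directly; otherwise one of the rules is applied.

Now, how is it decided to use "er" -> "e" instead of "er" -> "", so that we get "wide" and not "wid"?

It can be tested here.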

- Luis Ramon Ramirez Rodriguez

3 Answers

9

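TL;DR: spaCy checks whether the lemma it is trying to generate is in the list of known words or exceptions for that part of speech.

Long answer:

Check out the lemmatizer.py file, specifically the lemmatize function at the bottom.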

def lemmatize(string, index, exceptions, rules):
    string = string.lower()
    forms = []
    forms.extend(exceptions.get(string, []))
    oov_forms = []
    for old, new in rules:
        if string.endswith(old):
            form = string[:len(string) - len(old)] + new
            if not form:
                pass
            elif form in index or not form.isalpha():
                forms.append(form)
            else:
                oov_forms.append(form)
    if not forms:
        forms.extend(oov_forms)
    if not forms:
        forms.append(string)
    return set(forms)

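For English adjectives, for instance, it takes in the string we are evaluating, the index of known adjectives, the exceptions, and the rules (the ones you quoted) from this directory (for the English model).

The first thing lemmatize does after lower-casing the string is check whether the string is in our list of known exceptions, which includes lemma mappings for words like "worse" -> "bad".

Then we go through our rules and apply each one to the string if it is applicable. For the word wider, we would try the following rules (only the two "er" rules actually match):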

["er", ""],
["est", ""],
["er", "e"],
["est", "e"]

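and we would produce the following forms: ["wid", "wide"].

Then we check whether each form is in our index of known adjectives. If it is, we append it to forms. Otherwise, we add it to oov_forms, which I'm guessing is short for "out of vocabulary". wide is in the index, so it gets added to forms. wid gets added to oov_forms.

Lastly, we return a set containing either the lemmas found, or any candidates that matched a rule but were not in our index, or just the word itself.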
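
To see the choice between "wid" and "wide" concretely, the function quoted above can be run against a tiny hand-made index. This is only a sketch: the one-word index and the single exception entry are toy stand-ins I made up, not spaCy's real WordNet-derived tables.

```python
# Toy stand-ins for spaCy's tables (assumptions for illustration only).
ADJECTIVE_RULES = [["er", ""], ["est", ""], ["er", "e"], ["est", "e"]]
TOY_INDEX = {"wide"}                 # pretend set of known adjectives
TOY_EXCEPTIONS = {"worse": ["bad"]}  # irregular form -> list of lemmas


def lemmatize(string, index, exceptions, rules):
    """Same logic as the function quoted above."""
    string = string.lower()
    forms = []
    forms.extend(exceptions.get(string, []))
    oov_forms = []
    for old, new in rules:
        if string.endswith(old):
            form = string[:len(string) - len(old)] + new
            if not form:
                pass
            elif form in index or not form.isalpha():
                forms.append(form)      # candidate is a known word
            else:
                oov_forms.append(form)  # candidate is out of vocabulary
    if not forms:
        forms.extend(oov_forms)
    if not forms:
        forms.append(string)
    return set(forms)


print(lemmatize("wider", TOY_INDEX, TOY_EXCEPTIONS, ADJECTIVE_RULES))  # {'wide'}
print(lemmatize("worse", TOY_INDEX, TOY_EXCEPTIONS, ADJECTIVE_RULES))  # {'bad'}
```

Both "wid" and "wide" are generated, but only "wide" is in the index, so "wid" stays in oov_forms and is dropped once forms is non-empty.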

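The word-lemmatize link you posted above works for wider because wide is in the word index. Try something like He is blandier than I. spaCy will mark blandier (a word I made up) as an adjective, but it is not in the index, so it will just return blandier as the lemma.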

- Amrit Saini

4

Each part of speech (adjective, noun, verb, adverb) has a set of rules and a set of known words. The mapping between them happens here:

INDEX = {
    "adj": ADJECTIVES,
    "adv": ADVERBS,
    "noun": NOUNS,
    "verb": VERBS
}


EXC = {
    "adj": ADJECTIVES_IRREG,
    "adv": ADVERBS_IRREG,
    "noun": NOUNS_IRREG,
    "verb": VERBS_IRREG
}


RULES = {
    "adj": ADJECTIVE_RULES,
    "noun": NOUN_RULES,
    "verb": VERB_RULES,
    "punct": PUNCT_RULES
}

Then on this line in lemmatizer.py the correct index, rules and exc (I believe exc stands for exceptions, e.g. irregular forms) get loaded:

lemmas = lemmatize(string, self.index.get(univ_pos, {}),
                   self.exc.get(univ_pos, {}),
                   self.rules.get(univ_pos, []))
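
The `.get(univ_pos, ...)` calls mean that a part of speech with no table simply receives an empty one. A rough sketch with made-up miniature tables (the contents are placeholders, not spaCy's data; `tables_for` is a hypothetical helper name):

```python
# Miniature placeholder tables; the real ones are far larger.
INDEX = {"adj": {"wide"}, "noun": {"cat"}}
EXC = {"adj": {"worse": ["bad"]}}
RULES = {"adj": [["er", ""], ["er", "e"]], "noun": [["s", ""]]}


def tables_for(univ_pos):
    """Pick the index, exceptions and rules for one part of speech.

    Mirrors the three .get() calls quoted above: an unknown POS falls
    back to an empty index, empty exceptions and an empty rule list.
    """
    return (INDEX.get(univ_pos, {}),
            EXC.get(univ_pos, {}),
            RULES.get(univ_pos, []))


print(tables_for("adj"))
print(tables_for("adv"))  # ({}, {}, [])
```

Note that the RULES mapping quoted above has no "adv" entry, so adverbs get an empty rule list and can only be resolved through the exceptions table.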

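All the remaining logic is in the function lemmatize and is surprisingly short. We perform the following operations:

1. If there is an exception (i.e. the word is irregular) that includes the provided string, use it and add it to the lemmatized forms.
2. For each rule, in the order given for the selected word type, check whether it matches the given word. If it does, try to apply it.

2a. If the word is in the list of known words (i.e. the index) after applying the rule, add it to the lemmatized forms of the word.

2b. Otherwise, add the word to a separate list called oov_forms (here I believe oov stands for "out of vocabulary").
3. If we found at least one form using the rules above, return the list of forms found; otherwise return the oov_forms list.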

- Ivaylo Strandjev


- alvas · Accepted Answer

Let's start with the class definition: https://github.com/explosion/spaCy/blob/develop/spacy/lemmatizer.py

Class

It starts off by initializing 3 variables:

class Lemmatizer(object):
    @classmethod
    def load(cls, path, index=None, exc=None, rules=None):
        return cls(index or {}, exc or {}, rules or {})

    def __init__(self, index, exceptions, rules):
        self.index = index
        self.exc = exceptions
        self.rules = rules
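
As quoted, `load` does not use its `path` argument; it only forwards the three tables, replacing any missing one with an empty dict. A small sketch of constructing the class by hand (the table contents are invented for illustration):

```python
class Lemmatizer(object):
    """Copy of the class skeleton quoted above."""

    @classmethod
    def load(cls, path, index=None, exc=None, rules=None):
        # `path` is accepted but not used in the quoted version.
        return cls(index or {}, exc or {}, rules or {})

    def __init__(self, index, exceptions, rules):
        self.index = index
        self.exc = exceptions
        self.rules = rules


# Invented toy tables, keyed by part of speech.
lem = Lemmatizer.load(None,
                      index={"adj": {"wide"}},
                      rules={"adj": [["er", "e"]]})
print(lem.index)  # {'adj': {'wide'}}
print(lem.exc)    # {} because no exceptions were passed
```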

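Now, looking at self.exc for English, we see that it points to https://github.com/explosion/spaCy/tree/develop/spacy/lang/en/lemmatizer/init.py, which loads the files from the directory https://github.com/explosion/spaCy/tree/master/spacy/en/lemmatizer.

Why doesn't spaCy just read a file?

Most probably because declaring the strings in code is faster than streaming them through I/O.

Where do these index, exceptions and rules come from?

Looking at them closely, they all seem to come from the original Princeton WordNet https://wordnet.princeton.edu/man/wndb.5WN.html.

Rules

Looking even closer, the rules at https://github.com/explosion/spaCy/tree/develop/spacy/lang/en/lemmatizer/_lemma_rules.py are similar to the _morphy rules in nltk https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L1749.

These rules originally come from the Morphy software https://wordnet.princeton.edu/man/morphy.7WN.html.

Additionally, spaCy includes some punctuation rules that are not from Princeton Morphy: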

PUNCT_RULES = [
    ["“", "\""],
    ["”", "\""],
    ["\u2018", "'"],
    ["\u2019", "'"]
]
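
These rules go through the same suffix-replacement loop as the others. Because the result is not alphabetic, the `not form.isalpha()` branch in `lemmatize` accepts it without needing an index entry. A minimal sketch of just that step (the helper name `apply_rules` is mine, not spaCy's):

```python
PUNCT_RULES = [
    ["\u201c", "\""],  # left double quotation mark  -> straight double quote
    ["\u201d", "\""],  # right double quotation mark -> straight double quote
    ["\u2018", "'"],   # left single quotation mark  -> apostrophe
    ["\u2019", "'"],   # right single quotation mark -> apostrophe
]


def apply_rules(string, rules, index=frozenset()):
    """Return rewritten forms that are in the index or are non-alphabetic."""
    forms = []
    for old, new in rules:
        if string.endswith(old):
            form = string[:len(string) - len(old)] + new
            if form and (form in index or not form.isalpha()):
                forms.append(form)
    return forms


print(apply_rules("\u201c", PUNCT_RULES))  # ['"']
print(apply_rules("\u2019", PUNCT_RULES))  # ["'"]
```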

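Exceptions

As for the exceptions, they are stored in the *_irreg.py files in spaCy, and they look like they also come from the Princeton WordNet.

This is evident if we look at a mirror of the original WordNet .exc (exclusion) files (e.g. https://github.com/extjwnl/extjwnl-data-wn21/blob/master/src/main/resources/net/sf/extjwnl/data/wordnet/wn21/adj.exc). If you download the wordnet package from nltk, you can see that it is the same list: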

alvas@ubi:~/nltk_data/corpora/wordnet$ ls
adj.exc       cntlist.rev  data.noun  index.adv    index.verb  noun.exc
adv.exc       data.adj     data.verb  index.noun   lexnames    README
citation.bib  data.adv     index.adj  index.sense  LICENSE     verb.exc
alvas@ubi:~/nltk_data/corpora/wordnet$ wc -l adj.exc 
1490 adj.exc
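
Each line of a WordNet `.exc` file lists an inflected form followed by one or more base forms, separated by spaces. A hedged sketch of turning such lines into the kind of dict the exceptions table uses (the helper name `parse_exc` and the three sample lines are mine, chosen for illustration; on a real install you would read the lines from `adj.exc` instead):

```python
def parse_exc(lines):
    """Map each inflected form to its list of base forms."""
    table = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        inflected, bases = fields[0], fields[1:]
        table.setdefault(inflected, []).extend(bases)
    return table


sample = ["worse bad", "worst bad", "better good"]  # illustrative lines
print(parse_exc(sample))
# {'worse': ['bad'], 'worst': ['bad'], 'better': ['good']}
```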

Index

If we look at the spaCy lemmatizer's index, we see that it also comes from WordNet, e.g. https://github.com/explosion/spaCy/tree/develop/spacy/lang/en/lemmatizer/_adjectives.py and the redistributed copy of wordnet in nltk:

alvas@ubi:~/nltk_data/corpora/wordnet$ head -n40 data.adj 

  1 This software and database is being provided to you, the LICENSEE, by  
  2 Princeton University under the following license.  By obtaining, using  
  3 and/or copying this software and database, you agree that you have  
  4 read, understood, and will comply with these terms and conditions.:  
  5   
  6 Permission to use, copy, modify and distribute this software and  
  7 database and its documentation for any purpose and without fee or  
  8 royalty is hereby granted, provided that you agree to comply with  
  9 the following copyright notice and statements, including the disclaimer,  
  10 and that the same appear on ALL copies of the software, database and  
  11 documentation, including modifications that you make for internal  
  12 use or for distribution.  
  13   
  14 WordNet 3.0 Copyright 2006 by Princeton University.  All rights reserved.  
  15   
  16 THIS SOFTWARE AND DATABASE IS PROVIDED "AS IS" AND PRINCETON  
  17 UNIVERSITY MAKES NO REPRESENTATIONS OR WARRANTIES, EXPRESS OR  
  18 IMPLIED.  BY WAY OF EXAMPLE, BUT NOT LIMITATION, PRINCETON  
  19 UNIVERSITY MAKES NO REPRESENTATIONS OR WARRANTIES OF MERCHANT-  
  20 ABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE OR THAT THE USE  
  21 OF THE LICENSED SOFTWARE, DATABASE OR DOCUMENTATION WILL NOT  
  22 INFRINGE ANY THIRD PARTY PATENTS, COPYRIGHTS, TRADEMARKS OR  
  23 OTHER RIGHTS.  
  24   
  25 The name of Princeton University or Princeton may not be used in  
  26 advertising or publicity pertaining to distribution of the software  
  27 and/or database.  Title to copyright in this software, database and  
  28 any associated documentation shall at all times remain with  
  29 Princeton University and LICENSEE agrees to preserve same.  
00001740 00 a 01 able 0 005 = 05200169 n 0000 = 05616246 n 0000 + 05616246 n 0101 + 05200169 n 0101 ! 00002098 a 0101 | (usually followed by `to') having the necessary means or skill or know-how or authority to do something; "able to swim"; "she was able to program her computer"; "we were at last able to buy a car"; "able to get a grant for the project"  
00002098 00 a 01 unable 0 002 = 05200169 n 0000 ! 00001740 a 0101 | (usually followed by `to') not having the necessary means or skill or know-how; "unable to get to town without a car"; "unable to obtain funds"  
00002312 00 a 02 abaxial 0 dorsal 4 002 ;c 06037666 n 0000 ! 00002527 a 0101 | facing away from the axis of an organ or organism; "the abaxial surface of a leaf is the underside or side facing away from the stem"  
00002527 00 a 02 adaxial 0 ventral 4 002 ;c 06037666 n 0000 ! 00002312 a 0101 | nearest to or facing toward the axis of an organ or organism; "the upper side of a leaf is known as the adaxial surface"  
00002730 00 a 01 acroscopic 0 002 ;c 06066555 n 0000 ! 00002843 a 0101 | facing or on the side toward the apex  
00002843 00 a 01 basiscopic 0 002 ;c 06066555 n 0000 ! 00002730 a 0101 | facing or on the side toward the base  
00002956 00 a 02 abducent 0 abducting 0 002 ;c 06080522 n 0000 ! 00003131 a 0101 | especially of muscles; drawing away from the midline of the body or from an adjacent part  
00003131 00 a 03 adducent 0 adductive 0 adducting 0 003 ;c 06080522 n 0000 + 01449236 v 0201 ! 00002956 a 0101 | especially of muscles; bringing together or drawing toward the midline of the body or toward an adjacent part  
00003356 00 a 01 nascent 0 005 + 07320302 n 0103 ! 00003939 a 0101 & 00003553 a 0000 & 00003700 a 0000 & 00003829 a 0000 |  being born or beginning; "the nascent chicks"; "a nascent insurgency"   
00003553 00 s 02 emergent 0 emerging 0 003 & 00003356 a 0000 + 02625016 v 0102 + 00050693 n 0101 | coming into existence; "an emergent republic"  
00003700 00 s 01 dissilient 0 002 & 00003356 a 0000 + 07434782 n 0101 | bursting open with force, as do some ripe seed vessels
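
According to the wndb format documentation linked earlier, each data line starts with the synset offset, lexicographer file number, synset type and a two-digit hexadecimal word count, followed by word/lex_id pairs. A sketch of pulling the words out of such a line, which shows how an index of known adjectives could be derived from `data.adj` (the function name is mine, and I am not claiming this is how spaCy generated its file):

```python
def words_from_data_line(line):
    """Extract the word strings from one WordNet data.* line."""
    if line.startswith("  "):
        return []  # license header lines begin with two spaces
    fields = line.split()
    if len(fields) < 4:
        return []
    w_cnt = int(fields[3], 16)  # word count is two-digit hexadecimal
    # Words sit at every second field after the count; lex_ids are in between.
    words = fields[4:4 + 2 * w_cnt:2]
    # Adjectives may carry a syntactic marker such as "(p)"; drop it.
    return [w.split("(")[0] for w in words]


line = ("00002312 00 a 02 abaxial 0 dorsal 4 002 ;c 06037666 n 0000 "
        "! 00002527 a 0101 | facing away from the axis")
print(words_from_data_line(line))  # ['abaxial', 'dorsal']
```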

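Given that the dictionary, exceptions and rules that the spaCy lemmatizer uses come largely from Princeton WordNet and its Morphy software, we can move on to the actual implementation of how spaCy applies the rules using the index and exceptions.

We go back to https://github.com/explosion/spaCy/blob/develop/spacy/lemmatizer.py.

The main action comes from the function rather than the Lemmatizer class: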

def lemmatize(string, index, exceptions, rules):
    string = string.lower()
    forms = []
    # TODO: Is this correct? See discussion in Issue #435.
    #if string in index:
    #    forms.append(string)
    forms.extend(exceptions.get(string, []))
    oov_forms = []
    for old, new in rules:
        if string.endswith(old):
            form = string[:len(string) - len(old)] + new
            if not form:
                pass
            elif form in index or not form.isalpha():
                forms.append(form)
            else:
                oov_forms.append(form)
    if not forms:
        forms.extend(oov_forms)
    if not forms:
        forms.append(string)
    return set(forms)

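Why is the `lemmatize` method outside of the `Lemmatizer` class?

I'm not exactly sure, but perhaps it is to ensure that the lemmatization function can be called outside of a class instance. Given that @staticmethod and @classmethod exist, though, there may be other considerations as to why the function and the class have been decoupled.

Morphy vs Spacy

Comparing spaCy's lemmatize() function against the morphy() function in nltk (which originally comes from http://blog.osteele.com/2004/04/pywordnet-20/, created more than a decade ago), the main steps in Oliver Steele's Python port of the WordNet morphy are: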

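1. Check the exception lists
2. Apply the rules once to the input to get y1, y2, y3, etc.
3. Return all that are in the database (and check the original too)
4. If there are no matches, keep applying rules until a match is found
5. Return an empty list if nothing can be found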
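
The five steps above can be sketched as follows. This is a simplified walk-through written for this answer, not NLTK's actual code; the function name, the argument layout and the toy data are assumptions, and the loop relies on every rule making the word shorter so that it terminates.

```python
def morphy_sketch(form, known, exceptions, rules):
    """Simplified walk-through of the five steps above (not NLTK's code)."""

    def apply_rules(forms):
        return [f[:len(f) - len(old)] + new
                for f in forms
                for old, new in rules
                if f.endswith(old)]

    def filter_forms(forms):
        seen, kept = set(), []
        for f in forms:
            if f in known and f not in seen:
                seen.add(f)
                kept.append(f)
        return kept

    # 1. Check the exception list.
    if form in exceptions:
        return filter_forms([form] + exceptions[form])
    # 2. Apply the rules once.
    forms = apply_rules([form])
    # 3. Return everything that is known, checking the original too.
    results = filter_forms([form] + forms)
    if results:
        return results
    # 4. Otherwise keep applying rules until something matches.
    while forms:
        forms = apply_rules(forms)
        results = filter_forms(forms)
        if results:
            return results
    # 5. Nothing found.
    return []


rules = [["er", ""], ["est", ""], ["er", "e"], ["est", "e"]]
print(morphy_sketch("wider", {"wide"}, {}, rules))     # ['wide']
print(morphy_sketch("blandier", {"wide"}, {}, rules))  # []
```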

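For spaCy, it is possibly still under development, given the TODO at line https://github.com/explosion/spaCy/blob/develop/spacy/lemmatizer.py#L76.

But the general process seems to be:

1. Look for the exceptions; if the word is in the exception list, take its lemma from there.
2. Apply the rules
3. Save the results that are in the index lists
4. If steps 1-3 yield no lemma, just keep track of the out-of-vocabulary (OOV) words and also append the original string to the lemma forms
5. Return the lemma forms

In terms of OOV handling, spaCy returns the original string if no lemmatized form is found. In that respect, the nltk implementation of morphy does the same, e.g.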

>>> from nltk.stem import WordNetLemmatizer
>>> wnl = WordNetLemmatizer()
>>> wnl.lemmatize('alvations')
'alvations'

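Checking for the infinitive before lemmatization

Possibly another point of difference is how morphy and spaCy decide which POS to assign to the word. In that respect, spaCy includes some linguistic rules to decide whether a word is already the base form, and it skips lemmatization entirely if the word is already in the infinitive form (is_base_form()). This saves quite a bit of time if lemmatization is to be done for all words in a corpus and a sizeable share of them are infinitives (already the lemma form).

That is possible in spaCy because it allows the lemmatizer to access the POS, which is tied closely to some morphological rules. For morphy, although some morphology can be worked out from the fine-grained PTB POS tags, it still takes some effort to sort them out to know which forms are infinitive.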
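
The idea of this short-circuit can be sketched as follows. It is a simplified illustration, not spaCy's actual is_base_form(): the feature names `number` and `verbform`, their values, and both function names are assumptions made for the example.

```python
def is_base_form_sketch(univ_pos, morphology=None):
    """Guess whether a token is already a lemma from its POS and features."""
    morphology = morphology or {}
    if univ_pos == "noun" and morphology.get("number") == "sing":
        return True   # singular noun: nothing to strip
    if univ_pos == "verb" and morphology.get("verbform") == "inf":
        return True   # infinitive verb: already the base form
    return False


def lemmatize_with_shortcut(string, univ_pos, morphology, slow_lemmatize):
    """Skip the rule machinery when the word is already a base form."""
    if is_base_form_sketch(univ_pos, morphology):
        return {string.lower()}
    return slow_lemmatize(string)


# The lambda stands in for the full exceptions/rules/index pipeline.
print(lemmatize_with_shortcut("run", "verb", {"verbform": "inf"},
                              lambda s: {"<rules ran>"}))  # {'run'}
```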

Generally, 3 primary signals of morphological features need to be teased out from the POS tag:

person
number
gender

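Updated

spaCy made changes to its lemmatizer after the initial answer (12 May 17). I think the purpose was to make lemmatization faster by avoiding look-ups and rule processing.

So they pre-lemmatize words and keep them in a lookup hash table, which makes retrieval O(1) for the words they have pre-lemmatized https://github.com/explosion/spaCy/blob/master/spacy/lang/en/lemmatizer/lookup.py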
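
A lookup-based lemmatizer of this kind reduces to a dictionary access. A minimal sketch, assuming a tiny invented table and assuming that unknown words fall back to the string itself (both are my assumptions, not taken from lookup.py):

```python
# Hypothetical miniature lookup table: surface form -> precomputed lemma.
LOOKUP = {"wider": "wide", "widest": "wide", "worse": "bad"}


def lookup_lemma(string, table=LOOKUP):
    """Return the precomputed lemma, or the string itself if absent."""
    return table.get(string, string)


print(lookup_lemma("wider"))     # wide
print(lookup_lemma("blandier"))  # blandier
```

The trade-off is memory for speed: no rules are applied at run time, but only forms that were pre-lemmatized can be resolved.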

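Also, in an effort to unify the lemmatizers across languages, the lemmatizer is now located at https://github.com/explosion/spaCy/blob/develop/spacy/lemmatizer.py#L92

But the underlying lemmatization steps discussed above are still relevant to the current spaCy version (4d2d7d586608ddc0bcb2857fb3c2d0d4c151ebfc)

Epilogue

I guess now that we know how it works with linguistic rules and all, the other question is "are there any non-rule-based methods for lemmatization?"

But before even answering that question, "What exactly is a lemma?" might be the better question to ask.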
