Let's start with the class definition: https://github.com/explosion/spaCy/blob/develop/spacy/lemmatizer.py

Class

It starts off by initializing 3 variables:
class Lemmatizer(object):
    @classmethod
    def load(cls, path, index=None, exc=None, rules=None):
        return cls(index or {}, exc or {}, rules or {})

    def __init__(self, index, exceptions, rules):
        self.index = index
        self.exc = exceptions
        self.rules = rules
Now, looking at the self.exc for English, we see that it points to https://github.com/explosion/spaCy/tree/develop/spacy/lang/en/lemmatizer/init.py, which loads the files from the directory https://github.com/explosion/spaCy/tree/master/spacy/en/lemmatizer.
Why doesn't spaCy just read the files directly? Most probably because declaring these strings in code is faster than streaming them through I/O.
Where do these index, exceptions and rules come from? Looking closely, they all seem to come from the original Princeton WordNet https://wordnet.princeton.edu/man/wndb.5WN.html.
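Concretely, the three structures can be pictured something like this. The entries below are invented for illustration, not spaCy's real tables, and the POS-keyed layout is my reading of the English data files rather than a guaranteed invariant:

```python
# Toy illustration of the shapes of the three structures the Lemmatizer
# is initialized with; in spaCy's English data each one appears to be
# keyed by a coarse POS such as 'adj', 'noun' or 'verb'.
index = {"adj": {"able", "unable"}}          # valid base forms
exceptions = {"adj": {"better": ["good"]}}   # irregular form -> lemma(s)
rules = {"adj": [["er", ""], ["est", ""]]}   # [old_suffix, new_suffix] pairs
```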
Rules
Looking even closer, the rules at https://github.com/explosion/spaCy/tree/develop/spacy/lang/en/lemmatizer/_lemma_rules.py are similar to the _morphy rules from nltk at https://github.com/nltk/nltk/blob/develop/nltk/corpus/reader/wordnet.py#L1749. These rules originally come from the Morphy software https://wordnet.princeton.edu/man/morphy.7WN.html. Additionally, spacy includes some punctuation rules that do not come from Princeton's Morphy:
PUNCT_RULES = [
    ["“", "\""],
    ["”", "\""],
    ["\u2018", "'"],
    ["\u2019", "'"]
]
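These [old, new] pairs work the same way as the suffix rules: a matching suffix is rewritten, here normalizing curly quotes to their ASCII equivalents. A minimal sketch (the helper name normalize_punct is mine, not spaCy's):

```python
# Applying [old, new] suffix rules to normalize curly quotes, in the
# same style as the lemma rules. normalize_punct is a hypothetical
# helper for illustration, not a spaCy function.
PUNCT_RULES = [
    ["“", "\""],
    ["”", "\""],
    ["\u2018", "'"],
    ["\u2019", "'"],
]

def normalize_punct(string):
    for old, new in PUNCT_RULES:
        if string.endswith(old):
            return string[:len(string) - len(old)] + new
    return string

print(normalize_punct("said”"))  # -> said"
```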
Exceptions
As for the exceptions, they are stored in the *_irreg.py files in spacy, and they also look like they come from the Princeton WordNet. If we look at a mirror of one of the original WordNet .exc (exclusion) files (e.g. https://github.com/extjwnl/extjwnl-data-wn21/blob/master/src/main/resources/net/sf/extjwnl/data/wordnet/wn21/adj.exc) and download the wordnet package from nltk, we can see that it is the same list:
alvas@ubi:~/nltk_data/corpora/wordnet$ ls
adj.exc cntlist.rev data.noun index.adv index.verb noun.exc
adv.exc data.adj data.verb index.noun lexnames README
citation.bib data.adv index.adj index.sense LICENSE verb.exc
alvas@ubi:~/nltk_data/corpora/wordnet$ wc -l adj.exc
1490 adj.exc
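Each line of a WordNet .exc file holds an inflected form followed by one or more base forms, so turning it into the exception dict is straightforward. A sketch with toy lines (read_exc is a hypothetical helper, not spaCy's or nltk's code):

```python
# Parse WordNet-style .exc lines ("inflected base1 [base2 ...]") into
# an exceptions dict. The sample lines are illustrative, not the
# actual contents of adj.exc.
def read_exc(lines):
    exc = {}
    for line in lines:
        parts = line.split()
        if parts:
            exc[parts[0]] = parts[1:]
    return exc

sample = ["best good", "better good", "worse bad"]
print(read_exc(sample)["best"])  # -> ['good']
```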
Index
If we look at the spacy lemmatizer's index, we see that it also comes from WordNet, e.g. https://github.com/explosion/spaCy/tree/develop/spacy/lang/en/lemmatizer/_adjectives.py, matching the redistributed copy of wordnet in nltk:
alvas@ubi:~/nltk_data/corpora/wordnet$ head -n40 data.adj
1 This software and database is being provided to you, the LICENSEE, by
2 Princeton University under the following license. By obtaining, using
3 and/or copying this software and database, you agree that you have
4 read, understood, and will comply with these terms and conditions.:
5
6 Permission to use, copy, modify and distribute this software and
7 database and its documentation for any purpose and without fee or
8 royalty is hereby granted, provided that you agree to comply with
9 the following copyright notice and statements, including the disclaimer,
10 and that the same appear on ALL copies of the software, database and
11 documentation, including modifications that you make for internal
12 use or for distribution.
13
14 WordNet 3.0 Copyright 2006 by Princeton University. All rights reserved.
15
16 THIS SOFTWARE AND DATABASE IS PROVIDED "AS IS" AND PRINCETON
17 UNIVERSITY MAKES NO REPRESENTATIONS OR WARRANTIES, EXPRESS OR
18 IMPLIED. BY WAY OF EXAMPLE, BUT NOT LIMITATION, PRINCETON
19 UNIVERSITY MAKES NO REPRESENTATIONS OR WARRANTIES OF MERCHANT-
20 ABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE OR THAT THE USE
21 OF THE LICENSED SOFTWARE, DATABASE OR DOCUMENTATION WILL NOT
22 INFRINGE ANY THIRD PARTY PATENTS, COPYRIGHTS, TRADEMARKS OR
23 OTHER RIGHTS.
24
25 The name of Princeton University or Princeton may not be used in
26 advertising or publicity pertaining to distribution of the software
27 and/or database. Title to copyright in this software, database and
28 any associated documentation shall at all times remain with
29 Princeton University and LICENSEE agrees to preserve same.
00001740 00 a 01 able 0 005 = 05200169 n 0000 = 05616246 n 0000 + 05616246 n 0101 + 05200169 n 0101 ! 00002098 a 0101 | (usually followed by `to') having the necessary means or skill or know-how or authority to do something; "able to swim"; "she was able to program her computer"; "we were at last able to buy a car"; "able to get a grant for the project"
00002098 00 a 01 unable 0 002 = 05200169 n 0000 ! 00001740 a 0101 | (usually followed by `to') not having the necessary means or skill or know-how; "unable to get to town without a car"; "unable to obtain funds"
00002312 00 a 02 abaxial 0 dorsal 4 002 ;c 06037666 n 0000 ! 00002527 a 0101 | facing away from the axis of an organ or organism; "the abaxial surface of a leaf is the underside or side facing away from the stem"
00002527 00 a 02 adaxial 0 ventral 4 002 ;c 06037666 n 0000 ! 00002312 a 0101 | nearest to or facing toward the axis of an organ or organism; "the upper side of a leaf is known as the adaxial surface"
00002730 00 a 01 acroscopic 0 002 ;c 06066555 n 0000 ! 00002843 a 0101 | facing or on the side toward the apex
00002843 00 a 01 basiscopic 0 002 ;c 06066555 n 0000 ! 00002730 a 0101 | facing or on the side toward the base
00002956 00 a 02 abducent 0 abducting 0 002 ;c 06080522 n 0000 ! 00003131 a 0101 | especially of muscles; drawing away from the midline of the body or from an adjacent part
00003131 00 a 03 adducent 0 adductive 0 adducting 0 003 ;c 06080522 n 0000 + 01449236 v 0201 ! 00002956 a 0101 | especially of muscles; bringing together or drawing toward the midline of the body or toward an adjacent part
00003356 00 a 01 nascent 0 005 + 07320302 n 0103 ! 00003939 a 0101 & 00003553 a 0000 & 00003700 a 0000 & 00003829 a 0000 | being born or beginning; "the nascent chicks"; "a nascent insurgency"
00003553 00 s 02 emergent 0 emerging 0 003 & 00003356 a 0000 + 02625016 v 0102 + 00050693 n 0101 | coming into existence; "an emergent republic"
00003700 00 s 01 dissilient 0 002 & 00003356 a 0000 + 07434782 n 0101 | bursting open with force, as do some ripe seed vessels
Based on the dictionary, exceptions and rules it uses, the spacy lemmatizer derives largely from the Princeton WordNet and their Morphy software. Next we can look at the actual implementation, i.e. how spacy applies the rules using the index and exceptions.
We go back to https://github.com/explosion/spaCy/blob/develop/spacy/lemmatizer.py. The main action comes from a standalone function rather than the Lemmatizer class:
def lemmatize(string, index, exceptions, rules):
    string = string.lower()
    forms = []
    forms.extend(exceptions.get(string, []))
    oov_forms = []
    for old, new in rules:
        if string.endswith(old):
            form = string[:len(string) - len(old)] + new
            if not form:
                pass
            elif form in index or not form.isalpha():
                forms.append(form)
            else:
                oov_forms.append(form)
    if not forms:
        forms.extend(oov_forms)
    if not forms:
        forms.append(string)
    return set(forms)
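To see this flow in action, here is the same function exercised with toy index/exceptions/rules (invented for illustration, not spaCy's English tables):

```python
# The lemmatize function from above, repeated so the example is
# self-contained, exercised against toy data.
def lemmatize(string, index, exceptions, rules):
    string = string.lower()
    forms = []
    forms.extend(exceptions.get(string, []))
    oov_forms = []
    for old, new in rules:
        if string.endswith(old):
            form = string[:len(string) - len(old)] + new
            if not form:
                pass
            elif form in index or not form.isalpha():
                forms.append(form)
            else:
                oov_forms.append(form)
    if not forms:
        forms.extend(oov_forms)
    if not forms:
        forms.append(string)
    return set(forms)

index = {"dog", "foot"}          # toy base forms
exceptions = {"feet": ["foot"]}  # toy irregulars
rules = [["s", ""]]              # toy suffix rule

print(lemmatize("feet", index, exceptions, rules))  # {'foot'} (exception hit)
print(lemmatize("dogs", index, exceptions, rules))  # {'dog'}  (rule + index hit)
print(lemmatize("cat", index, exceptions, rules))   # {'cat'}  (no match: raw string)
```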
Why is the lemmatize method outside the Lemmatizer class? I'm not exactly sure, but possibly it's to ensure that the lemmatization function can be called without a class instance. Given that @staticmethod and @classmethod exist, though, perhaps there are other considerations behind why the function and the class have been decoupled.
Morphy vs Spacy
Comparing spacy's lemmatize() function with nltk's morphy() (which originally comes from Oliver Steele's Python port of the WordNet morphy, created over a decade ago: http://blog.osteele.com/2004/04/pywordnet-20/), the main processes in morphy() are:
- Check the exception lists
- Apply rules to the input to get y1, y2, y3, etc.
- Return everything that matches in the database (and also check the original input)
- If there are no matches, keep applying rules until a match is found
- Return an empty list if nothing can be found
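The steps above can be sketched as follows (toy_morphy is a deliberately simplified toy, not nltk's actual implementation, and the data passed in is invented):

```python
# A toy rendition of the morphy steps: check exceptions, check the
# original, apply rules repeatedly until something is in the database,
# and return an empty list when nothing is found.
def toy_morphy(word, exceptions, rules, database):
    # Step 1: check the exception lists
    if word in exceptions:
        return exceptions[word]
    # Steps 2-4: apply rules (repeatedly), checking the original too
    candidates = [word]
    seen = set(candidates)
    results = [w for w in candidates if w in database]
    while candidates and not results:
        new_candidates = []
        for w in candidates:
            for old, new in rules:
                if w.endswith(old):
                    form = w[:len(w) - len(old)] + new
                    if form and form not in seen:
                        seen.add(form)
                        new_candidates.append(form)
        candidates = new_candidates
        results = [w for w in candidates if w in database]
    # Step 5: empty list if we can't find anything
    return results

print(toy_morphy("churches", {}, [["es", ""], ["s", ""]], {"church"}))  # ['church']
```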
As for spacy, it is possibly still under development, given the TODO at https://github.com/explosion/spaCy/blob/develop/spacy/lemmatizer.py#L76.
But the general process seems to be:
- Look up the exceptions; if the word is in the exception list, take its lemma forms from there
- Apply the rules
- Keep the results that appear in the index lists
- If steps 1-3 produce no lemma forms, keep track of the out-of-vocabulary (OOV) candidates and append the original string to the lemma forms
- Return the lemma forms
In terms of OOV handling, spacy returns the original string if no lemmatized form can be found; in that respect, the nltk implementation of morphy does the same, e.g.:
>>> from nltk.stem import WordNetLemmatizer
>>> wnl = WordNetLemmatizer()
>>> wnl.lemmatize('alvations')
'alvations'
Checking for the infinitive before lemmatization
Possibly another point of difference is how morphy and spacy decide which POS to assign to the word. Here, spacy puts in some linguistic rules to decide whether a word is in its base form, and skips lemmatization entirely if the word is already in its infinitive form (is_base_form()). This can save quite a bit of time when lemmatizing all the words in a corpus where a fair proportion of them are infinitives (i.e. already in lemma form).
But this is possible in spacy because it gives the lemmatizer access to the POS tag, which is closely tied to certain morphological rules. For morphy, on the other hand, while the fine-grained PTB POS tags can reveal some of the morphology, it still takes some effort to sort them out and determine which forms are infinitives.
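The idea behind this shortcut can be sketched as follows; the feature names and conditions here are illustrative assumptions, not spaCy's exact is_base_form() logic:

```python
# A hedged sketch of the "skip if already a base form" shortcut: given
# a coarse POS and a dict of morphological features, return the token
# unchanged and avoid the lookup + rules entirely when it already looks
# like a base form. Feature names are assumptions for illustration.
def looks_like_base_form(pos, morphology):
    if pos == "verb" and morphology.get("VerbForm") == "inf":
        return True
    if pos == "noun" and morphology.get("Number") == "sing":
        return True
    return False

def lemmatize_with_shortcut(word, pos, morphology, lemmatize_fn):
    if looks_like_base_form(pos, morphology):
        return {word}          # skip the expensive path entirely
    return lemmatize_fn(word)

print(lemmatize_with_shortcut("run", "verb", {"VerbForm": "inf"},
                              lambda w: {"<expensive path>"}))  # {'run'}
```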
Generally, 3 main signals of morphological features need to be teased out of the POS tags.
Updated
SpaCy made changes to its lemmatizer after the initial answer (12 May 2017). I think the intention was to make lemmatization faster by skipping the lookups and rule processing.

So they pre-lemmatized the words and put them in a lookup hash table, making retrieval O(1) for the words they have pre-lemmatized: https://github.com/explosion/spaCy/blob/master/spacy/lang/en/lemmatizer/lookup.py
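The lookup approach boils down to a plain dict; a sketch with invented entries (not the actual lookup.py table):

```python
# Pre-lemmatized word -> lemma pairs in a dict give O(1) retrieval,
# with the raw string as the fallback for anything not in the table.
# The entries here are toy examples, not spaCy's real lookup data.
LOOKUP = {"feet": "foot", "rang": "ring", "was": "be"}

def lookup_lemmatize(word):
    return LOOKUP.get(word, word)

print(lookup_lemmatize("feet"))       # foot
print(lookup_lemmatize("alvations"))  # alvations (OOV: raw string)
```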
Also, in an effort to unify the lemmatizers across languages, the lemmatizer now lives at https://github.com/explosion/spaCy/blob/develop/spacy/lemmatizer.py#L92

But the underlying lemmatization steps described above are still relevant to the current version of spacy (4d2d7d586608ddc0bcb2857fb3c2d0d4c151ebfc).
Epilogue

I guess now that we know how it works with linguistic rules and all, the next question is: "are there any non rule-based methods for lemmatization?" But before answering that question, "what exactly is a lemma?" might be an even better question to ask.