What do spaCy's part-of-speech and dependency tags mean?

74
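spaCy tags each Token in a Document with a part of speech (in two different formats, one stored in the Token's pos and pos_ properties and the other in its tag and tag_ properties) and with a syntactic dependency on its .head token (stored in the dep and dep_ properties). Some of these tags are self-explanatory, even to somebody without a linguistics background: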
>>> import spacy
>>> en_nlp = spacy.load('en')
>>> document = en_nlp("I shot a man in Reno just to watch him die.")
>>> document[1]
shot
>>> document[1].pos_
'VERB'

Others... are not:

>>> document[1].tag_
'VBD'
>>> document[2].pos_
'DET'
>>> document[3].dep_
'dobj'
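Worse, the official docs don't even contain a list of the possible tags for most of these properties, nor the meaning of any of them. They sometimes mention which tagging standard they use, but those claims are not currently entirely accurate, and on top of that the standards are hard to track down.

What are the possible values of the tag_, pos_, and dep_ properties, and what do they mean?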

3
There is documentation for this now, see https://spacy.io/api/annotation#pos-en and https://spacy.io/api/annotation#dependency-parsing-english. - Suzana
@Suzana, the links are broken again. - VMAtm
9 Answers

125

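Short answer

Just expand the following lists:

Longer answer

The docs have improved greatly since I first asked this question, and spaCy now documents this much better.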

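Part-of-speech tags

The pos and tag attributes are tabulated at https://spacy.io/api/annotation#pos-tagging, and the origin of those lists of values is described there. At the time of this (January 2020) edit, the docs say of the pos attribute:

spaCy maps all language-specific part-of-speech tags to a small, fixed set of word type tags following the Universal Dependencies scheme. The universal tags don't code for any morphological features and only cover the word type. They're available as the Token.pos and Token.pos_ attributes.

As for the tag attribute, the docs say:

The English part-of-speech tagger uses the OntoNotes 5 version of the Penn Treebank tag set. We also map the tags to the simpler Universal Dependencies v2 POS tag set.

and:

The German part-of-speech tagger uses the TIGER Treebank annotation scheme. We also map the tags to the simpler Universal Dependencies v2 POS tag set.

You thus have a choice between a coarse-grained tag set that is consistent across languages (.pos) and a fine-grained tag set (.tag) that is specific to a particular treebank, and hence to a particular language.

The docs list the following coarse-grained tags used for the pos and pos_ attributes: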
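  • ADJ: adjective, e.g. big, old, green, incomprehensible, first
  • ADP: adposition, e.g. in, to, during
  • ADV: adverb, e.g. very, tomorrow, down, where, there
  • AUX: auxiliary, e.g. is, has (done), will (do), should (do)
  • CONJ: conjunction, e.g. and, or, but
  • CCONJ: coordinating conjunction, e.g. and, or, but
  • DET: determiner, e.g. a, an, the
  • INTJ: interjection, e.g. psst, ouch, bravo, hello
  • NOUN: noun, e.g. girl, cat, tree, air, beauty
  • NUM: numeral, e.g. 1, 2017, one, seventy-seven, IV, MMXIV
  • PART: particle, e.g. 's, not
  • PRON: pronoun, e.g. I, you, he, she, myself, themselves, somebody
  • PROPN: proper noun, e.g. Mary, John, London, NATO, HBO
  • PUNCT: punctuation, e.g. ., (, ), ?
  • SCONJ: subordinating conjunction, e.g. if, while, that
  • SYM: symbol, e.g. $, %, §, ©, +, −, ×, ÷, =, :)
  • VERB: verb, e.g. run, runs, running, eat, ate, eating
  • X: other, e.g. sfpksdpsxmsa
  • SPACE: space, e.g.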
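Note that the docs are slightly misleading when they say that this list follows the Universal Dependencies scheme; two of the tags listed above are not part of that scheme.
One of them is CONJ, which used to exist in the Universal POS tags scheme but has been split into CCONJ and SCONJ since spaCy was first written. Judging by the tag->pos mappings in the docs, spaCy's current models don't actually use CONJ, but for some reason it still exists in spaCy's code and docs, perhaps for backwards compatibility with old models.
The second is SPACE, which isn't part of the Universal POS tags scheme (and never has been, as far as I know) and is used by spaCy for any spacing other than single normal ASCII spaces (which don't get their own token):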
>>> document = en_nlp("This\nsentence\thas      some weird spaces in\n\n\n\n\t\t   it.")
>>> for token in document:
...   print('%r (%s)' % (str(token), token.pos_))
... 
'This' (DET)
'\n' (SPACE)
'sentence' (NOUN)
'\t' (SPACE)
'has' (VERB)
'     ' (SPACE)
'some' (DET)
'weird' (ADJ)
'spaces' (NOUN)
'in' (ADP)
'\n\n\n\n\t\t   ' (SPACE)
'it' (PRON)
'.' (PUNCT)
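
The whitespace rule described above can be approximated in plain Python. This is only a sketch of the behaviour visible in the output, not spaCy's tokenizer: the name whitespace_tokens is made up for illustration, and the rule it encodes (one ASCII space directly after a word is absorbed by that word, and whatever is left of the whitespace run becomes a token) is an assumption inferred from the example.

```python
import re

def whitespace_tokens(text):
    """Approximate which whitespace runs would come out as SPACE tokens.

    Assumption (inferred from the output above, not taken from spaCy's
    source): one ASCII space directly after a non-whitespace character
    belongs to the preceding token; the rest of the run becomes a token.
    """
    tokens = []
    for match in re.finditer(r"\s+", text):
        run = match.group()
        # A leading ' ' that follows a word counts as that word's trailing space.
        if match.start() > 0 and run.startswith(" "):
            run = run[1:]
        if run:
            tokens.append(run)
    return tokens

# Same sentence as above: six spaces after "has", three before "it".
text = "This\nsentence\thas" + " " * 6 + "some weird spaces in\n\n\n\n\t\t   it."
for run in whitespace_tokens(text):
    print(repr(run))
```

For this sentence the sketch yields the same four whitespace strings that spaCy tagged as SPACE above, and it yields nothing for text that only contains single spaces between words.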

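I'll omit the full list of .tag_ tags (the finer-grained ones) from this answer, since they're numerous, already well documented, different for English and German, and probably more likely to change between releases. Instead, look at the lists in the docs (e.g. https://spacy.io/api/annotation#pos-en for English), which list every possible tag, the .pos_ value it maps to, and a description of what it means.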

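Dependency tags

spaCy currently uses three different schemes for dependency tagging: one for English, one for German, and one for all other languages. Again, the list of values is huge and I won't reproduce it in full here. Every dependency has a brief definition next to it, but unfortunately many of them - like "appositional modifier" or "clausal complement" - are terms of art that are rather alien to an everyday programmer like me. If you're not a linguist, you'll simply have to research the meanings of those terms to make sense of them.

I can at least offer a starting point for that research to those working with English text. If you'd like to see examples of the CLEAR dependencies in real sentences, check out the 2012 work of Jinho D. Choi: either his Optimization of Natural Language Processing Components for Robustness and Scalability or his Guidelines for the CLEAR Style Constituent to Dependency Conversion (which seems to be just a subsection of the former paper). Both list all the CLEAR dependency labels that existed in 2012, with definitions and example sentences. (Unfortunately, the set of CLEAR dependency labels has changed a little since 2012, so some of the modern labels are not listed or exemplified in Choi's work, but it remains a useful resource despite being slightly outdated.)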


3
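Another good reference for understanding the dependency labels is the Stanford dependencies manual: https://nlp.stanford.edu/software/dependencies_manual.pdf - Nicholas Morley
@NicholasMorley That's a completely different dependency scheme, isn't it? I see things like npadvmod and mwe in there, which aren't part of any of spaCy's three dependency schemes. - Mark Amery
2
The documentation for the labels has moved to the labels section of each individual model, e.g. https://spacy.io/models/en#en_core_web_trf-labels - VMAtm
2
@VMAtm Thanks! The answer should really be edited to reflect this change; I thought I was going crazy trying to find the list of tags. - Feathercrown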

51

Here is a quick tip for getting the full meaning of the short forms. You can use the spacy.explain function, like this:

spacy.explain('pobj')

which gives you this output:

'object of preposition'
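
One caveat: spacy.explain returns None for a label it has no description for (the 'predet' entry in the en_core_web_sm output elsewhere on this page is an example). Below is a minimal sketch of a wrapper that falls back to the raw label. The name describe is hypothetical, and the explain function is passed in as an argument so that the snippet runs even without spaCy installed.

```python
def describe(label, explain):
    """Return a readable description for `label`, or a fallback string.

    `explain` is any callable mapping a label to a description or None,
    for example spacy.explain.
    """
    description = explain(label)
    if description is None:
        return f"{label} (no description available)"
    return description

# Stand-in for spacy.explain, built from two entries shown on this page.
known = {"pobj": "object of preposition", "VBD": "verb, past tense"}
stub_explain = known.get

print(describe("pobj", stub_explain))
print(describe("predet", stub_explain))
```

With spaCy installed, describe("pobj", spacy.explain) should work the same way.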

2
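Now that my self-answer is (once again) out of date, this is probably the best answer on the page. If somebody wants to compile an up-to-date list of tags and definitions, I'll leave that to them, but this answer should at least stay valuable even as the tag lists change. - Mark Amery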

11

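The official documentation now provides much more detail on all of these annotations at https://spacy.io/api/annotation (and the list of other token attributes can be found at https://spacy.io/api/token).

As the documentation shows, their part-of-speech (POS) and dependency tags have both universal and language-specific variants, and the explain() function is a very useful shortcut for getting a better description of a tag without consulting the documentation, e.g.

spacy.explain("VBD")

which gives "verb, past tense".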

9

Since the recent update of spaCy to v3, the links above no longer work.

You can visit this link to get the complete list.

Universal POS tags: [image]

English POS tags: [image]


9

2

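2023 update

There is a pip package (disclaimer: I wrote it) called spacysee that lets you explore the parse output of a spaCy document. I built it because I ran into this exact problem - and not only that, each model tends to use a different label scheme, so the documentation differs too - and in most cases it simply links to the relevant section of Universal Dependencies. [Screenshot of output]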


2
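At present, dependency parsing and tagging in spaCy appear to be implemented only at the word level, and not at the phrase level (other than noun phrases) or the clause level. This means spaCy can be used to identify things like nouns (NN, NNS), adjectives (JJ, JJR, JJS), and verbs (VB, VBD, VBG, etc.), but not adjective phrases (ADJP), adverbial phrases (ADVP), or questions (SBARQ, SQ).
For illustration, when you use spaCy to parse the sentence "Which way is the bus going?", you get the following tree [image]. By contrast, if you use the Stanford parser you get a much more deeply structured syntax tree [image].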

7
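In my opinion this doesn't answer the question I asked (although the trees are interesting and nicely illustrate the difference between the two parsers). Incidentally, what you're describing here is the difference between a phrase structure parser (like Stanford's) and a dependency parser (like spaCy's). See also https://dev59.com/JGkv5IYBdhLWcg3wtTG7#10401433. - Mark Amery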

2

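Retrieving the labels and their meanings programmatically from the pipeline/model

Instead of looking the labels up in the documentation, you can retrieve them programmatically from nlp.pipe_labels.

The advantage is that you get the actual labels provided by the trained pipeline (a.k.a. model) you are using, without having to copy them by hand.

The example code below uses the model en_core_web_sm. See the label scheme at the bottom of its model card. Adapt the code to the model you use.

Note: the universal POS tags cannot be obtained programmatically (at least I couldn't find a way); they can be looked up in the documentation.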

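import spacy
from pprint import pprint

nlp = spacy.load("en_core_web_sm")

# nlp.pipe_labels maps each pipeline component name to the labels it can assign.
for component in nlp.pipe_names:
    tags = nlp.pipe_labels[component]
    if tags:
        print(f"Label mapping for component: {component}")
        # spacy.explain returns None for labels it has no description for.
        pprint({tag: spacy.explain(tag) for tag in tags})
        print()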

Output

Label mapping for component: tagger

{'$': 'symbol, currency',
 "''": 'closing quotation mark',
 ',': 'punctuation mark, comma',
 '-LRB-': 'left round bracket',
 '-RRB-': 'right round bracket',
 '.': 'punctuation mark, sentence closer',
 ':': 'punctuation mark, colon or ellipsis',
 'ADD': 'email',
 'AFX': 'affix',
 'CC': 'conjunction, coordinating',
 'CD': 'cardinal number',
 'DT': 'determiner',
 'EX': 'existential there',
 'FW': 'foreign word',
 'HYPH': 'punctuation mark, hyphen',
 'IN': 'conjunction, subordinating or preposition',
 'JJ': 'adjective (English), other noun-modifier (Chinese)',
 'JJR': 'adjective, comparative',
 'JJS': 'adjective, superlative',
 'LS': 'list item marker',
 'MD': 'verb, modal auxiliary',
 'NFP': 'superfluous punctuation',
 'NN': 'noun, singular or mass',
 'NNP': 'noun, proper singular',
 'NNPS': 'noun, proper plural',
 'NNS': 'noun, plural',
 'PDT': 'predeterminer',
 'POS': 'possessive ending',
 'PRP': 'pronoun, personal',
 'PRP$': 'pronoun, possessive',
 'RB': 'adverb',
 'RBR': 'adverb, comparative',
 'RBS': 'adverb, superlative',
 'RP': 'adverb, particle',
 'SYM': 'symbol',
 'TO': 'infinitival "to"',
 'UH': 'interjection',
 'VB': 'verb, base form',
 'VBD': 'verb, past tense',
 'VBG': 'verb, gerund or present participle',
 'VBN': 'verb, past participle',
 'VBP': 'verb, non-3rd person singular present',
 'VBZ': 'verb, 3rd person singular present',
 'WDT': 'wh-determiner',
 'WP': 'wh-pronoun, personal',
 'WP$': 'wh-pronoun, possessive',
 'WRB': 'wh-adverb',
 'XX': 'unknown',
 '_SP': 'whitespace',
 '``': 'opening quotation mark'}


Label mapping for component: parser

{'ROOT': 'root',
 'acl': 'clausal modifier of noun (adjectival clause)',
 'acomp': 'adjectival complement',
 'advcl': 'adverbial clause modifier',
 'advmod': 'adverbial modifier',
 'agent': 'agent',
 'amod': 'adjectival modifier',
 'appos': 'appositional modifier',
 'attr': 'attribute',
 'aux': 'auxiliary',
 'auxpass': 'auxiliary (passive)',
 'case': 'case marking',
 'cc': 'coordinating conjunction',
 'ccomp': 'clausal complement',
 'compound': 'compound',
 'conj': 'conjunct',
 'csubj': 'clausal subject',
 'csubjpass': 'clausal subject (passive)',
 'dative': 'dative',
 'dep': 'unclassified dependent',
 'det': 'determiner',
 'dobj': 'direct object',
 'expl': 'expletive',
 'intj': 'interjection',
 'mark': 'marker',
 'meta': 'meta modifier',
 'neg': 'negation modifier',
 'nmod': 'modifier of nominal',
 'npadvmod': 'noun phrase as adverbial modifier',
 'nsubj': 'nominal subject',
 'nsubjpass': 'nominal subject (passive)',
 'nummod': 'numeric modifier',
 'oprd': 'object predicate',
 'parataxis': 'parataxis',
 'pcomp': 'complement of preposition',
 'pobj': 'object of preposition',
 'poss': 'possession modifier',
 'preconj': 'pre-correlative conjunction',
 'predet': None,
 'prep': 'prepositional modifier',
 'prt': 'particle',
 'punct': 'punctuation',
 'quantmod': 'modifier of quantifier',
 'relcl': 'relative clause modifier',
 'xcomp': 'open clausal complement'}


Label mapping for component: ner

{'CARDINAL': 'Numerals that do not fall under another type',
 'DATE': 'Absolute or relative dates or periods',
 'EVENT': 'Named hurricanes, battles, wars, sports events, etc.',
 'FAC': 'Buildings, airports, highways, bridges, etc.',
 'GPE': 'Countries, cities, states',
 'LANGUAGE': 'Any named language',
 'LAW': 'Named documents made into laws.',
 'LOC': 'Non-GPE locations, mountain ranges, bodies of water',
 'MONEY': 'Monetary values, including unit',
 'NORP': 'Nationalities or religious or political groups',
 'ORDINAL': '"first", "second", etc.',
 'ORG': 'Companies, agencies, institutions, etc.',
 'PERCENT': 'Percentage, including "%"',
 'PERSON': 'People, including fictional',
 'PRODUCT': 'Objects, vehicles, foods, etc. (not services)',
 'QUANTITY': 'Measurements, as of weight or distance',
 'TIME': 'Times smaller than a day',
 'WORK_OF_ART': 'Titles of books, songs, etc.'}

0

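spaCy has a glossary in its source code that maps tag codes to tag descriptions for its POS tags, syntactic categories, phrase types, dependency labels, and so on.

It is quite extensive, covers multiple frameworks (e.g. Universal Dependencies, Penn Treebank, etc.), and supports multiple languages.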

GLOSSARY = {
    # POS tags
    # Universal POS Tags
    # http://universaldependencies.org/u/pos/
    "ADJ": "adjective",
    "ADP": "adposition",
    "ADV": "adverb",
    "AUX": "auxiliary",
    "CONJ": "conjunction",
    "CCONJ": "coordinating conjunction",
    "DET": "determiner",
    "INTJ": "interjection",
    "NOUN": "noun",
    "NUM": "numeral",
    "PART": "particle",
    "PRON": "pronoun",
    "PROPN": "proper noun",
    "PUNCT": "punctuation",
    "SCONJ": "subordinating conjunction",
    "SYM": "symbol",
    "VERB": "verb",
    "X": "other",
    "EOL": "end of line",
    "SPACE": "space",
    # POS tags (English)
    # OntoNotes 5 / Penn Treebank
    # https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
    ".": "punctuation mark, sentence closer",
    ",": "punctuation mark, comma",
    "-LRB-": "left round bracket",
    "-RRB-": "right round bracket",
    "``": "opening quotation mark",
    '""': "closing quotation mark",
    "''": "closing quotation mark",
    ":": "punctuation mark, colon or ellipsis",
    "$": "symbol, currency",
    "#": "symbol, number sign",
    "AFX": "affix",
    "CC": "conjunction, coordinating",
    "CD": "cardinal number",
    "DT": "determiner",
    "EX": "existential there",
    "FW": "foreign word",
    "HYPH": "punctuation mark, hyphen",
    "IN": "conjunction, subordinating or preposition",
    "JJ": "adjective (English), other noun-modifier (Chinese)",
    "JJR": "adjective, comparative",
    "JJS": "adjective, superlative",
    "LS": "list item marker",
    "MD": "verb, modal auxiliary",
    "NIL": "missing tag",
    "NN": "noun, singular or mass",
    "NNP": "noun, proper singular",
    "NNPS": "noun, proper plural",
    "NNS": "noun, plural",
    "PDT": "predeterminer",
    "POS": "possessive ending",
    "PRP": "pronoun, personal",
    "PRP$": "pronoun, possessive",
    "RB": "adverb",
    "RBR": "adverb, comparative",
    "RBS": "adverb, superlative",
    "RP": "adverb, particle",
    "TO": 'infinitival "to"',
    "UH": "interjection",
    "VB": "verb, base form",
    "VBD": "verb, past tense",
    "VBG": "verb, gerund or present participle",
    "VBN": "verb, past participle",
    "VBP": "verb, non-3rd person singular present",
    "VBZ": "verb, 3rd person singular present",
    "WDT": "wh-determiner",
    "WP": "wh-pronoun, personal",
    "WP$": "wh-pronoun, possessive",
    "WRB": "wh-adverb",
    "SP": "space (English), sentence-final particle (Chinese)",
    "ADD": "email",
    "NFP": "superfluous punctuation",
    "GW": "additional word in multi-word expression",
    "XX": "unknown",
    "BES": 'auxiliary "be"',
    "HVS": 'forms of "have"',
    "_SP": "whitespace",
    # POS Tags (German)
    # TIGER Treebank
    # http://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/TIGERCorpus/annotation/tiger_introduction.pdf
    "$(": "other sentence-internal punctuation mark",
    "$,": "comma",
    "$.": "sentence-final punctuation mark",
    "ADJA": "adjective, attributive",
    "ADJD": "adjective, adverbial or predicative",
    "APPO": "postposition",
    "APPR": "preposition; circumposition left",
    "APPRART": "preposition with article",
    "APZR": "circumposition right",
    "ART": "definite or indefinite article",
    "CARD": "cardinal number",
    "FM": "foreign language material",
    "ITJ": "interjection",
    "KOKOM": "comparative conjunction",
    "KON": "coordinate conjunction",
    "KOUI": 'subordinate conjunction with "zu" and infinitive',
    "KOUS": "subordinate conjunction with sentence",
    "NE": "proper noun",
    "NNE": "proper noun",
    "PAV": "pronominal adverb",
    "PROAV": "pronominal adverb",
    "PDAT": "attributive demonstrative pronoun",
    "PDS": "substituting demonstrative pronoun",
    "PIAT": "attributive indefinite pronoun without determiner",
    "PIDAT": "attributive indefinite pronoun with determiner",
    "PIS": "substituting indefinite pronoun",
    "PPER": "non-reflexive personal pronoun",
    "PPOSAT": "attributive possessive pronoun",
    "PPOSS": "substituting possessive pronoun",
    "PRELAT": "attributive relative pronoun",
    "PRELS": "substituting relative pronoun",
    "PRF": "reflexive personal pronoun",
    "PTKA": "particle with adjective or adverb",
    "PTKANT": "answer particle",
    "PTKNEG": "negative particle",
    "PTKVZ": "separable verbal particle",
    "PTKZU": '"zu" before infinitive',
    "PWAT": "attributive interrogative pronoun",
    "PWAV": "adverbial interrogative or relative pronoun",
    "PWS": "substituting interrogative pronoun",
    "TRUNC": "word remnant",
    "VAFIN": "finite verb, auxiliary",
    "VAIMP": "imperative, auxiliary",
    "VAINF": "infinitive, auxiliary",
    "VAPP": "perfect participle, auxiliary",
    "VMFIN": "finite verb, modal",
    "VMINF": "infinitive, modal",
    "VMPP": "perfect participle, modal",
    "VVFIN": "finite verb, full",
    "VVIMP": "imperative, full",
    "VVINF": "infinitive, full",
    "VVIZU": 'infinitive with "zu", full',
    "VVPP": "perfect participle, full",
    "XY": "non-word containing non-letter",
    # POS Tags (Chinese)
    # OntoNotes / Chinese Penn Treebank
    # https://repository.upenn.edu/cgi/viewcontent.cgi?article=1039&context=ircs_reports
    "AD": "adverb",
    "AS": "aspect marker",
    "BA": "把 in ba-construction",
    # "CD": "cardinal number",
    "CS": "subordinating conjunction",
    "DEC": "的 in a relative clause",
    "DEG": "associative 的",
    "DER": "得 in V-de const. and V-de-R",
    "DEV": "地 before VP",
    "ETC": "for words 等, 等等",
    # "FW": "foreign words"
    "IJ": "interjection",
    # "JJ": "other noun-modifier",
    "LB": "被 in long bei-const",
    "LC": "localizer",
    "M": "measure word",
    "MSP": "other particle",
    # "NN": "common noun",
    "NR": "proper noun",
    "NT": "temporal noun",
    "OD": "ordinal number",
    "ON": "onomatopoeia",
    "P": "preposition excluding 把 and 被",
    "PN": "pronoun",
    "PU": "punctuation",
    "SB": "被 in short bei-const",
    # "SP": "sentence-final particle",
    "VA": "predicative adjective",
    "VC": "是 (copula)",
    "VE": "有 as the main verb",
    "VV": "other verb",
    # Noun chunks
    "NP": "noun phrase",
    "PP": "prepositional phrase",
    "VP": "verb phrase",
    "ADVP": "adverb phrase",
    "ADJP": "adjective phrase",
    "SBAR": "subordinating conjunction",
    "PRT": "particle",
    "PNP": "prepositional noun phrase",
    # Dependency Labels (English)
    # ClearNLP / Universal Dependencies
    # https://github.com/clir/clearnlp-guidelines/blob/master/md/specifications/dependency_labels.md
    "acl": "clausal modifier of noun (adjectival clause)",
    "acomp": "adjectival complement",
    "advcl": "adverbial clause modifier",
    "advmod": "adverbial modifier",
    "agent": "agent",
    "amod": "adjectival modifier",
    "appos": "appositional modifier",
    "attr": "attribute",
    "aux": "auxiliary",
    "auxpass": "auxiliary (passive)",
    "case": "case marking",
    "cc": "coordinating conjunction",
    "ccomp": "clausal complement",
    "clf": "classifier",
    "complm": "complementizer",
    "compound": "compound",
    "conj": "conjunct",
    "cop": "copula",
    "csubj": "clausal subject",
    "csubjpass": "clausal subject (passive)",
    "dative": "dative",
    "dep": "unclassified dependent",
    "det": "determiner",
    "discourse": "discourse element",
    "dislocated": "dislocated elements",
    "dobj": "direct object",
    "expl": "expletive",
    "fixed": "fixed multiword expression",
    "flat": "flat multiword expression",
    "goeswith": "goes with",
    "hmod": "modifier in hyphenation",
    "hyph": "hyphen",
    "infmod": "infinitival modifier",
    "intj": "interjection",
    "iobj": "indirect object",
    "list": "list",
    "mark": "marker",
    "meta": "meta modifier",
    "neg": "negation modifier",
    "nmod": "modifier of nominal",
    "nn": "noun compound modifier",
    "npadvmod": "noun phrase as adverbial modifier",
    "nsubj": "nominal subject",
    "nsubjpass": "nominal subject (passive)",
    "nounmod": "modifier of nominal",
    "npmod": "noun phrase as adverbial modifier",
    "num": "number modifier",
    "number": "number compound modifier",
    "nummod": "numeric modifier",
    "oprd": "object predicate",
    "obj": "object",
    "obl": "oblique nominal",
    "orphan": "orphan",
    "parataxis": "parataxis",
    "partmod": "participal modifier",
    "pcomp": "complement of preposition",
    "pobj": "object of preposition",
    "poss": "possession modifier",
    "possessive": "possessive modifier",
    "preconj": "pre-correlative conjunction",
    "prep": "prepositional modifier",
    "prt": "particle",
    "punct": "punctuation",
    "quantmod": "modifier of quantifier",
    "rcmod": "relative clause modifier",
    "relcl": "relative clause modifier",
    "reparandum": "overridden disfluency",
    "root": "root",
    "ROOT": "root",
    "vocative": "vocative",
    "xcomp": "open clausal complement",
    # Dependency labels (German)
    # TIGER Treebank
    # http://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/TIGERCorpus/annotation/tiger_introduction.pdf
    # currently missing: 'cc' (comparative complement) because of conflict
    # with English labels
    "ac": "adpositional case marker",
    "adc": "adjective component",
    "ag": "genitive attribute",
    "ams": "measure argument of adjective",
    "app": "apposition",
    "avc": "adverbial phrase component",
    "cd": "coordinating conjunction",
    "cj": "conjunct",
    "cm": "comparative conjunction",
    "cp": "complementizer",
    "cvc": "collocational verb construction",
    "da": "dative",
    "dh": "discourse-level head",
    "dm": "discourse marker",
    "ep": "expletive es",
    "hd": "head",
    "ju": "junctor",
    "mnr": "postnominal modifier",
    "mo": "modifier",
    "ng": "negation",
    "nk": "noun kernel element",
    "nmc": "numerical component",
    "oa": "accusative object",
    "oc": "clausal object",
    "og": "genitive object",
    "op": "prepositional object",
    "par": "parenthetical element",
    "pd": "predicate",
    "pg": "phrasal genitive",
    "ph": "placeholder",
    "pm": "morphological particle",
    "pnc": "proper noun component",
    "rc": "relative clause",
    "re": "repeated element",
    "rs": "reported speech",
    "sb": "subject",
    "sbp": "passivized subject (PP)",
    "sp": "subject or predicate",
    "svp": "separable verb prefix",
    "uc": "unit component",
    "vo": "vocative",
    # Named Entity Recognition
    # OntoNotes 5
    # https://catalog.ldc.upenn.edu/docs/LDC2013T19/OntoNotes-Release-5.0.pdf
    "PERSON": "People, including fictional",
    "NORP": "Nationalities or religious or political groups",
    "FACILITY": "Buildings, airports, highways, bridges, etc.",
    "FAC": "Buildings, airports, highways, bridges, etc.",
    "ORG": "Companies, agencies, institutions, etc.",
    "GPE": "Countries, cities, states",
    "LOC": "Non-GPE locations, mountain ranges, bodies of water",
    "PRODUCT": "Objects, vehicles, foods, etc. (not services)",
    "EVENT": "Named hurricanes, battles, wars, sports events, etc.",
    "WORK_OF_ART": "Titles of books, songs, etc.",
    "LAW": "Named documents made into laws.",
    "LANGUAGE": "Any named language",
    "DATE": "Absolute or relative dates or periods",
    "TIME": "Times smaller than a day",
    "PERCENT": 'Percentage, including "%"',
    "MONEY": "Monetary values, including unit",
    "QUANTITY": "Measurements, as of weight or distance",
    "ORDINAL": '"first", "second", etc.',
    "CARDINAL": "Numerals that do not fall under another type",
    # Named Entity Recognition
    # Wikipedia
    # http://www.sciencedirect.com/science/article/pii/S0004370212000276
    # https://pdfs.semanticscholar.org/5744/578cc243d92287f47448870bb426c66cc941.pdf
    "PER": "Named person or family.",
    "MISC": "Miscellaneous entities, e.g. events, nationalities, products or works of art",
    # https://github.com/ltgoslo/norne
    "EVT": "Festivals, cultural events, sports events, weather phenomena, wars, etc.",
    "PROD": "Product, i.e. artificially produced entities including speeches, radio shows, programming languages, contracts, laws and ideas",
    "DRV": "Words (and phrases?) that are dervied from a name, but not a name in themselves, e.g. 'Oslo-mannen' ('the man from Oslo')",
    "GPE_LOC": "Geo-political entity, with a locative sense, e.g. 'John lives in Spain'",
    "GPE_ORG": "Geo-political entity, with an organisation sense, e.g. 'Spain declined to meet with Belgium'",
}
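
Because the glossary is one flat dict keyed by the label string, a label can carry only one description, which is why the comments above note that the German 'cc' label is missing. Here is a small sketch using a hand-copied excerpt. The names excerpt and lookup are illustrative only; with spaCy installed, the full dict should be importable as spacy.glossary.GLOSSARY.

```python
# A few entries copied from the glossary above.
excerpt = {
    "cc": "coordinating conjunction",  # English dependency label
    "cd": "coordinating conjunction",  # German (TIGER) dependency label
    "VBD": "verb, past tense",         # English fine-grained POS tag
    "ADJA": "adjective, attributive",  # German fine-grained POS tag
}

def lookup(label, glossary=excerpt):
    """Return the description for `label`, or None when the label is unknown."""
    return glossary.get(label)

# Adding the German meaning of "cc" would overwrite the English one,
# because a dict holds exactly one value per key.
clash = dict(excerpt)
clash["cc"] = "comparative complement"

print(lookup("cc"))           # English meaning from the excerpt
print(lookup("cc", clash))    # the German meaning has replaced it
print(lookup("not-a-label"))  # unknown labels give None
```

The same lookup works for every scheme in the glossary, so you do not need to know in advance which treebank a label comes from.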
