How do I find the vocabulary size of a spaCy model?

Question

nlp documentation spacy vocabulary

5

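I am trying to find the vocabulary size of the large English model, en_core_web_lg, and I have found three different sources of information:

spaCy's docs: 685k keys, 685k unique vectors
nlp.vocab.__len__(): 1340242 # (number of words in the vocabulary)
len(vocab.strings): 1476045

What is the difference between these three? I could not find the answer in the docs.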

- Yannis Ch

2 Answers

1

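As of spaCy 2.3+, according to the release notes, lexemes are no longer loaded into nlp.vocab on initialization, so len(nlp.vocab) no longer gives a meaningful vocabulary size. Instead, use nlp.meta['vectors'] to find the number of unique vectors and the number of words with vectors. Here is the relevant section of the release notes: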

To reduce the initial loading time, the lexemes in nlp.vocab are no longer loaded on initialization for models with vectors. As you process texts, the lexemes will be added to the vocab automatically, just as in small models without vectors.

To see the number of unique vectors and number of words with vectors, see nlp.meta['vectors'], for example for en_core_web_md there are 20000 unique vectors and 684830 words with vectors:
{
    'width': 300,
    'vectors': 20000,
    'keys': 684830,
    'name': 'en_core_web_md.vectors'
}
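As a minimal sketch of how those fields could be read, the helper below pulls the two counts out of a metadata dictionary shaped like the one quoted above. The function name summarize_vectors_meta is hypothetical (it is not part of spaCy), and the sample dictionary simply repeats the en_core_web_md figures from the release notes so that the snippet runs without any model installed.

```python
def summarize_vectors_meta(meta):
    """Extract the vector-related counts from a pipeline's meta dictionary.

    Hypothetical helper, not a spaCy API. It only assumes the layout
    shown in the release notes: meta["vectors"] holds "width",
    "vectors" (unique vectors) and "keys" (words with vectors).
    """
    info = meta.get("vectors", {})
    return {
        "width": info.get("width", 0),
        "unique_vectors": info.get("vectors", 0),
        "words_with_vectors": info.get("keys", 0),
    }


# Sample metadata copied from the release-notes excerpt (en_core_web_md).
sample_meta = {
    "vectors": {
        "width": 300,
        "vectors": 20000,
        "keys": 684830,
        "name": "en_core_web_md.vectors",
    }
}

summary = summarize_vectors_meta(sample_meta)
for field, value in summary.items():
    print(f"{field}: {value}")
```

With a real pipeline, the same helper could presumably be called as summarize_vectors_meta(nlp.meta) after nlp = spacy.load("en_core_web_md"), assuming spaCy and that model package are installed.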

- today

- aab · Accepted Answer

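The most useful numbers are the ones related to word vectors. nlp.vocab.vectors.n_keys tells you how many tokens have word vectors, and len(nlp.vocab.vectors) tells you how many unique word vectors there are (multiple tokens can point to the same word vector in md models).

len(vocab) is the number of cached lexemes. In md and lg models, most of those 1340242 lexemes have some precomputed features (such as Token.prob), but the cache can also contain lexemes without precomputed features, because more entries can be added as you process texts.

len(vocab.strings) is the number of strings related to both tokens and annotations (such as nsubj or NOUN), so it is not a particularly useful number. All strings used anywhere in training or processing are stored here so that the internal integer hashes can be converted back to strings when needed.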
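To compare the four counts side by side, something like the following sketch could be used. The names vocab_stats and main are hypothetical helpers, not spaCy functions; the attributes they read (vocab.vectors.n_keys, len(vocab.vectors), len(vocab), len(vocab.strings)) are the ones named in this answer. The spaCy import lives inside main() on the assumption that spaCy and en_core_web_lg are installed; the exact numbers printed will depend on the spaCy and model versions.

```python
def vocab_stats(nlp):
    """Collect the counts discussed above from a loaded spaCy pipeline."""
    vocab = nlp.vocab
    return {
        # Tokens (keys) that have a word vector.
        "keys_with_vectors": vocab.vectors.n_keys,
        # Rows in the vector table, i.e. unique word vectors.
        "unique_vectors": len(vocab.vectors),
        # Lexemes currently cached; grows as texts are processed.
        "cached_lexemes": len(vocab),
        # Every stored string, including labels such as "nsubj" or "NOUN".
        "stored_strings": len(vocab.strings),
    }


def main():
    # Assumption: spaCy and the en_core_web_lg package are installed.
    import spacy

    nlp = spacy.load("en_core_web_lg")
    for name, count in vocab_stats(nlp).items():
        print(f"{name}: {count}")
```

Calling main() prints one line per count. Calling vocab_stats(nlp) again after processing some text should show whether cached_lexemes and stored_strings have grown, which is the behaviour described for the lexeme cache.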