I want to train an XLNet language model from scratch. First, I need to train a tokenizer as follows:
from tokenizers import ByteLevelBPETokenizer
# Initialize a tokenizer
tokenizer = ByteLevelBPETokenizer()
# Customize training
tokenizer.train(files='data.txt', min_frequency=2, special_tokens=[  # default vocab size
    "<s>",
    "<pad>",
    "</s>",
    "<unk>",
    "<mask>",
])
tokenizer.save_model("tokenizer model")
In the end, I have two files in the given directory:
merges.txt
vocab.json
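One quick way to confirm what ended up in that directory is to read vocab.json back; it is a JSON object mapping each token string to its integer ID. The following is a minimal sketch using only the standard library. The helper name special_token_ids is hypothetical, and the sketch assumes the "tokenizer model" directory from the training step above.

```python
import json
from pathlib import Path

# The special tokens passed to tokenizer.train() in the question.
SPECIAL_TOKENS = ["<s>", "<pad>", "</s>", "<unk>", "<mask>"]


def special_token_ids(model_dir, tokens=SPECIAL_TOKENS):
    """Return {token: id} for each special token; id is None if the token is absent.

    Hypothetical helper: it only reads vocab.json and does not touch merges.txt.
    """
    vocab_path = Path(model_dir) / "vocab.json"
    with vocab_path.open(encoding="utf-8") as f:
        vocab = json.load(f)  # mapping of token string -> integer ID
    return {tok: vocab.get(tok) for tok in tokens}


if __name__ == "__main__":
    model_dir = Path("tokenizer model")
    # Only run the check when the trained files are actually present.
    if (model_dir / "vocab.json").exists():
        print(special_token_ids(model_dir))
```

A None value for any token would mean that token never made it into the vocabulary.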
I have defined the following configuration for the model:
from transformers import XLNetConfig, XLNetModel
config = XLNetConfig()
Now I want to recreate my tokenizer in transformers:
from transformers import XLNetTokenizerFast
tokenizer = XLNetTokenizerFast.from_pretrained("tokenizer model")
However, the following error appears:
  File "dfgd.py", line 8, in <module>
    tokenizer = XLNetTokenizerFast.from_pretrained("tokenizer model")
  File "C:\Users\DSP\AppData\Roaming\Python\Python37\site-packages\transformers\tokenization_utils_base.py", line 1777, in from_pretrained
    raise EnvironmentError(msg)
OSError: Can't load tokenizer for 'tokenizer model'. Make sure that:
- 'tokenizer model' is a correct model identifier listed on 'https://huggingface.co/models'
- or 'tokenizer model' is the correct path to a directory containing relevant tokenizer files
What should I do?
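Is "tokenizer model" just a placeholder for the full path? - cronoik
(str or os.PathLike, optional), see here. - Shijith
You cannot use the from_pretrained method as is, because it requires a tokenizer_config.json. Add one and it will work directly. @BNoor - cronoik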
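Following cronoik's suggestion, one way to add the missing file is to write it by hand next to merges.txt and vocab.json. This is a minimal sketch, not a verified fix. The helper name write_tokenizer_config is hypothetical, and the keys shown are an assumption: they simply mirror the special tokens from the training call. Which keys are actually needed may depend on the installed transformers version and on the tokenizer class being loaded.

```python
import json
from pathlib import Path

# Assumed contents: the special tokens from the tokenizer.train() call in the question.
DEFAULT_CONFIG = {
    "bos_token": "<s>",
    "eos_token": "</s>",
    "unk_token": "<unk>",
    "pad_token": "<pad>",
    "mask_token": "<mask>",
}


def write_tokenizer_config(model_dir, config=None):
    """Write tokenizer_config.json into model_dir and return the file path.

    Hypothetical helper: it creates the directory if needed and overwrites
    any existing tokenizer_config.json.
    """
    model_dir = Path(model_dir)
    model_dir.mkdir(parents=True, exist_ok=True)
    path = model_dir / "tokenizer_config.json"
    with path.open("w", encoding="utf-8") as f:
        json.dump(DEFAULT_CONFIG if config is None else config, f, indent=2)
    return path


# Example call, using the directory name from the question:
# write_tokenizer_config("tokenizer model")
```

Note that XLNet's own tokenizer also uses <sep> and <cls> tokens by default, which the training call above does not define, so the config alone may not be enough for XLNetTokenizerFast.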