Training your own model in OpenNLP

24

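I'm finding it difficult to create my own OpenNLP model. Can anyone tell me how to build one?

How should the training be done? What should the input be, and where will the output model file be stored?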


2
Which tool are you creating a model for? - wcolen
4 Answers

11

https://opennlp.apache.org/docs/1.5.3/manual/opennlp.html

This manual is very useful: it presents code examples and shows how to use the OpenNLP command-line application to train models for tasks such as entity extraction and part-of-speech tagging. To train a model, you create a file containing your training data; each model type expects a different format. Running that file through either the API or the opennlp application produces a .bin file, which can then be loaded as a model and used with the API described in the manual.

1
Or you could just say RTFM and save some typing. - demongolem
Let me point to the latest documentation, at http://opennlp.apache.org/docs/1.8.1/manual/opennlp.html. - Suneel Marthi

7

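First, you need training data annotated with the entities you want to recognize.

Sentences should be separated by a newline character (\n). Values should be separated from the <START:type> and <END> tags by a space character.
Suppose you want to create a medicine entity model; the data should then look like this: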

<START:medicine> Augmentin-Duo <END> is a penicillin antibiotic that contains two medicines - <START:medicine> amoxicillin trihydrate <END> and <START:medicine> potassium clavulanate <END> .
They work together to kill certain types of bacteria and are used to treat certain types of bacterial infections.
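The formatting rules above are easy to get wrong by hand, so it can help to check each line before training. The following is a minimal sketch in plain Java, not part of OpenNLP: the class name `AnnotationCheck` and the method `entities` are hypothetical. It assumes, as described above, that tags are separated from the surrounding words by whitespace, so a tag glued to punctuation (for example `<END>.`) is treated as an ordinary token and reported as a missing `<END>`.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not an OpenNLP class: checks one line of
// name finder training data and lists the entities it annotates.
public class AnnotationCheck {

    /**
     * Returns the annotated entities in one training line as "type=text" strings.
     * Throws IllegalArgumentException if the START/END tags are unbalanced.
     */
    public static List<String> entities(String line) {
        List<String> found = new ArrayList<>();
        String type = null;                       // type of the currently open span, if any
        StringBuilder text = new StringBuilder(); // words collected inside the open span
        for (String token : line.trim().split("\\s+")) {
            if (token.startsWith("<START:") && token.endsWith(">")) {
                if (type != null) {
                    throw new IllegalArgumentException("<START> found before previous <END>");
                }
                type = token.substring("<START:".length(), token.length() - 1);
            } else if (token.equals("<END>")) {
                if (type == null) {
                    throw new IllegalArgumentException("<END> without <START>");
                }
                found.add(type + "=" + text.toString().trim());
                type = null;
                text.setLength(0);
            } else if (type != null) {
                text.append(token).append(' ');
            }
        }
        if (type != null) {
            throw new IllegalArgumentException("Missing <END> for type " + type);
        }
        return found;
    }

    public static void main(String[] args) {
        String line = "<START:medicine> Augmentin-Duo <END> is a penicillin antibiotic .";
        System.out.println(entities(line)); // prints [medicine=Augmentin-Duo]
    }
}
```

Running this over every line of the training file before calling the trainer gives an early, readable error instead of a failure partway through training.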

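You can refer to a sample dataset for an example. The training data should contain at least 15,000 sentences for better results.
Further, you can use the OpenNLP TokenNameFinderTrainer. The output file will be in .bin format.
Here is an example: Writing a custom NameFinder model in OpenNLP. For more details, see the OpenNLP documentation.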
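To see how a corpus compares with the size guideline above, a quick count of sentences and annotated entities can be useful. This is a rough sketch in plain Java, not an OpenNLP utility: the class `CorpusStats` and its `summarize` method are hypothetical, and it assumes one sentence per non-blank line with tags of the form `<START:type>`.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not an OpenNLP class: summarizes a training corpus.
public class CorpusStats {
    private static final Pattern START_TAG = Pattern.compile("<START:([^>\\s]+)>");

    public final int sentences;                      // number of non-blank lines
    public final Map<String, Integer> entitiesByType; // annotated spans per entity type

    private CorpusStats(int sentences, Map<String, Integer> entitiesByType) {
        this.sentences = sentences;
        this.entitiesByType = entitiesByType;
    }

    /** Counts non-blank lines and <START:type> tags per type. */
    public static CorpusStats summarize(Reader in) throws IOException {
        int sentences = 0;
        Map<String, Integer> counts = new TreeMap<>();
        BufferedReader reader = new BufferedReader(in);
        String line;
        while ((line = reader.readLine()) != null) {
            if (line.trim().isEmpty()) {
                continue; // blank lines are not counted as sentences
            }
            sentences++;
            Matcher m = START_TAG.matcher(line);
            while (m.find()) {
                counts.merge(m.group(1), 1, Integer::sum);
            }
        }
        return new CorpusStats(sentences, counts);
    }

    public static void main(String[] args) throws IOException {
        String sample = "<START:medicine> Augmentin-Duo <END> is a penicillin antibiotic .\n"
                + "They work together to kill certain types of bacteria .\n";
        CorpusStats stats = summarize(new StringReader(sample));
        // prints: 2 sentences, entities: {medicine=1}
        System.out.println(stats.sentences + " sentences, entities: " + stats.entitiesByType);
    }
}
```

For a real file, pass a `java.io.FileReader` (or an `InputStreamReader` with an explicit charset) instead of the `StringReader` used here.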

2

1

Copy the data from the link below into data.txt and run the code that follows to get your own mymodel.bin.

For reference, data = https://github.com/mccraigmccraig/opennlp/blob/master/src/test/resources/opennlp/tools/namefind/AnnotatedSentencesWithTypes.txt

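import java.io.BufferedOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.Charset;
import java.util.Collections;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.NameSample;
import opennlp.tools.namefind.NameSampleDataStream;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;

// Note: these calls appear to match the OpenNLP 1.5.x API; later releases changed the train(...) signatures.
public class Training {
    // path the trained model is written to
    static String onlpModelPath = "mymodel.bin";
    // training data set
    static String trainingDataFilePath = "data.txt";

    public static void main(String[] args) throws IOException {
        Charset charset = Charset.forName("UTF-8");
        // read the training file line by line; each line is one annotated sentence
        ObjectStream<String> lineStream = new PlainTextByLineStream(
                new FileInputStream(trainingDataFilePath), charset);
        ObjectStream<NameSample> sampleStream = new NameSampleDataStream(lineStream);
        TokenNameFinderModel model = null;
        try {
            // alternative with explicit iterations (100) and cutoff (4):
            // model = NameFinderME.train("en", "drugs", sampleStream, Collections.<String, Object>emptyMap(), 100, 4);
            model = NameFinderME.train("en", "drugs", sampleStream,
                    Collections.<String, Object>emptyMap());
        } finally {
            sampleStream.close();
        }
        // write the trained model to mymodel.bin
        BufferedOutputStream modelOut = null;
        try {
            modelOut = new BufferedOutputStream(new FileOutputStream(onlpModelPath));
            model.serialize(modelOut);
        } finally {
            if (modelOut != null)
                modelOut.close();
        }
    }
}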

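Welcome to Stack Overflow! While this code may help solve the problem, it doesn't explain why and/or how it answers the question. Providing this additional context would significantly improve its long-term educational value. Please [edit] your answer to add an explanation, including what limitations and assumptions apply. - Toby Speight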
