Training your own model in OpenNLP

Question

file model opennlp

24

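I am finding it difficult to create my own OpenNLP model. Can anyone tell me how to build one?

How should the training be done? What should the input be, and where is the output model file stored?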

- user1482228

2

Which tool are you creating a model for? - wcolen

4 Answers

7

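First, you need training data annotated with the entity you want to detect.

Sentences should be separated by a newline character (\n). Tokens should be separated from the <START:type> and <END> tags by a space character.
Suppose you want to create a medicine entity model; the data should then look like this: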

<START:medicine> Augmentin-Duo <END> is a penicillin antibiotic that contains two medicines - <START:medicine> amoxicillin trihydrate <END> and <START:medicine> potassium clavulanate <END> .
They work together to kill certain types of bacteria and are used to treat certain types of bacterial infections .
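If the entities are already known as token positions, lines in this format can be generated rather than typed by hand. The following is a minimal sketch using only the Java standard library; the class `TrainingLineFormatter` and its `Entity` helper are hypothetical names made up for this illustration and are not part of OpenNLP. It assumes the entities are sorted by start index and do not overlap.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: render tokens plus entity spans as one "<START:type> ... <END>" training line. */
class TrainingLineFormatter {

    /** Hypothetical helper: tokens in [start, end) form one entity of the given type. */
    static final class Entity {
        final int start;
        final int end;
        final String type;

        Entity(int start, int end, String type) {
            this.start = start;
            this.end = end;
            this.type = type;
        }
    }

    /** Assumes entities are sorted by start index, non-empty, and non-overlapping. */
    static String format(String[] tokens, List<Entity> entities) {
        StringBuilder line = new StringBuilder();
        int next = 0; // index of the next entity that has not been closed yet
        for (int i = 0; i < tokens.length; i++) {
            if (next < entities.size() && entities.get(next).start == i) {
                line.append("<START:").append(entities.get(next).type).append("> ");
            }
            line.append(tokens[i]);
            if (next < entities.size() && entities.get(next).end == i + 1) {
                line.append(" <END>");
                next++;
            }
            if (i < tokens.length - 1) {
                line.append(' '); // every token and tag is separated by a single space
            }
        }
        return line.toString();
    }

    public static void main(String[] args) {
        String[] tokens = {"Augmentin-Duo", "is", "a", "penicillin", "antibiotic", "."};
        List<Entity> entities = new ArrayList<>();
        entities.add(new Entity(0, 1, "medicine"));
        System.out.println(format(tokens, entities));
    }
}
```

Each call produces one sentence; writing the results to a file with one sentence per line gives input in the layout described above.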

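You can refer to a sample dataset for an example. The training data should contain at least 15,000 sentences to get good results.

Next, you can use the OpenNLP TokenNameFinderTrainer. The output file will be in .bin format.

Here is an example: Writing a custom NameFinder model in OpenNLP. For more details, see the OpenNLP documentation.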
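Before training, it can help to check how many sentences a file actually contains and whether every tag is closed. The sketch below is one possible sanity check using only the Java standard library; `TrainingDataCheck` and both of its methods are hypothetical names for this illustration, and the rules it applies (whitespace-separated tags, no nested entities) are assumptions about the format rather than OpenNLP's own parser.

```java
import java.util.Arrays;
import java.util.List;

/** Sketch: basic sanity checks for name finder training lines. */
class TrainingDataCheck {

    /** True if every <START...> tag on the line is closed by <END> before another one opens. */
    static boolean isWellFormed(String line) {
        boolean open = false;
        for (String token : line.trim().split("\\s+")) {
            if (token.startsWith("<START") && token.endsWith(">")) {
                if (open) {
                    return false; // nested or repeated opening tag
                }
                open = true;
            } else if (token.equals("<END>")) {
                if (!open) {
                    return false; // closing tag without an opening tag
                }
                open = false;
            }
        }
        return !open; // an entity left open at the end of the line is an error
    }

    /** Counts non-blank lines, treating each one as a sentence. */
    static long countSentences(List<String> lines) {
        long count = 0;
        for (String line : lines) {
            if (!line.trim().isEmpty()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "<START:medicine> Augmentin-Duo <END> is a penicillin antibiotic .",
                "",
                "It contains <START:medicine> amoxicillin trihydrate and more .");
        System.out.println("sentences: " + countSentences(lines));
        for (String line : lines) {
            if (!isWellFormed(line)) {
                System.out.println("malformed: " + line);
            }
        }
    }
}
```

Note that a tag glued to punctuation, such as `<END>.`, is not recognized as a closing tag by this check, so the line is reported as malformed.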

- Nishu Tayal

2

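Maybe this article will help you. It describes how to extract data from Wikipedia for TokenNameFinder training...

nuxeo - blog - Mining Wikipedia with Hadoop and Pig for Natural Language Processing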

- Oto Brglez

1

Copy the sample data into data.txt and run the code below to get your own mymodel.bin.

For sample data, see https://github.com/mccraigmccraig/opennlp/blob/master/src/test/resources/opennlp/tools/namefind/AnnotatedSentencesWithTypes.txt.

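import java.io.BufferedOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.Charset;
import java.util.Collections;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.NameSample;
import opennlp.tools.namefind.NameSampleDataStream;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;

// These calls match the older OpenNLP 1.5.x API; newer releases changed some of these signatures.
public class Training {
    // Where the trained model is written
    static String onlpModelPath = "mymodel.bin";
    // Training data set, one annotated sentence per line
    static String trainingDataFilePath = "data.txt";

    public static void main(String[] args) throws IOException {
        Charset charset = Charset.forName("UTF-8");
        // Read the training file line by line and parse each line into a NameSample
        ObjectStream<String> lineStream = new PlainTextByLineStream(
                new FileInputStream(trainingDataFilePath), charset);
        ObjectStream<NameSample> sampleStream = new NameSampleDataStream(lineStream);

        TokenNameFinderModel model;
        try {
            // Alternative overload with two extra training parameters:
            // model = NameFinderME.train("en", "drugs", sampleStream, Collections.<String, Object>emptyMap(), 100, 4);
            model = NameFinderME.train("en", "drugs", sampleStream,
                    Collections.<String, Object>emptyMap());
        } finally {
            sampleStream.close();
        }

        // Serialize the trained model to mymodel.bin
        OutputStream modelOut = null;
        try {
            modelOut = new BufferedOutputStream(new FileOutputStream(onlpModelPath));
            model.serialize(modelOut);
        } finally {
            if (modelOut != null) {
                modelOut.close();
            }
        }
    }
}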

- user6858643

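Welcome to Stack Overflow! While this code may help solve the problem, it doesn't explain why and/or how it answers the question. Providing this additional context would significantly improve its long-term educational value. Please [edit] your answer to add an explanation, including what limitations and assumptions apply. - Toby Speight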


- andrew.butkus · Accepted Answer

https://opennlp.apache.org/docs/1.5.3/manual/opennlp.html

This website is very useful. It presents code examples and shows how to use the OpenNLP command-line application to train models for tasks such as entity extraction and part-of-speech tagging. To train a model, you need to create a file containing the annotated content you want to train on; each model type expects a different format. After you run the file through either the API or the opennlp application, it generates a .bin file, which can be loaded as a model and used with the API described on the website.