Stanford NLP Tokenizer

Question

6

How can I use the Stanford parser to tokenize a string in a Java class?

I could only find examples that use DocumentPreprocessor and PTBTokenizer to read the text from an external file.

    // option #1: By sentence
    DocumentPreprocessor dp = new DocumentPreprocessor("hello.txt");
    for (List<HasWord> sentence : dp) {
      System.out.println(sentence);
    }

    // option #2: By token
    PTBTokenizer<CoreLabel> ptbt = new PTBTokenizer<>(new FileReader("hello.txt"),
        new CoreLabelTokenFactory(), "");
    while (ptbt.hasNext()) {
      CoreLabel label = ptbt.next();
      System.out.println(label);
    }

Thanks.

- Naveen

1 Answer

- CapelliC · Accepted Answer

6

The PTBTokenizer constructor takes a java.io.Reader, so you can wrap your text in a StringReader and tokenize it directly.
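A minimal sketch of that idea, assuming the Stanford CoreNLP (or Stanford Parser) jar is on the classpath. The class name StringTokenizerDemo and the helper tokenize are hypothetical names chosen for illustration; the constructor arguments mirror the file-based example in the question, with a StringReader swapped in for the FileReader.

```java
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.process.CoreLabelTokenFactory;
import edu.stanford.nlp.process.PTBTokenizer;

import java.io.StringReader;
import java.util.List;

public class StringTokenizerDemo {

    // Hypothetical helper: tokenizes an in-memory string instead of a file.
    static List<CoreLabel> tokenize(String text) {
        // StringReader is a java.io.Reader, so it can stand in for the FileReader.
        // The empty string means "no extra tokenizer options".
        PTBTokenizer<CoreLabel> tokenizer =
            new PTBTokenizer<>(new StringReader(text), new CoreLabelTokenFactory(), "");
        // tokenize() reads the remaining input and returns all tokens as a list.
        return tokenizer.tokenize();
    }

    public static void main(String[] args) {
        for (CoreLabel token : tokenize("Hello world. This is a test.")) {
            System.out.println(token.word());
        }
    }
}
```

Iterating with hasNext() and next(), as in the question, should work just as well here; tokenize() is only a convenience when the whole list is wanted at once.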

- CapelliC

Could you write out the code for the constructor? How do I use the reader? Thanks. - Naveen

4

Never mind, this gives me the tokens: List<CoreLabel> rawWords = tokenizerFactory.getTokenizer(new StringReader(sentence)).tokenize(); System.out.println(rawWords.get(0).value()); - Naveen

1

I spent a while opening NetBeans, creating a new project and so on... and then the power suddenly went out... damn... - CapelliC

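@Naveen Thanks for sharing your solution! But won't this create a new PTBTokenizer object every time you pass in a different sentence? If you have multiple sentences, I suppose a preliminary step would be to concatenate them into a single string "sentences" and then apply your solution to "sentences"? - Nishant Kelkar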
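The comments above never show where tokenizerFactory comes from. One possible way to obtain it, sketched here under the assumption that the Stanford CoreNLP jar is on the classpath, is PTBTokenizer.factory. The factory is built once and reused, while getTokenizer still returns a new tokenizer for each reader, as the last comment suspects. The class name MultiSentenceTokenizerDemo and the helper tokenizeAll are hypothetical.

```java
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.process.CoreLabelTokenFactory;
import edu.stanford.nlp.process.PTBTokenizer;
import edu.stanford.nlp.process.TokenizerFactory;

import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class MultiSentenceTokenizerDemo {

    // Built once and shared; the empty string means default tokenizer options.
    private static final TokenizerFactory<CoreLabel> FACTORY =
        PTBTokenizer.factory(new CoreLabelTokenFactory(), "");

    // Hypothetical helper: tokenizes each input string separately.
    static List<List<CoreLabel>> tokenizeAll(List<String> sentences) {
        List<List<CoreLabel>> result = new ArrayList<>();
        for (String sentence : sentences) {
            // A new tokenizer is created per string; only the factory is reused.
            result.add(FACTORY.getTokenizer(new StringReader(sentence)).tokenize());
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("Hello world.", "This is a test.");
        for (List<CoreLabel> tokens : tokenizeAll(input)) {
            System.out.println(tokens);
        }
    }
}
```

Concatenating the sentences and tokenizing once, as suggested in the comment, is the alternative; it yields a single flat token list, so the per-sentence grouping kept here would be lost unless the sentences are split again afterwards.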