Algorithm to extract simple sentences from complex (mixed) sentences?

4

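Is there an algorithm that can be used to extract simple sentences from paragraphs?

My ultimate goal is to run another algorithm on the resulting simple sentences to determine the author's sentiment.

I have researched this using sources such as Chae-Deug Park, but none of them discuss preparing simple sentences as training data.

Thanks in advance.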


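What exactly do you mean by "simple sentence"? A single sentence as opposed to a paragraph, in which case your question is about sentence boundary detection? Or a sentence containing only one subject-predicate structure (as opposed to complex sentences with subordinate clauses and the like)? Or something else entirely? - jogojapan
Hi jogojapan, yes, that's right, I mean a single sentence as opposed to a paragraph... - John Rambo
You haven't defined precisely what a simple sentence is, so it is hard for anyone to answer your question. Perhaps you want to use something like the Stanford Parser to get a parse tree for each sentence and discard the sentences that are not of the "NP VP" type, i.e. a noun phrase followed by a verb phrase (e.g. "[John] [sat on the bench]", "[Mary and Jill] [ate their sandwiches]", etc.). - Aditya Mukherji
"Simple sentence" is a well-defined concept in English grammar. I don't see why it needs to be defined in an SO question tagged "nlp". For readers not involved in NLP, I suppose @JohnRambo could provide a link to a definition (e.g. http://grammar.about.com/od/rs/g/simpsenterm.htm). - Chthonic Project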
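
The "NP VP" filter suggested in the comments could be sketched roughly as follows. This is only an illustration: it does not call any parser, it assumes the parse has already been printed as a single-line Penn Treebank style bracketed string (the kind of output a constituency parser such as the Stanford Parser can produce), and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class NpVpFilter {

    // Returns the labels of the direct children of the first "(S ..." node
    // in a bracketed parse string (assumed format, see lead-in).
    static List<String> childLabelsOfS(String tree) {
        List<String> labels = new ArrayList<>();
        int s = tree.indexOf("(S ");
        if (s < 0) return labels;
        int depth = 0;
        for (int i = s; i < tree.length(); i++) {
            char c = tree.charAt(i);
            if (c == '(') {
                depth++;
                if (depth == 2) {
                    // Direct child of S: read its label up to the next space or parenthesis.
                    int j = i + 1;
                    while (j < tree.length() && " ()".indexOf(tree.charAt(j)) < 0) j++;
                    labels.add(tree.substring(i + 1, j));
                }
            } else if (c == ')') {
                depth--;
                if (depth == 0) break; // the S node itself has been closed
            }
        }
        return labels;
    }

    // True when the S node consists of exactly a noun phrase followed by a verb phrase,
    // ignoring punctuation tags such as "." and ",".
    static boolean isNpVp(String tree) {
        List<String> labels = childLabelsOfS(tree);
        labels.removeIf(l -> l.isEmpty() || !Character.isLetter(l.charAt(0)));
        return labels.equals(Arrays.asList("NP", "VP"));
    }

    public static void main(String[] args) {
        String simple = "(ROOT (S (NP (NNP John)) (VP (VBD sat) (PP (IN on) (NP (DT the) (NN bench)))) (. .)))";
        System.out.println(isNpVp(simple));
    }
}
```

A sentence that opens with a subordinate clause has an SBAR child under S, so it would be rejected by this check.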
2 Answers

2

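Take a look at Apache OpenNLP; it has a sentence detection module. The documentation includes examples of how to use it from the command line and from the API.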


1
I just used OpenNLP to do the same thing.
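import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;
import java.util.List;
import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;
import opennlp.tools.util.InvalidFormatException;

public static List<String> breakIntoSentencesOpenNlp(String paragraph) throws FileNotFoundException, IOException,
        InvalidFormatException {

    // Load the pre-trained English sentence model; try-with-resources closes the stream even if loading fails.
    try (InputStream is = new FileInputStream("resources/models/en-sent.bin")) {
        SentenceModel model = new SentenceModel(is);
        SentenceDetectorME sdetector = new SentenceDetectorME(model);
        return Arrays.asList(sdetector.sentDetect(paragraph));
    }
}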

Examples
    //Failed at Hi.
    paragraph = "Hi. How are you? This is Mike.";
    SentenceDetector.breakIntoSentencesOpenNlp(paragraph).forEach(sentence -> System.out.println(sentence));

    //Failed at Door.Noone
    paragraph = "Close the Door.Noone is out there";
    SentenceDetector.breakIntoSentencesOpenNlp(paragraph).forEach(sentence -> System.out.println(sentence));//not able to break on noone

    paragraph = "Really!! I cant believe. Mr. Wilson can come any moment to receive mrs. watson.";
    SentenceDetector.breakIntoSentencesOpenNlp(paragraph).forEach(sentence -> System.out.println(sentence));

    //Failed at dr.
    paragraph = "Radhika, Mohan, and Shaik went to meet dr. Kashyap to raise fund for poor patients.";
    SentenceDetector.breakIntoSentencesOpenNlp(paragraph).forEach(sentence -> System.out.println(sentence));//breaking on dr.

    paragraph = "This is how I tried to split a paragraph into a sentence. But, there is a problem. My paragraph includes dates like Jan.13, 2014 , words like U.S. and numbers like 2.2. They all got splitted by the above code.";
    SentenceDetector.breakIntoSentencesOpenNlp(paragraph).forEach(sentence -> System.out.println(sentence));

    paragraph = "www.thinkzarahatke.com is the second site I developed. You can send mail to admin@thinkzarahatke.com";
    SentenceDetector.breakIntoSentencesOpenNlp(paragraph).forEach(sentence -> System.out.println(sentence));

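It fails only when the input contains human errors. For example, the abbreviation "Dr." should have a capital D, and there should be at least one space between two sentences.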
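
If the input cannot be trusted to follow those two conventions, one option is to normalise it before sentence detection. The following is a minimal sketch under stated assumptions: the class name, the method name and the short list of titles are hypothetical, and the missing-space rule only fires when a period sits between a lower-case and an upper-case letter, so it leaves strings such as "U.S.", "Jan.13", "2.2" and lower-case URLs alone but may still misfire on other inputs.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SentenceInputNormalizer {

    // Lower-case titles followed by a period, e.g. "dr." or "mrs." (illustrative, incomplete list).
    private static final Pattern LOWERCASE_TITLE = Pattern.compile("\\b(dr|mr|mrs|ms)\\.");

    // A period squeezed between a lower-case and an upper-case letter, e.g. "Door.Noone".
    private static final Pattern MISSING_SPACE = Pattern.compile("(?<=[a-z])\\.(?=[A-Z])");

    public static String normalize(String paragraph) {
        // Capitalise the first letter of each matched title: "dr." -> "Dr."
        Matcher m = LOWERCASE_TITLE.matcher(paragraph);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String title = m.group(1);
            String fixed = Character.toUpperCase(title.charAt(0)) + title.substring(1) + ".";
            m.appendReplacement(sb, Matcher.quoteReplacement(fixed));
        }
        m.appendTail(sb);
        // Insert the missing space after the period that ends a sentence.
        return MISSING_SPACE.matcher(sb.toString()).replaceAll(". ");
    }

    public static void main(String[] args) {
        System.out.println(normalize("Close the Door.Noone is out there"));
        System.out.println(normalize("They went to meet dr. Kashyap."));
    }
}
```

The normalised string can then be passed to whichever sentence splitter is in use; whether this removes a particular failure still has to be checked against that splitter.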
You can also implement it with a regular expression, as follows:
public static List<String> breakIntoSentencesCustomRESplitter(String paragraph){
    List<String> sentences = new ArrayList<String>();
    Pattern re = Pattern.compile("[^.!?\\s][^.!?]*(?:[.!?](?!['\"]?\\s|$)[^.!?]*)*[.!?]?['\"]?(?=\\s|$)", Pattern.MULTILINE | Pattern.COMMENTS);
    Matcher reMatcher = re.matcher(paragraph);
    while (reMatcher.find()) {
        sentences.add(reMatcher.group());
    }
    return sentences;

}

Examples
    paragraph = "Hi. How are you? This is Mike.";
    SentenceDetector.breakIntoSentencesCustomRESplitter(paragraph).forEach(sentence -> System.out.println(sentence));

    //Failed at Door.Noone
    paragraph = "Close the Door.Noone is out there";
    SentenceDetector.breakIntoSentencesCustomRESplitter(paragraph).forEach(sentence -> System.out.println(sentence));

    //Failed at Mr., mrs.
    paragraph = "Really!! I cant believe. Mr. Wilson can come any moment to receive mrs. watson.";
    SentenceDetector.breakIntoSentencesCustomRESplitter(paragraph).forEach(sentence -> System.out.println(sentence));

    //Failed at dr.
    paragraph = "Radhika, Mohan, and Shaik went to meet dr. Kashyap to raise fund for poor patients.";
    SentenceDetector.breakIntoSentencesCustomRESplitter(paragraph).forEach(sentence -> System.out.println(sentence));

    //Failed at U.S.
    paragraph = "This is how I tried to split a paragraph into a sentence. But, there is a problem. My paragraph includes dates like Jan.13, 2014 , words like U.S. and numbers like 2.2. They all got splitted by the above code.";
    SentenceDetector.breakIntoSentencesCustomRESplitter(paragraph).forEach(sentence -> System.out.println(sentence));

    paragraph = "www.thinkzarahatke.com is the second site I developed. You can send mail to admin@thinkzarahatke.com";
    SentenceDetector.breakIntoSentencesCustomRESplitter(paragraph).forEach(sentence -> System.out.println(sentence));

However, the error rate is quite high. Another approach is to use BreakIterator:
public static List<String> breakIntoSentencesBreakIterator(String paragraph){
    List<String> sentences = new ArrayList<String>();
    // Use the English-locale instance directly; calling getSentenceInstance() again
    // would silently fall back to the default locale.
    BreakIterator sentenceInstance =
            BreakIterator.getSentenceInstance(Locale.ENGLISH);
    sentenceInstance.setText(paragraph);

    // Walk the boundaries front to back so the sentences come out in reading order.
    int start = sentenceInstance.first();
    for (int end = sentenceInstance.next();
         end != BreakIterator.DONE;
         start = end, end = sentenceInstance.next()) {
        sentences.add(paragraph.substring(start, end));
    }

    return sentences;
}

Examples:

    paragraph = "Hi. How are you? This is Mike.";
    SentenceDetector.breakIntoSentencesBreakIterator(paragraph).forEach(sentence -> System.out.println(sentence));

    //Failed at Door.Noone
    paragraph = "Close the Door.Noone is out there";
    SentenceDetector.breakIntoSentencesBreakIterator(paragraph).forEach(sentence -> System.out.println(sentence));

    //Failed at Mr.
    paragraph = "Really!! I cant believe. Mr. Wilson can come any moment to receive mrs. watson.";
    SentenceDetector.breakIntoSentencesBreakIterator(paragraph).forEach(sentence -> System.out.println(sentence));

    //Failed at dr.
    paragraph = "Radhika, Mohan, and Shaik went to meet dr. Kashyap to raise fund for poor patients.";
    SentenceDetector.breakIntoSentencesBreakIterator(paragraph).forEach(sentence -> System.out.println(sentence));


    paragraph = "This is how I tried to split a paragraph into a sentence. But, there is a problem. My paragraph includes dates like Jan.13, 2014 , words like U.S. and numbers like 2.2. They all got splitted by the above code.";
    SentenceDetector.breakIntoSentencesBreakIterator(paragraph).forEach(sentence -> System.out.println(sentence));

    paragraph = "www.thinkzarahatke.com is the second site I developed. You can send mail to admin@thinkzarahatke.com";
    SentenceDetector.breakIntoSentencesBreakIterator(paragraph).forEach(sentence -> System.out.println(sentence));

Benchmark:

  • Custom regex: 7 ms
  • BreakIterator: 143 ms
  • OpenNLP: 255 ms
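
The answer does not say how these timings were taken. A rough way to collect comparable numbers is sketched below, assuming a simple warm-up followed by averaged timed runs; the class and method names are hypothetical, a BreakIterator-based splitter stands in as the workload, and a dedicated harness such as JMH would give more trustworthy results.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class SplitterTiming {

    // Stand-in workload: a forward BreakIterator split (any of the three methods could be timed instead).
    static List<String> split(String paragraph) {
        List<String> sentences = new ArrayList<>();
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(paragraph);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            sentences.add(paragraph.substring(start, end).trim());
        }
        return sentences;
    }

    // Runs the task `warmup` times untimed, then returns the average milliseconds per timed run.
    static double averageMillis(Runnable task, int warmup, int runs) {
        for (int i = 0; i < warmup; i++) task.run();
        long begin = System.nanoTime();
        for (int i = 0; i < runs; i++) task.run();
        return (System.nanoTime() - begin) / 1_000_000.0 / runs;
    }

    public static void main(String[] args) {
        String paragraph = "Hi. How are you? This is Mike.";
        double ms = averageMillis(() -> split(paragraph), 1_000, 10_000);
        System.out.printf("BreakIterator: %.4f ms per call%n", ms);
    }
}
```

One-off costs such as loading the OpenNLP model from disk can dominate a single measured call, so it matters whether they are counted inside or outside the timed task.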
