Split a string into sentences

Question


29

I wrote this piece of code, which splits a string and stores the parts in a string array:

String[] sSentence = sResult.split("[a-z]\\.\\s+");

I added [a-z] because I wanted to get around the abbreviation problem. But now my result looks like this:

Furthermore, when Everett tried to teach them basic mathematics, they proved unresponsiv

I see that I lose the text matched by the pattern given to split. Losing the period is fine, but losing the last letter of the word changes its meaning.

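Could someone help me with this? Could someone also help me deal with abbreviations? Since I split the string on periods, I do not want to lose the abbreviations.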

- leba-lev

4 Answers

13

It may be difficult to get a regular expression to work in all cases, but to fix your immediate problem you can use a lookbehind:

String sResult = "This is a test. This is a T.L.A. test.";
String[] sSentence = sResult.split("(?<=[a-z])\\.\\s+");

Result:

This is a test
This is a T.L.A. test.

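Note that some abbreviations do not end with a capital letter, such as abbrev., Mr., etc. And there are also sentences that do not end with a period!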
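To make the difference concrete, here is a minimal sketch that runs both patterns on the sample string above. The class and method names are only for illustration. With the original pattern, [a-z] is part of the delimiter and is thrown away by split; with the lookbehind, the letter is only checked, not consumed.

import java.util.Arrays;

public class LookbehindSplitDemo {

    // Pattern from the question: the letter before the period is part of
    // the match, so split() discards it together with the period.
    static String[] splitConsuming(String text) {
        return text.split("[a-z]\\.\\s+");
    }

    // Lookbehind version: (?<=[a-z]) only asserts that a lowercase letter
    // precedes the period; the letter itself stays in the sentence.
    static String[] splitWithLookbehind(String text) {
        return text.split("(?<=[a-z])\\.\\s+");
    }

    public static void main(String[] args) {
        String text = "This is a test. This is a T.L.A. test.";
        // prints [This is a tes, This is a T.L.A. test.]
        System.out.println(Arrays.toString(splitConsuming(text)));
        // prints [This is a test, This is a T.L.A. test.]
        System.out.println(Arrays.toString(splitWithLookbehind(text)));
    }
}

Either way the period that ends the first sentence is still dropped, and a lowercase abbreviation such as "abbrev." followed by a space would still be treated as a sentence end.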

- Mark Byers

1

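This will fail for 9.3% of sentences. There are also sentences that use ellipses, sentences with typos, and so on. Whatever you do, your code will make mistakes from a human point of view. - Stephen C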

4

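If you can, use a natural language processing tool such as LingPipe. There are many subtleties that are very hard to catch with regular expressions, e.g. (e.g. :-)), Mr., abbreviations, ellipses (...), and so on.

The LingPipe website has a very easy-to-follow tutorial on sentence detection.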

- João Silva

Hi, I had a look at the tutorial. It looks perfect, but I can't seem to figure out how to use it in Eclipse. Could you help me? - leba-lev

2

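A late reply, but for future visitors like me who land here after a long search: using an OpenNLP model was the best option for me. It worked on all my text samples, including the critical one mentioned by @nbz in the comments:

My friend, Mr. Jones, has a new dog. This is a test. This is a T.L.A. test. Now with a Dr. in it.

Split into one sentence per line: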

My friend, Mr. Jones, has a new dog.
This is a test.
This is a T.L.A. test.
Now with a Dr. in it.

You need to add the .jar libraries to your project, along with the trained model en-sent.bin.

Here is a tutorial that gets you up and running quickly:

https://www.tutorialkart.com/opennlp/sentence-detection-example-in-opennlp/

And another one for setting it up in Eclipse:

https://www.tutorialkart.com/opennlp/how-to-setup-opennlp-java-project/

This is what the code looks like:

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
 
import opennlp.tools.util.InvalidFormatException;
 
import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;
 
/**
* Sentence Detection Example in openNLP using Java
* @author tutorialkart
*/
public class SentenceDetectExample {
 
    public static void main(String[] args) {
        try {
            new SentenceDetectExample().sentenceDetect();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
 
    /**
     * This method is used to detect sentences in a paragraph/string
     * @throws InvalidFormatException
     * @throws IOException
     */
    public void sentenceDetect() throws InvalidFormatException, IOException {
        String paragraph = "This is a statement. This is another statement. Now is an abstract word for time, that is always flying.";
 
        // refer to model file "en-sent.bin", available at link http://opennlp.sourceforge.net/models-1.5/
        // try-with-resources closes the stream even if loading the model fails
        try (InputStream is = new FileInputStream("en-sent.bin")) {
            SentenceModel model = new SentenceModel(is);

            // feed the model to SentenceDetectorME class
            SentenceDetectorME sdetector = new SentenceDetectorME(model);

            // detect sentences in the paragraph
            String[] sentences = sdetector.sentDetect(paragraph);

            // print the sentences detected, to console
            for (String sentence : sentences) {
                System.out.println(sentence);
            }
        }
    }
}

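Because the library is bundled with your project, it also works offline, which is a big plus. As @Julien Silland's accepted answer rightly says, this is not a straightforward process, and letting a trained model do it for you is the best choice.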

- Jouan H. Sulaiman


- Julien Silland · Accepted Answer

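Parsing sentences is far from trivial, even for Latin-script languages like English. A naive approach like the one you outline in your question will fail often enough to prove useless in practice.

A better approach is to use a BreakIterator configured with the right Locale.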

BreakIterator iterator = BreakIterator.getSentenceInstance(Locale.US);
String source = "This is a test. This is a T.L.A. test. Now with a Dr. in it.";
iterator.setText(source);
int start = iterator.first();
for (int end = iterator.next();
    end != BreakIterator.DONE;
    start = end, end = iterator.next()) {
  System.out.println(source.substring(start,end));
}

Which yields the following result:

This is a test.
This is a T.L.A. test.
Now with a Dr. in it.
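If the sentences are needed as a collection rather than printed, the same loop can be wrapped in a small helper. This is only a sketch: the class name and the splitSentences method are made up for illustration, and trimming each sentence is my own addition, since the substrings returned by the iterator keep their trailing whitespace.

import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class BreakIteratorSentences {

    // Hypothetical helper: collects the sentences found by a
    // locale-specific BreakIterator into a list.
    static List<String> splitSentences(String text, Locale locale) {
        BreakIterator iterator = BreakIterator.getSentenceInstance(locale);
        iterator.setText(text);

        List<String> sentences = new ArrayList<>();
        int start = iterator.first();
        for (int end = iterator.next();
             end != BreakIterator.DONE;
             start = end, end = iterator.next()) {
            // Each boundary pair includes the whitespace after the sentence.
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        String source = "This is a test. This is a T.L.A. test. Now with a Dr. in it.";
        for (String sentence : splitSentences(source, Locale.US)) {
            System.out.println(sentence);
        }
    }
}

BreakIterator is rule-based rather than trained, so it is worth checking it against your own text, in particular abbreviations followed by a capitalized word.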