Splitting a string into sentences using a regular expression

Question

24

I have some random text stored in $sentences. Using a regular expression, I want to split the text into sentences, like this:

function splitSentences($text) {
    $re = '/                # Split sentences on whitespace between them.
        (?<=                # Begin positive lookbehind.
          [.!?]             # Either an end of sentence punct,
        | [.!?][\'"]        # or end of sentence punct and quote.
        )                   # End positive lookbehind.
        (?<!                # Begin negative lookbehind.
          Mr\.              # Skip either "Mr."
        | Mrs\.             # or "Mrs.",
        | T\.V\.A\.         # or "T.V.A.",
                            # or... (you get the idea).
        )                   # End negative lookbehind.
        \s+                 # Split on whitespace between sentences.
        /ix';

    $sentences = preg_split($re, $text, -1, PREG_SPLIT_NO_EMPTY);
    return $sentences;
}

$sentences = splitSentences($sentences);

print_r($sentences);

It works well.

However, if the text contains Unicode characters, it is not split into sentences:

$sentences = 'Entertainment media properties.Â Fairy Tail and Tokyo Ghoul.';

Or in this case:

$sentences = "Entertainment media properties.&Acirc;&nbsp; Fairy Tail and Tokyo Ghoul.";

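What can I do to make it work when the text contains Unicode characters?

Here is an ideone for testing.

Bounty information

I'm looking for a complete solution. Before posting an answer, please read my comment thread with WiktorStribiżew for more relevant information about this issue.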

- Henrik Petterson

1

I'll put a 50-point bounty on this question as soon as it is eligible. - Henrik Petterson

1

You need to use the /u modifier. - Wiktor Stribiżew

1

I just made the whitespace optional by changing \s+ to \s*. I see Henry is quick at reading other people's comments :) - Wiktor Stribiżew

1

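@WiktorStribiżew I see, thanks a lot for the information. I'll keep this question open and, once it is eligible, offer a 50-point bounty for a "bulletproof" solution, if it can be turned into code. - Henrik Petterson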

1

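@WiktorStribiżew, I've started a 200-point bounty, because a well-thought-out, complete solution takes a lot of effort. It can be a single regex or several. Since you seem to be the regex guru here, feel free to give it a shot ;-) - Henrik Petterson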

7 Answers

6

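Â is what you see when the UTF-8 encoding of U+00A0 (no-break space) is printed to a page or console that is interpreted as Latin-1. So I think you have a non-breaking space between the sentences rather than a normal space. \s can match a non-breaking space too, but you need the /u modifier to tell preg that you are sending it a UTF-8 encoded string. Otherwise it will guess Latin-1, just like your print command, and see two characters: Â followed by a no-break space.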

- bobince
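
For illustration, here is a minimal sketch of the point above. It assumes PHP 7.0 or later (for the "\u{00A0}" escape), and the pattern is a trimmed-down stand-in for the question's regex with the abbreviation list left out; the variable names are made up for this example.

```php
<?php
// U+00A0 (no-break space) written as an escape, so the example does not
// depend on how this file or the console is encoded. Requires PHP 7.0+.
$text = "Entertainment media properties.\u{00A0}Fairy Tail and Tokyo Ghoul.";

// Simplified stand-in for the question's pattern (no abbreviation list).
// With /u the subject is treated as UTF-8, so \s can match U+00A0.
$parts = preg_split('/(?<=[.!?])\s+/u', $text, -1, PREG_SPLIT_NO_EMPTY);

print_r($parts);
```

The only change relative to the question's approach is the extra u in the modifier list.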

1

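Would you mind giving me some sample code for the /u modifier? I can't seem to get it working the way you suggest. Here is http://ideone.com/ZQhPSV for reference. Also, please see my conversation with WiktorStribiżew above. - Henrik Petterson

Replace /ix with /uix. - bobince

I tried that, but it doesn't split the sentences. See: http://ideone.com/m164fp - Henrik Petterson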

3

ideone's input is already UTF-8 encoded, so by adding 'Â ' you have double-UTF-8-encoded the input string. Try it with the real input string. - bobince
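
To make the double-encoding remark concrete, here is a small sketch (PHP 7.0+ assumed for the "\u{...}" escapes; the variable names and the shortened test strings are hypothetical). It compares the bytes of a real no-break space with the bytes of the literal text "Â" plus no-break space, and shows that only the former gives the simplified pattern a whitespace character right after the full stop.

```php
<?php
// Requires PHP 7.0+ for the "\u{...}" escapes.
$nbsp     = "\u{00A0}";          // a real no-break space
$mojibake = "\u{00C2}\u{00A0}";  // the characters "Â" + no-break space

echo bin2hex($nbsp), "\n";       // c2a0
echo bin2hex($mojibake), "\n";   // c382c2a0

// Simplified stand-in for the question's pattern.
$re = '/(?<=[.!?])\s+/u';

// Real input: the full stop is directly followed by whitespace.
$real = preg_split($re, "One.{$nbsp}Two.");

// Double-encoded input: the full stop is followed by "Â", which is a
// letter, so the lookbehind + \s+ never lines up and nothing is split.
$doubled = preg_split($re, "One.{$mojibake}Two.");

var_dump(count($real), count($doubled));
```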

3

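Henrik Petterson, please read this in full, because I need to repeat a few things that have already been mentioned above.

As many people above have said, it is correct that the regex works with Unicode characters once you add the /u modifier, and it works perfectly in the example below.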

http://ideone.com/750lMn

<?php


    function splitSentences($text) {
        $re = '/# Split sentences on whitespace between them.
            (?<=                # Begin positive lookbehind.
              [.!?]             # Either an end of sentence punct,
            | [.!?][\'"]        # or end of sentence punct and quote.
            )                   # End positive lookbehind.
            (?<!                # Begin negative lookbehind.
              Mr\.              # Skip either "Mr."
            | Mrs\.             # or "Mrs.",
            | Ms\.              # or "Ms.",
            | Jr\.              # or "Jr.",
            | Dr\.              # or "Dr.",
            | Prof\.            # or "Prof.",
            | Vol\.             # or "Vol.",
            | A\.D\.            # or "A.D.",
            | B\.C\.            # or "B.C.",
            | Sr\.              # or "Sr.",
            | T\.V\.A\.         # or "T.V.A.",
                                # or... (you get the idea).
            )                   # End negative lookbehind.
            \s+                 # Split on whitespace between sentences.
            /uix';

        $sentences = preg_split($re, $text, -1, PREG_SPLIT_NO_EMPTY);
        return $sentences;
    }

$sentences = 'Entertainment media properties. Ã Fairy Tail and Tokyo Ghoul. Entertainment media properties. &Acirc;&nbsp; Fairy Tail and Tokyo Ghoul.';

$sentences = splitSentences($sentences);

print_r($sentences);

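The examples you provided in the comments do not work because they have no whitespace character between the two sentences, while your code explicitly requires whitespace between sentences.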

\s+                 # Split on whitespace between sentences.

The following example, which you mentioned in the comments above, fails only because there is no space before Â.

http://ideone.com/m164fp

- Puneet Singh

3

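If the whitespace is unreliable, you can instead match a . followed by any number of whitespace characters and then an uppercase letter. You can use the Unicode character property \p{Lu} to match any uppercase UTF-8 letter. You only need to exclude abbreviations, which are usually followed by a proper name (a person, a company and so on), because proper names also start with an uppercase letter.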

function splitSentences($text) {
    $re = '/                # Split sentences ending with a dot
        .+?                 # Match everything before, until we find
        (
          $ |               # the end of the string, or
          \.                # a dot
          (?<!              #  Begin negative lookbehind.
            Mr\.            #   Skip either "Mr."
          | Mrs\.           #   or "Mrs.",
                            #   or... (you get the idea).
          )                 #   End negative lookbehind.
          "?                #   Optionally match a quote
          \s*               #   Any number of whitespaces
          (?=               #  Begin positive lookahead
            \p{Lu} |        #   an upper case letter, or
            "               #   a quote
          )
        )
        /iux';

    if (!preg_match_all($re, $text, $matches, PREG_PATTERN_ORDER)) { 
        return [];
    }

    $sentences = array_map('trim', $matches[0]);

    return $sentences;
}

$text = "Mr. Entertainment media properties.Â Fairy Tail 3.5 and Tokyo Ghoul.";
$sentences = splitSentences($text);

print_r($sentences);

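Note: this answer may not be accurate enough for your situation. I can't judge that. It does solve the problem as described above and is easy to understand.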

- Arnold Daniels

2

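I think a fail-proof sentence splitter is impossible, because user-generated content is not always grammatically and syntactically correct. Moreover, 100% correct results are out of reach because the tools that scrape or fetch the content may not deliver clean text; it can contain whitespace or punctuation junk. Finally, business nowadays leans toward a "good enough" strategy: if you manage to split the text correctly 95% of the time, that is considered a success in most cases.

Now, any sentence-splitting task is a natural language processing (NLP) task, and one, two or three regexes are not enough for it. Rather than thinking up your own regex chain, I'd suggest using an existing NLP library for the job.

1. vanderlee's php-sentence (relies on reasonably grammatically correct punctuation)

The following is a rough list of the rules used to split sentences:
- Each linebreak separates sentences.
- The end of the text indicates the end of a sentence if it is not otherwise ended by proper punctuation.
- Sentences must be at least two words long, unless ended by a linebreak or the end of the text.
- An empty line is not a sentence.
- Each question mark, exclamation mark, or combination of them is considered the end of a sentence.
- A single period is considered the end of a sentence, unless...
  - it is preceded by only one word, or...
  - it is followed by only one word.
- A sequence of multiple periods is not considered the end of a sentence.

Usage example: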

<?php
    require_once 'classes/autoloader.php'; // Include the autoloader.
    $text   = "Hello there, Mr. Smith. What're you doing today... Smith,"
            . " my friend?\n\nI hope it's good. This last sentence will"
            . " cost you $2.50! Just kidding :)"; // This is the test text we're going to use
    $Sentence   = new Sentence;   // Create a new instance
    $sentences  = $Sentence->split($text); // Split into array of sentences
    $count      = $Sentence->count($text); // Count the number of sentences
?>

NlpTools is another library you can use for this task. Below is sample code that implements a simple rule-based sentence tokenizer:

Sample code:

<?php
include ('vendor/autoload.php');
 
use \NlpTools\Tokenizers\ClassifierBasedTokenizer;
use \NlpTools\Tokenizers\WhitespaceTokenizer;
use \NlpTools\Classifiers\ClassifierInterface;
use \NlpTools\Documents\DocumentInterface;
 
class EndOfSentence implements ClassifierInterface
{
    public function classify(array $classes, DocumentInterface $d) {
        list($token,$before,$after) = $d->getDocumentData();
 
        $dotcnt = count(explode('.',$token))-1;
        $lastdot = substr($token,-1)=='.';
 
        if (!$lastdot) // assume that all sentences end in full stops
            return 'O';
 
        if ($dotcnt>1) // to catch some naive abbreviations U.S.A.
            return 'O';
 
        return 'EOW';
    }
}
$tok = new ClassifierBasedTokenizer(
    new EndOfSentence(),
    new WhitespaceTokenizer()
);
$text = "We are what we repeatedly do.
        Excellence, then, is not an act, but a habit.";
 
print_r($tok->tokenize($text));
 
// Array
// (
//    [0] => We are what we repeatedly do.
//    [1] => Excellence, then, is not an act, but a habit.
// )

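You can use a PHP/Java bridge to call Java StanfordNLP (here is a Java example that splits text into sentences).

Important: most of the NLP tokenization models I tested do not handle glued sentences well. However, if you add a space after each chain of punctuation marks, the sentence-splitting quality improves. Add this before sending the text to the sentence-splitting function: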

$txt = preg_replace('~\p{P}+~', "$0 ", $txt);

- Wiktor Stribiżew
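
As a quick illustration of what that preg_replace() call does, here is a small sketch on a made-up ASCII input (the sample string and variable names are assumptions for this example). It also shows the side effect on decimals such as "3.50", which is why the result suits counting sentences better than extracting them.

```php
<?php
// Made-up sample with two glued sentences and a decimal number.
$txt = 'Entertainment media properties.Fairy Tail 3.50!?Really';

// Same call as in the answer: append a space after every run of
// one or more punctuation characters (\p{P}+). $0 is the whole match.
$spaced = preg_replace('~\p{P}+~', "$0 ", $txt);

echo $spaced, "\n";
// Entertainment media properties. Fairy Tail 3. 50!? Really
```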

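Thanks for the summary of relevant scripts. I have a question. In the preg_replace() regex example at the end, does it add a space after every punctuation mark? What exactly does it do? There are cases where a space should not be added. For example: "3.50" - Henrik Petterson

It adds a space after each run of one or more punctuation marks, which is useful for counting sentences. If you want to get the sentences themselves, you need somewhat more sophisticated post-processing. - Wiktor Stribiżew

I'm going with @ndn's answer, but I really appreciate you taking the time to post this one; it will be very useful for our unit tests and so on. - Henrik Petterson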

2

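I know this question is old and has already been answered very well by @ndnenkov, but I wanted to clean up the PHP code so that it is more efficient when processing large amounts of text.

Here is my update: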

function sentence_split($text) {
    // put regex tests into an easier to read array
    $regexes = array(
        [
            "is_sentence_boundary"=>false,
            "before"=>'/(?:(?:[\'\"„][\.!?…][\'\"”]\s)|(?:[^\.]\s[A-Z]\.\s)|(?:\b(?:St|Gen|Hon|Prof|Dr|Mr|Ms|Mrs|[JS]r|Col|Maj|Brig|Sgt|Capt|Cmnd|Sen|Rev|Rep|Revd)\.\s)|(?:\b(?:St|Gen|Hon|Prof|Dr|Mr|Ms|Mrs|[JS]r|Col|Maj|Brig|Sgt|Capt|Cmnd|Sen|Rev|Rep|Revd)\.\s[A-Z]\.\s)|(?:\bApr\.\s)|(?:\bAug\.\s)|(?:\bBros\.\s)|(?:\bCo\.\s)|(?:\bCorp\.\s)|(?:\bDec\.\s)|(?:\bDist\.\s)|(?:\bFeb\.\s)|(?:\bInc\.\s)|(?:\bJan\.\s)|(?:\bJul\.\s)|(?:\bJun\.\s)|(?:\bMar\.\s)|(?:\bNov\.\s)|(?:\bOct\.\s)|(?:\bPh\.?D\.\s)|(?:\bSept?\.\s)|(?:\b\p{Lu}\.\p{Lu}\.\s)|(?:\b\p{Lu}\.\s\p{Lu}\.\s)|(?:\bcf\.\s)|(?:\be\.g\.\s)|(?:\besp\.\s)|(?:\bet\b\s\bal\.\s)|(?:\bvs\.\s)|(?:\p{Ps}[!?]+\p{Pe} ))\Z/su',
            "after"=>'/\A(?:)/su'
        ],
        [
            "is_sentence_boundary"=>false,
            "before"=>'/(?:(?:[\.\s]\p{L}{1,2}\.\s))\Z/su',
            "after"=>'/\A(?:[\p{N}\p{Ll}])/su'
        ],
        [
            "is_sentence_boundary"=>false,
            "before"=>'/(?:(?:[\[\(]*\.\.\.[\]\)]* ))\Z/su',
            "after"=>'/\A(?:[^\p{Lu}])/su'
        ],
        [
            "is_sentence_boundary"=>false,
            "before"=>'/(?:(?:\b(?:pp|[Vv]iz|i\.?\s*e|[Vvol]|[Rr]col|maj|Lt|[Ff]ig|[Ff]igs|[Vv]iz|[Vv]ols|[Aa]pprox|[Ii]ncl|Pres|[Dd]ept|min|max|[Gg]ovt|lb|ft|c\.?\s*f|vs)\.\s))\Z/su',
            "after"=>'/\A(?:[^\p{Lu}]|I)/su'
        ],
        [
            "is_sentence_boundary"=>false,
            "before"=>'/(?:(?:\b[Ee]tc\.\s))\Z/su',
            "after"=>'/\A(?:[^\p{Lu}])/su'
        ],
        [
            "is_sentence_boundary"=>false,
            "before"=>'/(?:(?:[\.!?…]+\p{Pe} )|(?:[\[\(]*…[\]\)]* ))\Z/su',
            "after"=>'/\A(?:\p{Ll})/su'
        ],
        [
            "is_sentence_boundary"=>false,
            "before"=>'/(?:(?:\b\p{L}\.))\Z/su',
            "after"=>'/\A(?:\p{L}\.)/su'
        ],
        [
            "is_sentence_boundary"=>false,
            "before"=>'/(?:(?:\b\p{L}\.\s))\Z/su',
            "after"=>'/\A(?:\p{L}\.\s)/su'
        ],
        [
            "is_sentence_boundary"=>false,
            "before"=>'/(?:(?:\b[Ff]igs?\.\s)|(?:\b[nN]o\.\s))\Z/su',
            "after"=>'/\A(?:\p{N})/su'
        ],
        [
            "is_sentence_boundary"=>false,
            "before"=>'/(?:(?:[\"”\']\s*))\Z/su',
            "after"=>'/\A(?:\s*\p{Ll})/su'
        ],
        [
            "is_sentence_boundary"=>true,
            "before"=>'/(?:(?:[\.!?…][\x{00BB}\x{2019}\x{201D}\x{203A}\"\'\p{Pe}\x{0002}]*\s)|(?:\r?\n))\Z/su',
            "after"=>'/\A(?:)/su'
        ],
        [
            "is_sentence_boundary"=>true,
            "before"=>'/(?:(?:[\.!?…][\'\"\x{00BB}\x{2019}\x{201D}\x{203A}\p{Pe}\x{0002}]*))\Z/su',
            "after"=>'/\A(?:\p{Lu}[^\p{Lu}])/su'
        ],
        [
            "is_sentence_boundary"=>true,
            "before"=>'/(?:(?:\s\p{L}[\.!?…]\s))\Z/su',
            "after"=>'/\A(?:\p{Lu}\p{Ll})/su'
        ]
    );

    $sentences = array();
    $sentence = '';
    $before = '';
    $testLen = 10; // Used to set before/after chunk sizes. 10 seems to be the smallest that works the best.
    $after = substr($text, 0, $testLen); // start with the first set of chars.

    while($text != '') {
        // run regex tests
        foreach($regexes as $reg) {
            if(preg_match($reg["before"], $before) && preg_match($reg["after"], $after)) {
                // if this passes a sentence ending test then add to the array
                if($reg["is_sentence_boundary"]) {
                    $sentences[] = $sentence;
                    $sentence = '';
                }
                break;
            }
        }

        // add the char to the sentence
        $sentence .= $after[0];

        // eat at text until empty to end loop
        $text = substr($text, 1);

        // add a char behind the before var and then remove the first char
        $before = substr($before.$after[0], -$testLen);

        // create a new after with the first chars from the text
        $after = substr($text, 0, $testLen);

    }

    if($sentence != '') {
        $sentences[] = $sentence . $after;
    }
    return $sentences;
}
$text = "Mr. Entertainment media properties.Â Fairy Tail 3.5 and Tokyo Ghoul.";
print_r(sentence_split($text));

- Jonathan Rowley

1

There is a very sophisticated Unicode text segmentation algorithm that handles all kinds of text boundaries, including sentence boundaries.

http://unicode.org/reports/tr29/

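The best-known implementation of this algorithm is ICU.

I found this class: http://php.net/manual/en/class.intlbreakiterator.php, but it does not seem to be in a mainstream release yet, only in git.

So if you want to solve this very complex problem in the best possible way, I suggest you either:

Get this class from somewhere
Write a small PHP extension that wraps the ICU functionality you need; as long as you only build the specific functions, that is actually quite simple.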

- Artyom
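
IntlBreakIterator has since shipped with PHP's intl extension (PHP 5.5 and later), so a sketch along these lines should work wherever that extension is loaded. The function name icu_split_sentences and the 'en_US' default locale are assumptions for this example, and ICU's default rules know nothing about abbreviations such as "Mr.", so treat it as a starting point rather than a complete solution.

```php
<?php
// Sketch only: requires the intl extension (ICU bindings).
// icu_split_sentences is a hypothetical helper name.
function icu_split_sentences(string $text, string $locale = 'en_US'): array
{
    if (!extension_loaded('intl')) {
        throw new RuntimeException('The intl extension is required.');
    }

    $iterator = IntlBreakIterator::createSentenceInstance($locale);
    $iterator->setText($text);

    $sentences = [];
    // getPartsIterator() yields the text between consecutive boundaries.
    foreach ($iterator->getPartsIterator() as $part) {
        $part = trim($part);
        if ($part !== '') {
            $sentences[] = $part;
        }
    }
    return $sentences;
}

if (extension_loaded('intl')) {
    print_r(icu_split_sentences('Entertainment media properties. Fairy Tail and Tokyo Ghoul.'));
}
```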

- ndnenkov · Accepted Answer

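As should be expected, no kind of natural language processing is a trivial task. The reason is that natural languages are evolutionary systems. No single person ever sat down and decided which ideas were good and which were not. Every rule has 20-40% exceptions. With that said, the solution below relies mainly on regexes.

The idea is to go over the text incrementally.
At any given time, the current chunk of text is held in two different parts. One is the candidate for the substring before a sentence boundary, and the other is the candidate for the substring after it.
The first 10 regex pairs detect positions that look like sentence boundaries but actually are not. In those cases, before and after are advanced without registering a new sentence.
If none of these pairs matches, the last 3 pairs are tried, possibly detecting a boundary.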
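
To show the before/after pairing in isolation, here is a heavily reduced sketch with just two made-up rule pairs (they are not the rules from the library, and classify_position is a hypothetical name). The first matching pair wins, and "looks like a boundary but is not" rules are listed ahead of the real boundary rule, mirroring the ordering described above.

```php
<?php
// Hypothetical, heavily reduced rule set. Each rule pairs a regex anchored
// at the END of $before (\Z) with one anchored at the START of $after (\A).
function classify_position(string $before, string $after): string
{
    $rules = [
        // Looks like a boundary but is not: a title such as "Mr. " precedes.
        ['boundary' => false,
         'before'   => '/\b(?:Mr|Mrs|Dr)\.\s\Z/su',
         'after'    => '/\A/su'],
        // A boundary: terminator + whitespace, then an uppercase letter.
        ['boundary' => true,
         'before'   => '/[.!?]\s\Z/su',
         'after'    => '/\A\p{Lu}/su'],
    ];

    foreach ($rules as $rule) {
        if (preg_match($rule['before'], $before)
            && preg_match($rule['after'], $after)) {
            // First matching pair decides; later pairs are not consulted.
            return $rule['boundary'] ? 'boundary' : 'not a boundary';
        }
    }
    return 'undecided';
}

echo classify_position('Hello Mr. ', 'Smith went'), "\n";   // not a boundary
echo classify_position('properties. ', 'Fairy Tail'), "\n"; // boundary
echo classify_position('Fairy Ta', 'il and To'), "\n";      // undecided
```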

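As for where these regexes come from: I translated this Ruby library, which was generated based on this paper. If you truly want to understand them, there is no alternative but to read the paper.

As for accuracy: I encourage you to test it with different texts. After some experimentation, I was very pleasantly surprised.

In terms of performance, the regexes should be fast, because all of them have either a \A or a \Z anchor, there are almost no repetition quantifiers, and where there are, no backtracking is possible. Still, regexes are regexes. If you plan to use this in tight loops on huge chunks of text, you will have to do some benchmarking.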
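
For the benchmarking mentioned above, a rough timing harness could look like the sketch below. The helper name average_split_time, the stand-in splitter and the sample text are all assumptions for this example; pass in whichever splitter function you actually want to measure.

```php
<?php
// Hypothetical helper: average wall-clock seconds per call of $splitter.
function average_split_time(callable $splitter, string $text, int $runs = 20): float
{
    $start = microtime(true);
    for ($i = 0; $i < $runs; $i++) {
        $splitter($text);
    }
    return (microtime(true) - $start) / $runs;
}

// Stand-in splitter so the sketch runs on its own; swap in the real one.
$naive = function (string $text): array {
    return preg_split('/(?<=[.!?])\s+/u', $text, -1, PREG_SPLIT_NO_EMPTY);
};

// Made-up sample: the same sentence repeated to get a larger input.
$sample = str_repeat('Fairy Tail 3.5 and Tokyo Ghoul. ', 1000);

printf("%.6f s per call\n", average_split_time($naive, $sample));
```

Timings from a harness like this depend on the machine and the input, so compare candidates on your own representative texts.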

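Mandatory disclaimer: excuse my rusty PHP skills. The following code might not be the most idiomatic PHP ever, but it should still be clear enough to get the point across.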

function sentence_split($text) {
    $before_regexes = array('/(?:(?:[\'\"„][\.!?…][\'\"”]\s)|(?:[^\.]\s[A-Z]\.\s)|(?:\b(?:St|Gen|Hon|Prof|Dr|Mr|Ms|Mrs|[JS]r|Col|Maj|Brig|Sgt|Capt|Cmnd|Sen|Rev|Rep|Revd)\.\s)|(?:\b(?:St|Gen|Hon|Prof|Dr|Mr|Ms|Mrs|[JS]r|Col|Maj|Brig|Sgt|Capt|Cmnd|Sen|Rev|Rep|Revd)\.\s[A-Z]\.\s)|(?:\bApr\.\s)|(?:\bAug\.\s)|(?:\bBros\.\s)|(?:\bCo\.\s)|(?:\bCorp\.\s)|(?:\bDec\.\s)|(?:\bDist\.\s)|(?:\bFeb\.\s)|(?:\bInc\.\s)|(?:\bJan\.\s)|(?:\bJul\.\s)|(?:\bJun\.\s)|(?:\bMar\.\s)|(?:\bNov\.\s)|(?:\bOct\.\s)|(?:\bPh\.?D\.\s)|(?:\bSept?\.\s)|(?:\b\p{Lu}\.\p{Lu}\.\s)|(?:\b\p{Lu}\.\s\p{Lu}\.\s)|(?:\bcf\.\s)|(?:\be\.g\.\s)|(?:\besp\.\s)|(?:\bet\b\s\bal\.\s)|(?:\bvs\.\s)|(?:\p{Ps}[!?]+\p{Pe} ))\Z/su',
        '/(?:(?:[\.\s]\p{L}{1,2}\.\s))\Z/su',
        '/(?:(?:[\[\(]*\.\.\.[\]\)]* ))\Z/su',
        '/(?:(?:\b(?:pp|[Vv]iz|i\.?\s*e|[Vvol]|[Rr]col|maj|Lt|[Ff]ig|[Ff]igs|[Vv]iz|[Vv]ols|[Aa]pprox|[Ii]ncl|Pres|[Dd]ept|min|max|[Gg]ovt|lb|ft|c\.?\s*f|vs)\.\s))\Z/su',
        '/(?:(?:\b[Ee]tc\.\s))\Z/su',
        '/(?:(?:[\.!?…]+\p{Pe} )|(?:[\[\(]*…[\]\)]* ))\Z/su',
        '/(?:(?:\b\p{L}\.))\Z/su',
        '/(?:(?:\b\p{L}\.\s))\Z/su',
        '/(?:(?:\b[Ff]igs?\.\s)|(?:\b[nN]o\.\s))\Z/su',
        '/(?:(?:[\"”\']\s*))\Z/su',
        '/(?:(?:[\.!?…][\x{00BB}\x{2019}\x{201D}\x{203A}\"\'\p{Pe}\x{0002}]*\s)|(?:\r?\n))\Z/su',
        '/(?:(?:[\.!?…][\'\"\x{00BB}\x{2019}\x{201D}\x{203A}\p{Pe}\x{0002}]*))\Z/su',
        '/(?:(?:\s\p{L}[\.!?…]\s))\Z/su');
    $after_regexes = array('/\A(?:)/su',
        '/\A(?:[\p{N}\p{Ll}])/su',
        '/\A(?:[^\p{Lu}])/su',
        '/\A(?:[^\p{Lu}]|I)/su',
        '/\A(?:[^\p{Lu}])/su',
        '/\A(?:\p{Ll})/su',
        '/\A(?:\p{L}\.)/su',
        '/\A(?:\p{L}\.\s)/su',
        '/\A(?:\p{N})/su',
        '/\A(?:\s*\p{Ll})/su',
        '/\A(?:)/su',
        '/\A(?:\p{Lu}[^\p{Lu}])/su',
        '/\A(?:\p{Lu}\p{Ll})/su');
    $is_sentence_boundary = array(false, false, false, false, false, false, false, false, false, false, true, true, true);
    $count = 13;

    $sentences = array();
    $sentence = '';
    $before = '';
    $after = substr($text, 0, 10);
    $text = substr($text, 10);

    while($text != '') {
        for($i = 0; $i < $count; $i++) {
            if(preg_match($before_regexes[$i], $before) && preg_match($after_regexes[$i], $after)) {
                if($is_sentence_boundary[$i]) {
                    array_push($sentences, $sentence);
                    $sentence = '';
                }
                break;
            }
        }

        $first_from_text = $text[0];
        $text = substr($text, 1);
        $first_from_after = $after[0];
        $after = substr($after, 1);
        $before .= $first_from_after;
        $sentence .= $first_from_after;
        $after .= $first_from_text;
    }

    if($sentence != '' && $after != '') {
        array_push($sentences, $sentence.$after);
    }

    return $sentences;
}

$text = "Mr. Entertainment media properties.Â Fairy Tail 3.5 and Tokyo Ghoul.";
print_r(sentence_split($text));