How do I split text into sentences?

180

I have a text file and I need to get a list of sentences out of it.

How can I do this? There are a lot of subtleties, such as periods being used in abbreviations.

My old regex works badly:

re.compile('(\. |^|!|\?)([A-Z][^;↑\.<>@\^&/\[\]]*(\.|!|\?) )',re.M)

I want to do this, but I want to split at a period or a newline. - yishairasowsky
20 Answers

196

The Natural Language Toolkit (nltk.org) has what you need. This group posting indicates it does the job:

import nltk.data

tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
fp = open("test.txt")
data = fp.read()
print('\n-----\n'.join(tokenizer.tokenize(data)))  # Python 3 print()

(I haven't tried it!)


3
@Artyom: it can probably work with Russian -- see Can NLTK/pyNLTK work "per language" (i.e. non-English), and how? - martineau
4
@Artyom: here is a direct link to the online documentation for nltk.tokenize.punkt.PunktSentenceTokenizer. - martineau
19
You might have to run nltk.download() first and download the model -> punkt - Martin Thoma
2
To save some typing: import nltk, then nltk.sent_tokenize(string) - Yibo Yang
2
This fails for sentences ending in quotes, e.g. if we have a sentence that ends with "this." - Fosa

156
This function splits the entire text of Huckleberry Finn into sentences in about 0.1 seconds and handles many of the more painful edge cases that make sentence parsing non-trivial, e.g. "Mr. John Johnson Jr. was born in the U.S.A but earned his Ph.D. in Israel before joining Nike Inc. as an engineer. He also worked at craigslist.org as a business analyst."
# -*- coding: utf-8 -*-
import re
alphabets= "([A-Za-z])"
prefixes = "(Mr|St|Mrs|Ms|Dr)[.]"
suffixes = "(Inc|Ltd|Jr|Sr|Co)"
starters = "(Mr|Mrs|Ms|Dr|Prof|Capt|Cpt|Lt|He\s|She\s|It\s|They\s|Their\s|Our\s|We\s|But\s|However\s|That\s|This\s|Wherever)"
acronyms = "([A-Z][.][A-Z][.](?:[A-Z][.])?)"
websites = "[.](com|net|org|io|gov|edu|me)"
digits = "([0-9])"
multiple_dots = r'\.{2,}'

def split_into_sentences(text: str) -> list[str]:
    """
    Split the text into sentences.

    If the text contains substrings "<prd>" or "<stop>", they would lead 
    to incorrect splitting because they are used as markers for splitting.

    :param text: text to be split into sentences
    :type text: str

    :return: list of sentences
    :rtype: list[str]
    """
    text = " " + text + "  "
    text = text.replace("\n"," ")
    text = re.sub(prefixes,"\\1<prd>",text)
    text = re.sub(websites,"<prd>\\1",text)
    text = re.sub(digits + "[.]" + digits,"\\1<prd>\\2",text)
    text = re.sub(multiple_dots, lambda match: "<prd>" * len(match.group(0)) + "<stop>", text)
    if "Ph.D" in text: text = text.replace("Ph.D.","Ph<prd>D<prd>")
    text = re.sub("\s" + alphabets + "[.] "," \\1<prd> ",text)
    text = re.sub(acronyms+" "+starters,"\\1<stop> \\2",text)
    text = re.sub(alphabets + "[.]" + alphabets + "[.]" + alphabets + "[.]","\\1<prd>\\2<prd>\\3<prd>",text)
    text = re.sub(alphabets + "[.]" + alphabets + "[.]","\\1<prd>\\2<prd>",text)
    text = re.sub(" "+suffixes+"[.] "+starters," \\1<stop> \\2",text)
    text = re.sub(" "+suffixes+"[.]"," \\1<prd>",text)
    text = re.sub(" " + alphabets + "[.]"," \\1<prd>",text)
    if "”" in text: text = text.replace(".”","”.")
    if "\"" in text: text = text.replace(".\"","\".")
    if "!" in text: text = text.replace("!\"","\"!")
    if "?" in text: text = text.replace("?\"","\"?")
    text = text.replace(".",".<stop>")
    text = text.replace("?","?<stop>")
    text = text.replace("!","!<stop>")
    text = text.replace("<prd>",".")
    sentences = text.split("<stop>")
    sentences = [s.strip() for s in sentences]
    if sentences and not sentences[-1]: sentences = sentences[:-1]
    return sentences

Comparison with nltk:

>>> from nltk.tokenize import sent_tokenize

Example 1: split_into_sentences is better here (because it explicitly covers many cases):

>>> text = 'Some sentence. Mr. Holmes...This is a new sentence!And This is another one.. Hi '

>>> split_into_sentences(text)
['Some sentence.',
 'Mr. Holmes...',
 'This is a new sentence!',
 'And This is another one..',
 'Hi']

>>> sent_tokenize(text)
['Some sentence.',
 'Mr.',
 'Holmes...This is a new sentence!And This is another one.. Hi']

Example 2: nltk.tokenize.sent_tokenize is better here (because it uses an ML model):

>>> text = 'The U.S. Drug Enforcement Administration (DEA) says hello. And have a nice day.'

>>> split_into_sentences(text)
['The U.S.',
 'Drug Enforcement Administration (DEA) says hello.',
 'And have a nice day.']

>>> sent_tokenize(text)
['The U.S. Drug Enforcement Administration (DEA) says hello.',
 'And have a nice day.']

31
This is a great solution. However, I added two more lines to it: digits = "([0-9])" in the regex declarations, and text = re.sub(digits + "[.]" + digits, "\\1<prd>\\2", text) in the function. Now it no longer splits a line at a decimal such as 5.5. Thank you for this answer. - Ameya Kulkarni
2
How did you parse the entirety of Huckleberry Finn? Where is that in text format? - PascalVKooten
7
A great solution. In the function I added: if "e.g." in text: text = text.replace("e.g.", "e<prd>g<prd>") and if "i.e." in text: text = text.replace("i.e.", "i<prd>e<prd>"), and it fully solved my problem. - Sisay Chala
7
Fantastic solution, and the comments are very helpful too! To make it a bit more robust: prefixes = "(Mr|St|Mrs|Ms|Dr|Prof|Capt, etc.)[.]", websites = "[.](com|net|org|io|gov|me|edu)", and if "..." in text: text = text.replace("...", "<prd><prd><prd>"). - Dascienz
1
Can this function be amended to treat sentences like this as a single sentence: When a child asks her mother "Where do babies come from?", what should one reply? - twhale

95

Instead of using a regex for splitting the text into sentences, you can also use the nltk library.

>>> from nltk import tokenize
>>> p = "Good morning Dr. Adams. The patient is waiting for you in room number 3."

>>> tokenize.sent_tokenize(p)
['Good morning Dr. Adams.', 'The patient is waiting for you in room number 3.']

Reference: https://dev59.com/qmox5IYBdhLWcg3wCQAq#9474645


This is a better, simpler, and more reusable example than the accepted answer. - Eli O.
If you remove a space after a period, tokenize.sent_tokenize() doesn't work, but tokenizer.tokenize() does! Hmm... - Leonid Ganeline
1
for sentence in tokenize.sent_tokenize(text): print(sentence) - Victoria Stuart
Can it split into just two sentences? Here is the relevant code: - Sunil Garg
2
I've found that nltk.tokenize.sent_tokenize produces incorrect sentence splits when it runs into abbreviations such as i.e. and e.g. - Tedo Vrbanec

20

You can try using spaCy instead of regex. I use it and it does the job well.

import spacy
nlp = spacy.load('en_core_web_sm')  # the bare 'en' shortcut no longer works in recent spaCy releases

text = '''Your text here'''
tokens = nlp(text)

for sent in tokens.sents:
    print(sent.text.strip())  # sent.string was removed in spaCy 3; use sent.text

4
spaCy is great. But if you just need to split text into sentences, using spaCy takes far too long, especially if you are working with a data pipeline. - JFerro
@Berlines I agree, but I can't find any other library that does the job as cleanly as spaCy. If you have any suggestions, I can try them. - Elf
3
For the AWS Lambda serverless users out there: spaCy's supporting data files are often hundreds of MB (the large English model is over 400MB), so you can't use anything like this out of the box, which is a real shame (I'm otherwise a huge fan of spaCy). - Julian H
1
I found spaCy quite bad at splitting my text into sentences; it produced spurious sentences containing only a period. - Tedo Vrbanec

10

Here's a middle-of-the-road approach that doesn't rely on any external libraries. I use list comprehensions to exclude overlaps between abbreviations and terminators, as well as overlaps between variations of terminators, for example: '.' vs. '."'.

abbreviations = {'dr.': 'doctor', 'mr.': 'mister', 'bro.': 'brother', 'bro': 'brother', 'mrs.': 'mistress', 'ms.': 'miss', 'jr.': 'junior', 'sr.': 'senior',
                 'i.e.': 'for example', 'e.g.': 'for example', 'vs.': 'versus'}
terminators = ['.', '!', '?']
wrappers = ['"', "'", ')', ']', '}']


def find_sentences(paragraph):
    end = True
    sentences = []
    while end > -1:
        end = find_sentence_end(paragraph)
        if end > -1:
            sentences.append(paragraph[end:].strip())
            paragraph = paragraph[:end]
    sentences.append(paragraph)
    sentences.reverse()
    return sentences


def find_sentence_end(paragraph):
    [possible_endings, contraction_locations] = [[], []]
    contractions = abbreviations.keys()
    sentence_terminators = terminators + [terminator + wrapper for wrapper in wrappers for terminator in terminators]
    for sentence_terminator in sentence_terminators:
        t_indices = list(find_all(paragraph, sentence_terminator))
        possible_endings.extend(([] if not len(t_indices) else [[i, len(sentence_terminator)] for i in t_indices]))
    for contraction in contractions:
        c_indices = list(find_all(paragraph, contraction))
        contraction_locations.extend(([] if not len(c_indices) else [i + len(contraction) for i in c_indices]))
    possible_endings = [pe for pe in possible_endings if pe[0] + pe[1] not in contraction_locations]
    if len(paragraph) in [pe[0] + pe[1] for pe in possible_endings]:
        max_end_start = max([pe[0] for pe in possible_endings])
        possible_endings = [pe for pe in possible_endings if pe[0] != max_end_start]
    possible_endings = [pe[0] + pe[1] for pe in possible_endings if sum(pe) > len(paragraph) or (sum(pe) < len(paragraph) and paragraph[sum(pe)] == ' ')]
    end = (-1 if not len(possible_endings) else max(possible_endings))
    return end


def find_all(a_str, sub):
    start = 0
    while True:
        start = a_str.find(sub, start)
        if start == -1:
            return
        yield start
        start += len(sub)

I used Karl's find_all function from this post: Find all occurrences of a substring in Python
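
For reference, a quick usage sketch (the paragraph below is made up; how well it splits depends on the abbreviation list defined above):

paragraph = "Budget rose to $3.5 million vs. last year. Planning starts Monday! Is that clear?"
for sentence in find_sentences(paragraph):
    print(sentence)
# Neither the decimal in 3.5 nor the 'vs.' abbreviation should trigger a split here.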


1
Perfect approach! The others don't catch ... and ?! - Shane Smiskol
Great work. One small note: 'i.e.' means 'that is', not 'for example'. - Leon Bambrick

9

I'm a big fan of spaCy, but I recently discovered two new approaches for sentence tokenization: BlingFire from Microsoft (blazing fast) and PySBD from AI2 (supremely accurate).

text = ...

from blingfire import text_to_sentences
sents = text_to_sentences(text).split('\n')

from pysbd import Segmenter
segmenter = Segmenter(language='en', clean=False)
sents = segmenter.segment(text)

I split 20k sentences using five different methods. Here are the elapsed times on an AMD Threadripper Linux machine (a rough timing-harness sketch follows the list):

  • spaCy Sentencizer: 1.16934s
  • spaCy Parse: 25.97063s
  • PySBD: 9.03505s
  • NLTK: 0.30512s
  • BlingFire: 0.07933s
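
A rough sketch of how such a comparison might be timed (this is not the author's actual benchmark; "corpus.txt" is a hypothetical input file, the packages and the en_core_web_sm model must be installed first, and your numbers will differ):

import time

import spacy
from blingfire import text_to_sentences
from nltk.tokenize import sent_tokenize
from pysbd import Segmenter

text = open("corpus.txt", encoding="utf-8").read()  # hypothetical input file

# spaCy 3 style: a rule-based sentencizer-only pipeline vs. the full parser
nlp_rule = spacy.blank("en")
nlp_rule.add_pipe("sentencizer")
nlp_parse = spacy.load("en_core_web_sm")  # for very large texts you may need to raise nlp_parse.max_length

pysbd_seg = Segmenter(language="en", clean=False)

methods = {
    "spaCy Sentencizer": lambda t: [s.text for s in nlp_rule(t).sents],
    "spaCy Parse": lambda t: [s.text for s in nlp_parse(t).sents],
    "PySBD": pysbd_seg.segment,
    "NLTK": sent_tokenize,
    "BlingFire": lambda t: text_to_sentences(t).split("\n"),
}

for name, split in methods.items():
    start = time.perf_counter()
    sentences = split(text)
    print(f"{name}: {len(sentences)} sentences in {time.perf_counter() - start:.5f}s")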

UPDATE: I tried running BlingFire on all-lowercase text and it performed terribly. For now I'm going to use PySBD in my project.


1
BlingFire currently doesn't work on ARM Linux or macOS. Link - HappyFace
Sorry to hear that. It works great on my AMD Threadripper Linux box. - mph
When tested against a subset of the 51 English "Golden Rules" defined here (https://s3.amazonaws.com/tm-town-nlp-resources/golden_rules.txt, from https://github.com/diasks2/pragmatic_segmenter), BlingFire was the most accurate option, and only slightly slower than NLTK with nltk.tokenize.sent_tokenize(text). The subset, which was quite relevant for my purposes, included only these 33 rules: https://pastebin.com/raw/xqJATfcX. - Spherical Cowboy

9
You can also use the sentence tokenization function in NLTK:

from nltk.tokenize import sent_tokenize
sentence = "As the most quoted English writer Shakespeare has more than his share of famous quotes.  Some Shakespare famous quotes are known for their beauty, some for their everyday truths and some for their wisdom. We often talk about Shakespeare’s quotes as things the wise Bard is saying to us but, we should remember that some of his wisest words are spoken by his biggest fools. For example, both ‘neither a borrower nor a lender be,’ and ‘to thine own self be true’ are from the foolish, garrulous and quite disreputable Polonius in Hamlet."

sent_tokenize(sentence)

I tried to use it since nltk is a really nice library, but it fails on abbreviations: it splits at them when it shouldn't. - Tedo Vrbanec

7
For simple cases (where sentences are terminated normally), this should work:
import re
text = ''.join(open('somefile.txt').readlines())
sentences = re.split(r' *[\.\?!][\'"\)\]]* *', text)

The regex used is *\. +, which matches a period surrounded by 0 or more spaces to the left and 1 or more to the right (to prevent something like the period in re.split from being counted as a sentence break).
Obviously, this isn't the most robust solution, but it'll do fine in most cases. The only case this won't cover is abbreviations (perhaps run through the list of sentences and check that each string in sentences starts with a capital letter?).
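
For instance, a minimal sketch of that capital-letter check (my own interpretation, not part of the original answer), which glues a chunk back onto the previous one when it doesn't start with an uppercase letter:

# Assumes the `sentences` list produced by the snippet above.
merged = []
for chunk in sentences:
    chunk = chunk.strip()
    if not chunk:
        continue
    if merged and not chunk[0].isupper():
        # Probably a split inside an abbreviation; re-attach it.
        merged[-1] += " " + chunk
    else:
        merged.append(chunk)
sentences = merged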

39
You can't think of a situation in written English where a sentence doesn't end with a period? Think about it! My response to that would be, "think again." (See what I did there?) - Ned Batchelder
@Ned Wow, can't believe I was that stupid. I must have been drunk or something. - Rafe Kettler
I'm using Python 2.7.2 on Win 7 x86, and the regex in the code above gives me this error: SyntaxError: EOL while scanning string literal, pointing at the closing parenthesis (after text). Also, the regex you reference in your text doesn't exist in your code sample. - Sabuncu
1
The regex is not entirely correct; it should be r' *[\.\?!][\'"\)\]]* +'. - fsociety
It can cause many problems and chunk a sentence into smaller pieces. Consider the case where we have "I paid $3.5 for this ice cream": the chunks become "I paid $3" and "5 for this ice cream". Using the default nltk sentence tokenizer is safer! - Reihan_amn

6

Using spaCy:

import spacy

nlp = spacy.load('en_core_web_sm')
text = "How are you today? I hope you have a great day"
tokens = nlp(text)
for sent in tokens.sents:
    print(sent.text.strip())  # sent.string was removed in spaCy 3; use sent.text

3

Since this is the first post that shows up for splitting into n sentences, I'll add to it.

This works with a variable split length, which specifies how many sentences end up joined together (a short usage example follows the snippet).

import nltk
# nltk.download('punkt')
from more_itertools import windowed

split_length = 3  # 3 sentences per chunk, for example

elements = nltk.tokenize.sent_tokenize(text)
segments = windowed(elements, n=split_length, step=split_length)
text_splits = []
for seg in segments:
    txt = " ".join([t for t in seg if t])  # drop the None padding added by windowed
    if len(txt) > 0:
        text_splits.append(txt)
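
As a quick illustration (hypothetical input; text has to be defined before running the snippet above):

text = "One. Two. Three. Four. Five. Six. Seven."
# With split_length = 3, text_splits should come out as:
# ['One. Two. Three.', 'Four. Five. Six.', 'Seven.']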
