I have a text file and I need to get a list of sentences.
How can I implement this? There are a lot of subtleties to take into account, such as a period being used in abbreviations.
My old regular expression works badly:
re.compile('(\. |^|!|\?)([A-Z][^;↑\.<>@\^&/\[\]]*(\.|!|\?) )',re.M)
nltk.download()
and download the model -> punkt. - Martin Thoma
import nltk
then nltk.sent_tokenize(string). - Yibo Yang
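Put together, a minimal runnable version of that recipe (the sample string here is made up):

import nltk
nltk.download('punkt')  # one-time download of the Punkt sentence model

string = "Good muffins cost $3.88 in New York. Please buy me two of them. Thanks."
print(nltk.sent_tokenize(string))
# ['Good muffins cost $3.88 in New York.', 'Please buy me two of them.', 'Thanks.']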
# -*- coding: utf-8 -*-
import re

alphabets = r"([A-Za-z])"
prefixes = r"(Mr|St|Mrs|Ms|Dr)[.]"
suffixes = r"(Inc|Ltd|Jr|Sr|Co)"
starters = r"(Mr|Mrs|Ms|Dr|Prof|Capt|Cpt|Lt|He\s|She\s|It\s|They\s|Their\s|Our\s|We\s|But\s|However\s|That\s|This\s|Wherever)"
acronyms = r"([A-Z][.][A-Z][.](?:[A-Z][.])?)"
websites = r"[.](com|net|org|io|gov|edu|me)"
digits = r"([0-9])"
multiple_dots = r"\.{2,}"

def split_into_sentences(text: str) -> list[str]:
    """
    Split the text into sentences.

    If the text contains substrings "<prd>" or "<stop>", they would lead
    to incorrect splitting because they are used as markers for splitting.

    :param text: text to be split into sentences
    :type text: str

    :return: list of sentences
    :rtype: list[str]
    """
    text = " " + text + " "
    text = text.replace("\n", " ")
    text = re.sub(prefixes, "\\1<prd>", text)
    text = re.sub(websites, "<prd>\\1", text)
    text = re.sub(digits + "[.]" + digits, "\\1<prd>\\2", text)
    text = re.sub(multiple_dots, lambda match: "<prd>" * len(match.group(0)) + "<stop>", text)
    if "Ph.D" in text: text = text.replace("Ph.D.", "Ph<prd>D<prd>")
    text = re.sub(r"\s" + alphabets + "[.] ", " \\1<prd> ", text)
    text = re.sub(acronyms + " " + starters, "\\1<stop> \\2", text)
    text = re.sub(alphabets + "[.]" + alphabets + "[.]" + alphabets + "[.]", "\\1<prd>\\2<prd>\\3<prd>", text)
    text = re.sub(alphabets + "[.]" + alphabets + "[.]", "\\1<prd>\\2<prd>", text)
    text = re.sub(" " + suffixes + "[.] " + starters, " \\1<stop> \\2", text)
    text = re.sub(" " + suffixes + "[.]", " \\1<prd>", text)
    text = re.sub(" " + alphabets + "[.]", " \\1<prd>", text)
    if "”" in text: text = text.replace(".”", "”.")
    if "\"" in text: text = text.replace(".\"", "\".")
    if "!" in text: text = text.replace("!\"", "\"!")
    if "?" in text: text = text.replace("?\"", "\"?")
    text = text.replace(".", ".<stop>")
    text = text.replace("?", "?<stop>")
    text = text.replace("!", "!<stop>")
    text = text.replace("<prd>", ".")
    sentences = text.split("<stop>")
    sentences = [s.strip() for s in sentences]
    if sentences and not sentences[-1]: sentences = sentences[:-1]
    return sentences
A comparison with nltk:
>>> from nltk.tokenize import sent_tokenize
Example 1: here split_into_sentences is better (because it explicitly covers many cases):
>>> text = 'Some sentence. Mr. Holmes...This is a new sentence!And This is another one.. Hi '
>>> split_into_sentences(text)
['Some sentence.',
'Mr. Holmes...',
'This is a new sentence!',
'And This is another one..',
'Hi']
>>> sent_tokenize(text)
['Some sentence.',
'Mr.',
'Holmes...This is a new sentence!And This is another one.. Hi']
Example 2: here nltk.tokenize.sent_tokenize is better (because it uses a machine-learning model):
>>> text = 'The U.S. Drug Enforcement Administration (DEA) says hello. And have a nice day.'
>>> split_into_sentences(text)
['The U.S.',
'Drug Enforcement Administration (DEA) says hello.',
'And have a nice day.']
>>> sent_tokenize(text)
['The U.S. Drug Enforcement Administration (DEA) says hello.',
'And have a nice day.']
You can extend the lists to suit your needs, e.g. add more titles to prefixes (Mr, St, Mrs, Ms, Dr, Prof, Capt, etc.), use websites = "[.](com|net|org|io|gov|me|edu)", and if text contains "...", replace it with "<prd><prd><prd>". - Dascienz
Besides splitting the text into sentences with a regular expression, you can also use the nltk library.
>>> from nltk import tokenize
>>> p = "Good morning Dr. Adams. The patient is waiting for you in room number 3."
>>> tokenize.sent_tokenize(p)
['Good morning Dr. Adams.', 'The patient is waiting for you in room number 3.']
for sentence in tokenize.sent_tokenize(text): print(sentence)
- Victoria Stuart
You can try using spaCy instead of a regex. I use it and it does the job well.
import spacy

nlp = spacy.load('en_core_web_sm')  # the old 'en' shortcut no longer works in spaCy 3
text = '''Your text here'''
tokens = nlp(text)
for sent in tokens.sents:
    print(sent.text.strip())  # Span.string was removed in spaCy 3; use .text
Here's a middle-of-the-road approach that doesn't rely on any external libraries. I use a list comprehension to exclude overlaps between abbreviations and terminators, as well as overlaps between variants of terminators, e.g. '.' vs. '."'.
abbreviations = {'dr.': 'doctor', 'mr.': 'mister', 'bro.': 'brother', 'bro': 'brother', 'mrs.': 'mistress',
                 'ms.': 'miss', 'jr.': 'junior', 'sr.': 'senior', 'i.e.': 'for example', 'e.g.': 'for example',
                 'vs.': 'versus'}
terminators = ['.', '!', '?']
wrappers = ['"', "'", ')', ']', '}']

def find_sentences(paragraph):
    end = True
    sentences = []
    while end > -1:
        end = find_sentence_end(paragraph)
        if end > -1:
            sentences.append(paragraph[end:].strip())
            paragraph = paragraph[:end]
    sentences.append(paragraph)
    sentences.reverse()
    return sentences

def find_sentence_end(paragraph):
    [possible_endings, contraction_locations] = [[], []]
    contractions = abbreviations.keys()
    sentence_terminators = terminators + [terminator + wrapper for wrapper in wrappers for terminator in terminators]
    for sentence_terminator in sentence_terminators:
        t_indices = list(find_all(paragraph, sentence_terminator))
        possible_endings.extend(([] if not len(t_indices) else [[i, len(sentence_terminator)] for i in t_indices]))
    for contraction in contractions:
        c_indices = list(find_all(paragraph, contraction))
        contraction_locations.extend(([] if not len(c_indices) else [i + len(contraction) for i in c_indices]))
    possible_endings = [pe for pe in possible_endings if pe[0] + pe[1] not in contraction_locations]
    if len(paragraph) in [pe[0] + pe[1] for pe in possible_endings]:
        max_end_start = max([pe[0] for pe in possible_endings])
        possible_endings = [pe for pe in possible_endings if pe[0] != max_end_start]
    possible_endings = [pe[0] + pe[1] for pe in possible_endings if sum(pe) > len(paragraph) or (sum(pe) < len(paragraph) and paragraph[sum(pe)] == ' ')]
    end = (-1 if not len(possible_endings) else max(possible_endings))
    return end

def find_all(a_str, sub):
    start = 0
    while True:
        start = a_str.find(sub, start)
        if start == -1:
            return
        yield start
        start += len(sub)
I used Karl's find_all function from this post: Find all occurrences of a substring in Python
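A quick usage check (the snippet is made up; note that the abbreviation keys are lowercase, so matching is case-sensitive):

paragraph = "Some fruits are sweet, e.g. mangoes. Others are sour."
print(find_sentences(paragraph))
# ['Some fruits are sweet, e.g. mangoes.', 'Others are sour.']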
... and ?! - Shane Smiskol
I really like spaCy, but I recently discovered two new approaches to sentence tokenization: BlingFire from Microsoft (blazingly fast) and PySBD from AI2 (supremely accurate).
text = ...

# BlingFire (Microsoft, very fast)
from blingfire import text_to_sentences
sents = text_to_sentences(text).split('\n')

# PySBD (AI2, very accurate)
from pysbd import Segmenter
segmenter = Segmenter(language='en', clean=False)
sents = segmenter.segment(text)
I split 20k sentences with five different methods and timed them on an AMD Threadripper Linux machine.
Update: I tried BlingFire on all-lowercase text and it performed very poorly. For now I'll be using PySBD in my project.
You can also use the sentence tokenization function in NLTK:
from nltk.tokenize import sent_tokenize
sentence = "As the most quoted English writer Shakespeare has more than his share of famous quotes. Some Shakespare famous quotes are known for their beauty, some for their everyday truths and some for their wisdom. We often talk about Shakespeare’s quotes as things the wise Bard is saying to us but, we should remember that some of his wisest words are spoken by his biggest fools. For example, both ‘neither a borrower nor a lender be,’ and ‘to thine own self be true’ are from the foolish, garrulous and quite disreputable Polonius in Hamlet."
sent_tokenize(sentence)
import re
text = open('somefile.txt').read()
sentences = re.split(r' *[\.\?!][\'"\)\]]* *', text)
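Note that this kind of split consumes the terminators; a quick check with a made-up string:

>>> import re
>>> re.split(r' *[\.\?!][\'"\)\]]* *', "Hello there. How are you? Fine!")
['Hello there', 'How are you', 'Fine', '']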
The pattern ' *\. +' matches a period with zero or more spaces to its left and one or more to its right (so that something like the period in re.split is not treated as a break inside a sentence). Could you then filter the list by whether each string in sentences starts with a capital letter?
I get SyntaxError: EOL while scanning string literal, pointing at the closing parenthesis (after text). Also, the regex you refer to in the text does not appear in your code example. - Sabuncur
Use ' *[\.\?!][\'"\)\]]* +'. - fsociety
Using spaCy:
import spacy
nlp = spacy.load('en_core_web_sm')
text = "How are you today? I hope you have a great day"
tokens = nlp(text)
for sent in tokens.sents:
    print(sent.text.strip())  # Span.string was removed in spaCy 3; use .text
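If you'd rather not download a trained model, spaCy's rule-based sentencizer component alone can handle simple cases; a minimal sketch (same sample text as above):

import spacy

nlp = spacy.blank('en')       # blank English pipeline, no trained model needed
nlp.add_pipe('sentencizer')   # rule-based sentence boundary detection

doc = nlp("How are you today? I hope you have a great day")
for sent in doc.sents:
    print(sent.text.strip())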
Since this is the first post that shows up for splitting into n sentences, I'll add my bit too.
This works with a variable split length, which specifies the number of sentences joined together in each chunk.
import nltk
# nltk.download('punkt')
from more_itertools import windowed

split_length = 3  # e.g. 3 sentences per chunk
elements = nltk.tokenize.sent_tokenize(text)  # text is your input string

segments = windowed(elements, n=split_length, step=split_length)
text_splits = []
for seg in segments:
    txt = " ".join([t for t in seg if t])
    if len(txt) > 0:
        text_splits.append(txt)
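With a made-up six-sentence input and split_length = 3, this produces two three-sentence chunks; the `if t` filter drops the None padding that windowed adds when the last window comes up short:

text = "One. Two. Three. Four. Five. Six."
# running the snippet above then gives:
# text_splits == ['One. Two. Three.', 'Four. Five. Six.']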