Tokenize a paragraph into sentences and then into words in NLTK

Question


50

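I am trying to feed an entire paragraph into my text processor, splitting it first into sentences and then into words.

I tried the following code, but it does not work: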

# Text is the paragraph input
sent_text = sent_tokenize(text)
tokenized_text = word_tokenize(sent_text.split)
tagged = nltk.pos_tag(tokenized_text)
print(tagged)

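However, this does not work and raises errors. How do I split a paragraph into sentences and then split the sentences into words?

Here is a sample paragraph I am using (note: it is from the public-domain short story "A Dark Brown Dog" by Stephen Crane).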

This thing seemed to overpower and astonish the little dark-brown dog, and wounded him to the heart. He sank down in despair at the child's feet. When the blow was repeated, together with an admonition in childish sentences, he turned over upon his back, and held his paws in a peculiar manner. At the same time with his ears and his eyes he offered a small prayer to the child.

- Nikhil Raghavendra

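Can you post a sample of text? - alvas

@alvas It's just a random paragraph. - Nikhil Raghavendra

Show the input, because the code will differ depending on the encoding, shape, and other differences in the input. - alvas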

1

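Show an actual sample input... If it is just plain English text (not social media such as Twitter), you can simply use [pos_tag(word_tokenize(sent)) for sent in sent_tokenize(text)], and using Python 3 should resolve most UTF-8 problems. But if your input is in a different encoding or format, you will run into more problems later. - alvas

Please copy the file or upload a sample to Dropbox or a similar site and share it with us. Maybe we can help, maybe not. - alvas

3 Answers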

14

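Here is a shortened version. It gives you a data structure containing each individual sentence and the tokens within each sentence. I prefer TweetTokenizer for messy, real-world language. The sentence tokenizer is considered decent, but be careful not to lowercase your words before this step, as doing so may reduce the accuracy of detecting sentence boundaries in messy text.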

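from nltk.tokenize import TweetTokenizer, sent_tokenize

# input_text is the paragraph to tokenize
tokenizer_words = TweetTokenizer()
# split into sentences first, then tokenize each sentence into words
tokens_sentences = [tokenizer_words.tokenize(t) for t in sent_tokenize(input_text)]
print(tokens_sentences)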

Here is the output, which I cleaned up to highlight the structure:

[
['This', 'thing', 'seemed', 'to', 'overpower', 'and', 'astonish', 'the', 'little', 'dark-brown', 'dog', ',', 'and', 'wounded', 'him', 'to', 'the', 'heart', '.'], 
['He', 'sank', 'down', 'in', 'despair', 'at', 'the', "child's", 'feet', '.'], 
['When', 'the', 'blow', 'was', 'repeated', ',', 'together', 'with', 'an', 'admonition', 'in', 'childish', 'sentences', ',', 'he', 'turned', 'over', 'upon', 'his', 'back', ',', 'and', 'held', 'his', 'paws', 'in', 'a', 'peculiar', 'manner', '.'], 
['At', 'the', 'same', 'time', 'with', 'his', 'ears', 'and', 'his', 'eyes', 'he', 'offered', 'a', 'small', 'prayer', 'to', 'the', 'child', '.']
]
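
If lowercase tokens are needed later, one option is to lowercase after the sentences and words have been split, so that sentence boundary detection still sees the original casing. A minimal sketch, using two of the token lists from the output above, copied by hand rather than recomputed (the name tokens_lower is only illustrative):

```python
# Two of the token lists shown above, copied by hand so the sketch runs on its own.
tokens_sentences = [
    ['He', 'sank', 'down', 'in', 'despair', 'at', 'the', "child's", 'feet', '.'],
    ['At', 'the', 'same', 'time', 'with', 'his', 'ears', 'and', 'his', 'eyes',
     'he', 'offered', 'a', 'small', 'prayer', 'to', 'the', 'child', '.'],
]

# Lowercase each token only after sentence and word splitting are done.
tokens_lower = [[token.lower() for token in sentence] for sentence in tokens_sentences]
print(tokens_lower)
```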

- Brian Cugelman

2

Thanks for the info about TweetTokenizer! - information_interchange

5

import nltk

textsample = "This thing seemed to overpower and astonish the little dark-brown dog, and wounded him to the heart. He sank down in despair at the child's feet. When the blow was repeated, together with an admonition in childish sentences, he turned over upon his back, and held his paws in a peculiar manner. At the same time with his ears and his eyes he offered a small prayer to the child."

sentences = nltk.sent_tokenize(textsample)  # list of sentence strings
words = nltk.word_tokenize(textsample)  # flat list of tokens for the whole paragraph
print(sentences)
print([w for w in words if w.isalpha()])  # keep purely alphabetic tokens only

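The last line of the code above ensures that the output contains only words and no special characters. Here are the sentences it outputs: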

['This thing seemed to overpower and astonish the little dark-brown dog, and wounded him to the heart.',
 "He sank down in despair at the child's feet.",
 'When the blow was repeated, together with an admonition in childish sentences, he turned over upon his back, and held his paws in a peculiar manner.',
 'At the same time with his ears and his eyes he offered a small prayer to the child.']

After removing the special characters, the output is as follows:

['This',
 'thing',
 'seemed',
 'to',
 'overpower',
 'and',
 'astonish',
 'the',
 'little',
 'dog',
 'and',
 'wounded',
 'him',
 'to',
 'the',
 'heart',
 'He',
 'sank',
 'down',
 'in',
 'despair',
 'at',
 'the',
 'child',
 'feet',
 'When',
 'the',
 'blow',
 'was',
 'repeated',
 'together',
 'with',
 'an',
 'admonition',
 'in',
 'childish',
 'sentences',
 'he',
 'turned',
 'over',
 'upon',
 'his',
 'back',
 'and',
 'held',
 'his',
 'paws',
 'in',
 'a',
 'peculiar',
 'manner',
 'At',
 'the',
 'same',
 'time',
 'with',
 'his',
 'ears',
 'and',
 'his',
 'eyes',
 'he',
 'offered',
 'a',
 'small',
 'prayer',
 'to',
 'the',
 'child']
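
Note that the filter also removes tokens that mix letters with other characters: dark-brown is missing from the list above, and child's is reduced to child. If such tokens should be kept, a looser filter may help. A minimal sketch, assuming a hand-written token list in the shape word_tokenize is expected to produce for this text (is_punctuation is a hypothetical helper, not part of NLTK):

```python
import string

# Hand-written tokens, standing in for part of the word_tokenize output.
words = ['the', 'little', 'dark-brown', 'dog', ',', 'at', 'the', 'child', "'s", 'feet', '.']

def is_punctuation(token):
    """Return True if every character in the token is ASCII punctuation."""
    return all(ch in string.punctuation for ch in token)

# isalpha() keeps purely alphabetic tokens only.
alpha_only = [w for w in words if w.isalpha()]
# Dropping punctuation-only tokens keeps 'dark-brown' and "'s".
no_punct = [w for w in words if not is_punctuation(w)]

print(alpha_only)
print(no_punct)
```

Which filter is appropriate depends on whether hyphenated words and clitics count as words for the task at hand.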

- Sripathi


- slider · Accepted Answer

You probably intended to loop over sent_text:

import nltk

sent_text = nltk.sent_tokenize(text) # this gives us a list of sentences
# now loop over each sentence and tokenize it separately
for sentence in sent_text:
    tokenized_text = nltk.word_tokenize(sentence)
    tagged = nltk.pos_tag(tokenized_text)
    print(tagged)
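
The original attempt fails because sent_tokenize returns a list of sentence strings, and a list has no split attribute. If the tokens are needed later rather than just printed, the same loop can collect them into a nested list. A minimal sketch that runs without NLTK data: the sentence list is written by hand in place of the sent_tokenize result, and rough_word_tokenize is a hypothetical regex stand-in for nltk.word_tokenize, not its actual behavior:

```python
import re

# Hand-written stand-in for what nltk.sent_tokenize(text) returns: a list of str.
sent_text = [
    "He sank down in despair at the child's feet.",
    "At the same time with his ears and his eyes he offered a small prayer to the child.",
]

# The question's code accessed sent_text.split, but a list has no such attribute.
try:
    sent_text.split
except AttributeError as err:
    error_message = str(err)
    print(error_message)

def rough_word_tokenize(sentence):
    """Rough stand-in for nltk.word_tokenize (assumption: simple regex splitting)."""
    return re.findall(r"\w+(?:[-']\w+)*|[^\w\s]", sentence)

# Looping works because each element of the list is a str.
tokens_per_sentence = [rough_word_tokenize(sentence) for sentence in sent_text]
print(tokens_per_sentence)
```

With NLTK and its tokenizer data installed, the hand-written list would be replaced by nltk.sent_tokenize(text) and the stand-in function by nltk.word_tokenize.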