I am trying to feed an entire paragraph into my word processor, having it split first into sentences and then into words.
The following code I tried does not seem to work:
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

# text is the paragraph input
sent_text = sent_tokenize(text)
tokenized_text = word_tokenize(sent_text.split)  # this line raises an error
tagged = nltk.pos_tag(tokenized_text)
print(tagged)
However, this does not work and gives me errors. So how should I split a paragraph into sentences, and then split those sentences into words?
Here is an example paragraph I am working with (note: it is from the public-domain short story "A Dark Brown Dog" by Stephen Crane):
This thing seemed to overpower and astonish the little dark-brown dog, and wounded him to the heart. He sank down in despair at the child's feet. When the blow was repeated, together with an admonition in childish sentences, he turned over upon his back, and held his paws in a peculiar manner. At the same time with his ears and his eyes he offered a small prayer to the child.
Could you give a sample of `text`? - alvas

`[pos_tag(word_tokenize(sent)) for sent in sent_tokenize(text)]` - alvas

... to resolve most utf-8 problems. But if your input is in a different encoding/format, you will run into more problems further down the line. - alvas