Need a Python module for stemming text documents

Question

python module preprocessor nlp stemming

20

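I need a good Python stemming module for the text preprocessing stage.

I found this one:

http://pypi.python.org/pypi/PyStemmer/1.0.1

But I can't find any documentation at the link provided.

If anyone knows where to find the documentation, or knows of any other good stemming algorithms, please help.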

- Kai

5 Answers

8

All of the stemmers discussed here are algorithmic, so they can produce unexpected results, for example:

In [3]: from nltk.stem.porter import *

In [4]: stemmer = PorterStemmer()

In [5]: stemmer.stem('identified')
Out[5]: u'identifi'

In [6]: stemmer.stem('nonsensical')
Out[6]: u'nonsens'

To get the root form correctly, you need a dictionary-based stemmer such as the Hunspell stemmer. A Python implementation of it is available here: link. Example code follows:

>>> import hunspell
>>> hobj = hunspell.HunSpell('/usr/share/myspell/en_US.dic', '/usr/share/myspell/en_US.aff')
>>> hobj.spell('spookie')
False
>>> hobj.suggest('spookie')
['spookier', 'spookiness', 'spooky', 'spook', 'spoonbill']
>>> hobj.spell('spooky')
True
>>> hobj.analyze('linked')
[' st:link fl:D']
>>> hobj.stem('linked')
['link']

- 0xF

6

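The point of stemming is not to find the root form of a word (that is lemmatization, which nltk also has a module for), but to find a shortened version of the word that its other inflected forms are shortened to as well. It is fine if the stemmer does not find the root form, as long as stem('nonsense') == stem('nonsensical') != stem('bananas'). - umop aplsdn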
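The property described in this comment can be checked directly by grouping words by their stem. The following is only a minimal sketch: `toy_stem` is a hypothetical, deliberately crude suffix stripper written for illustration (it is not the Porter algorithm), and `group_by_stem` is an assumed helper name. Any real stemmer with a `str -> str` signature could be passed in instead.

```python
from collections import defaultdict


def toy_stem(word):
    """Hypothetical, deliberately crude suffix stripper (NOT Porter)."""
    word = word.lower()
    for suffix in ("ical", "ing", "ed", "es", "e", "s"):
        # Strip the first matching suffix, but only if at least 3 characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def group_by_stem(words, stem=toy_stem):
    """Group words into classes that share the same stem."""
    groups = defaultdict(list)
    for word in words:
        groups[stem(word)].append(word)
    return dict(groups)


groups = group_by_stem(["nonsense", "nonsensical", "bananas"])
for stem_key, members in groups.items():
    print(stem_key, members)
```

The stems themselves need not be dictionary words. What matters for search or counting is only which words end up in the same group.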

7

The Python stemming module contains implementations of several stemming algorithms, such as Porter, Porter2, Paice-Husk, and Lovins. http://pypi.python.org/pypi/stemming/1.0

    >>> from stemming.porter2 import stem
    >>> stem("factionally")
    'faction'

- shiva

Note that this is a pure-Python implementation, so at scale it is slower than wrappers around fast C implementations such as PyStemmer. - Varun Balupuri
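If a slower pure-Python stemmer is the only option, one possible mitigation is to cache results, since running text repeats the same words many times. The sketch below assumes the stemmer is a plain `str -> str` function. `make_cached_stemmer` and `slow_stem` are hypothetical names, and `slow_stem` is only a stand-in that records how often it is really called.

```python
from functools import lru_cache


def make_cached_stemmer(stem, maxsize=100_000):
    """Wrap any str -> str stemmer in an LRU cache."""
    return lru_cache(maxsize=maxsize)(stem)


calls = []


def slow_stem(word):
    # Stand-in for a slow pure-Python stemmer; records each real invocation.
    calls.append(word)
    return word.rstrip("s")


cached_stem = make_cached_stemmer(slow_stem)

tokens = "cats chase cats and dogs chase cats".split()
stems = [cached_stem(token) for token in tokens]

print(stems)
print("real stemmer calls:", len(calls), "of", len(tokens), "tokens")
```

Whether this closes the gap to a C-backed stemmer depends on the vocabulary size and the corpus, so it is worth measuring on real data.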

3

The gensim package for topic modeling ships with an implementation of the Porter stemmer:

>>> from gensim import parsing
>>> parsing.stem_text("trying writing nonsense")
'try write nonsens'

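The Porter stemmer is the only stemming option implemented in gensim.

As an aside: I would imagine (without having further references) that most text-mining modules ship their own implementations of simple preprocessing steps such as Porter stemming, whitespace removal, and stop-word removal.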
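The preprocessing steps mentioned above (whitespace removal, stop-word removal, stemming) can be sketched with the standard library alone. This is an illustrative pipeline, not the implementation used by gensim or any other package: `preprocess` and `STOPWORDS` are hypothetical names, the stop-word list is a tiny sample, and the stemmer is passed in as a callable so that a real one can be substituted.

```python
import re

# Tiny illustrative stop-word list; real lists are much longer.
STOPWORDS = frozenset({"a", "an", "and", "the", "of", "to", "is", "in"})


def preprocess(text, stem=lambda word: word, stopwords=STOPWORDS):
    """Normalize whitespace and case, drop stop words, then stem each token."""
    # Collapse runs of spaces, tabs, and newlines into single spaces.
    text = re.sub(r"\s+", " ", text).strip().lower()
    # Keep only runs of ASCII letters and digits as tokens.
    tokens = re.findall(r"[a-z0-9]+", text)
    return [stem(token) for token in tokens if token not in stopwords]


# The lambda here is a crude stand-in that only strips trailing "s".
result = preprocess("  The  cats are\tchasing the   dogs ",
                    stem=lambda word: word.rstrip("s"))
print(result)
```

Passing something like `PorterStemmer().stem` from NLTK as the `stem` argument should work the same way, since it also maps one string to one string.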

- KenHBS

1

PyStemmer is a Python interface to the Snowball stemming library.

Documentation can be found here: https://github.com/snowballstem/pystemmer/blob/master/docs/quickstart.txt https://github.com/snowballstem/pystemmer/blob/master/docs/quickstart_python3.txt

- Brice M. Dempsey

- ditkin · Accepted Answer

33

You could try NLTK:

>>> from nltk import PorterStemmer
>>> PorterStemmer().stem('complications')

- ditkin

Wasn't the Porter stemmer developed in the 1980s? Surely there are more advanced options by now? - kalu

2

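You are right, there are other stemmers. In the preview of the section on stemmers in Natural Language Processing with Python, the authors briefly compare the Lancaster and Porter stemmers and note: "Stemming is not a well-defined process, and we typically pick the stemmer that best suits the application we have in mind. The Porter Stemmer is a good choice if you are indexing some texts and want to support search using alternative forms of words." - ditkin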
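The indexing use case from the quote can be illustrated with a small inverted index keyed by stems, so that a query in one word form also finds documents containing other forms. This is a sketch under stated assumptions: `build_index`, `search`, and `crude_stem` are hypothetical names, and `crude_stem` is a toy stand-in for a real stemmer such as Porter.

```python
from collections import defaultdict


def crude_stem(word):
    """Hypothetical toy stemmer: strips one of a few common suffixes."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def build_index(docs, stem):
    """Map each stem to the set of document ids containing a word with that stem."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[stem(word)].add(doc_id)
    return index


def search(index, query, stem):
    """Stem the query with the same stemmer used at indexing time, then look it up."""
    return index.get(stem(query.lower()), set())


docs = {
    1: "she linked the pages",
    2: "linking pages is easy",
    3: "bananas are yellow",
}
index = build_index(docs, crude_stem)
hits = search(index, "links", crude_stem)
print(sorted(hits))
```

The key design point is that the same stemmer is applied at indexing time and at query time. Which stemmer is used matters less than applying it consistently.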