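I had the same need. lemur does have summarization capabilities, but I found it too buggy to be usable. Last weekend I used nltk to write a summarization module in Python: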
https://github.com/thavelick/summarize
I took the algorithm from the Java library Classifier4J (http://classifier4j.sourceforge.net/), but used nltk and Python wherever possible.
Here is the basic usage:
>>> import summarize
A SimpleSummarizer (currently the only summarizer) builds a summary out of the sentences that contain the most frequent words.
>>> ss = summarize.SimpleSummarizer()
>>> input = "NLTK is a python library for working human-written text. Summarize is a package that uses NLTK to create summaries."
>>> ss.summarize(input, 1)
'NLTK is a python library for working human-written text.'
You can ask for as many sentences in the summary as you like.
>>> input = "NLTK is a python library for working human-written text. Summarize is a package that uses NLTK to create summaries. A Summariser is really cool. I don't think there are any other python summarisers."
>>> ss.summarize(input, 2)
"NLTK is a python library for working human-written text. I don't think there are any other python summarisers."
Unlike the original algorithm from Classifier4J, this summarizer correctly handles punctuation other than periods:
>>> input = "NLTK is a python library for working human-written text! Summarize is a package that uses NLTK to create summaries."
>>> ss.summarize(input, 1)
'NLTK is a python library for working human-written text!'
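For anyone curious how this kind of summarizer works, here is a rough sketch of the general idea (rank words by frequency, then pick the first sentence that contains each top word). It is not the module's actual code: the names `split_sentences`, `sketch_summarize` and `STOPWORDS` are made up for illustration, the stop-word list is a tiny hand-picked assumption, and it uses plain regular expressions rather than nltk so that it runs without any downloads.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not the module's real list).
STOPWORDS = {"a", "an", "and", "are", "for", "i", "is", "of", "that", "the", "to", "uses"}


def split_sentences(text):
    # Split after '.', '!' or '?' followed by whitespace, so sentence
    # terminators other than periods are respected.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(text):
    # Lowercase word tokens; hyphenated words are split into their parts.
    return re.findall(r"[a-z']+", text.lower())


def sketch_summarize(text, num_sentences):
    sentences = split_sentences(text)
    words = [w for w in tokenize(text) if w not in STOPWORDS]
    # Most frequent words first; ties keep first-seen order.
    ranked = [word for word, _ in Counter(words).most_common()]

    chosen = []  # indices of the selected sentences
    for word in ranked:
        if len(chosen) >= num_sentences:
            break
        # Take the first sentence that contains this word, if not already taken.
        for index, sentence in enumerate(sentences):
            if word in tokenize(sentence):
                if index not in chosen:
                    chosen.append(index)
                break

    # Emit the selected sentences in their original order.
    return " ".join(sentences[i] for i in sorted(chosen))
```

With the exclamation-mark example above, `sketch_summarize(text, 1)` returns `'NLTK is a python library for working human-written text!'`. For longer inputs the sketch may pick different sentences than the real module, since the stop-word list and tokenization differ.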
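Update
I've now (finally!) released this module under the Apache 2.0 license, the same license as nltk, and put it up on github (see above). Any contributions or suggestions are welcome.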