NLTK Performance

9

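I have recently become very interested in natural language processing, but most of my work so far has been done in C. I heard about NLTK, and although I don't know Python, it seems easy to learn and looks like a powerful and interesting language. The NLTK module in particular seems very well suited to what I need to do.

However, when I pasted the NLTK sample code into a file named test.py, I noticed that it took a very long time to run!

I call it from the shell like this: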

time python ./test.py

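On a 2.4 GHz machine with 4 GB of RAM, execution takes 19.187 seconds!

Maybe this is perfectly normal, but I was under the impression that NLTK was very fast. I may be mistaken, but is there something obvious that I am clearly doing wrong here?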


3
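Where did you get the impression that NLTK is very fast? - Fred Foo
From the Amazon description of "Python Text Processing with NLTK 2.0": "Learn how to easily handle huge amounts of data without any loss in efficiency or speed." (http://www.amazon.com/Python-Text-Processing-NLTK-Cookbook/dp/1849513600) - elliottbolzan
2 Answers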

19

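I believe you are confusing training time with processing time. Training a model such as a UnigramTagger can take a long time, and so can loading an already trained model from a pickle file. But once the model is loaded into memory, processing can be quite fast. See the "Classifier Efficiency" section at the bottom of my article on part-of-speech tagging with NLTK for the processing speed of different tagging algorithms.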
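A minimal sketch of the train-once, load-later pattern. To stay self-contained it uses a toy most-frequent-tag model (a plain dict) rather than NLTK's UnigramTagger. The helper names `train_unigram_model`, `tag`, and `load_or_train`, the tiny training set, and the cache file name are all hypothetical, chosen only for illustration.

```python
import os
import pickle
import tempfile
from collections import Counter, defaultdict


def train_unigram_model(tagged_sents):
    """Toy stand-in for training: map each word to its most frequent tag."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag_ in sent:
            counts[word][tag_] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}


def tag(model, tokens):
    """Tag tokens with the trained model; unknown words get None."""
    return [(tok, model.get(tok)) for tok in tokens]


def load_or_train(path, tagged_sents):
    """Load a cached model if one exists, otherwise train it and cache it."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    model = train_unigram_model(tagged_sents)
    with open(path, "wb") as f:
        pickle.dump(model, f, protocol=pickle.HIGHEST_PROTOCOL)
    return model


# Hypothetical miniature training data (a real run would use a full corpus).
TRAIN = [
    [("the", "AT"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "AT"), ("cat", "NN"), ("sleeps", "VBZ")],
]

cache_path = os.path.join(tempfile.mkdtemp(), "tagger.pickle")
first = load_or_train(cache_path, TRAIN)   # trains, then writes the cache
second = load_or_train(cache_path, TRAIN)  # reads the cache, no training
print(tag(second, ["the", "dog", "runs"]))
```

With a real tagger the same idea applies: pay the training cost once and reuse the saved model afterwards. Loading a large pickle still takes time, but it is paid once per process, not once per tagged text.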


8

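@Jacob is right that training time and tagging time are being confused. I have simplified the sample code a little; here is the time breakdown: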

Importing nltk takes 0.33 secs
Training time: 11.54 secs
Tagging time: 0.0 secs
Sorting time: 0.0 secs

Total time: 11.88 secs

System:

CPU: Intel(R) Core(TM)2 Duo CPU E8400 @ 3.00GHz
Memory: 3.7GB

Code:

import pprint, time
# time.clock() was removed in Python 3.8; perf_counter() measures elapsed wall-clock time
startstart = time.perf_counter()

start = time.perf_counter()
import nltk  # imported here on purpose so that the import itself can be timed
print("Importing nltk takes", time.perf_counter() - start, "secs")

start = time.perf_counter()
tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+|[^\w\s]+')
# Training on the Brown corpus is the slow step (the corpus must be installed)
tagger = nltk.UnigramTagger(nltk.corpus.brown.tagged_sents())
print("Training time:", time.perf_counter() - start, "secs")


text = """Mr Blobby is a fictional character who featured on Noel
Edmonds' Saturday night entertainment show Noel's House Party,
which was often a ratings winner in the 1990s. Mr Blobby also
appeared on the Jamie Rose show of 1997. He was designed as an
outrageously over the top parody of a one-dimensional, mute novelty
character, which ironically made him distinctive, absurd and popular.
He was a large pink humanoid, covered with yellow spots, sporting a
permanent toothy grin and jiggling eyes. He communicated by saying
the word "blobby" in an electronically-altered voice, expressing
his moods through tone of voice and repetition.

There was a Mrs. Blobby, seen briefly in the video, and sold as a
doll.

However Mr Blobby actually started out as part of the 'Gotcha'
feature during the show's second series (originally called 'Gotcha
Oscars' until the threat of legal action from the Academy of Motion
Picture Arts and Sciences[citation needed]), in which celebrities
were caught out in a Candid Camera style prank. Celebrities such as
dancer Wayne Sleep and rugby union player Will Carling would be
enticed to take part in a fictitious children's programme based around
their profession. Mr Blobby would clumsily take part in the activity,
knocking over the set, causing mayhem and saying "blobby blobby
blobby", until finally when the prank was revealed, the Blobby
costume would be opened - revealing Noel inside. This was all the more
surprising for the "victim" as during rehearsals Blobby would be
played by an actor wearing only the arms and legs of the costume and
speaking in a normal manner.[citation needed]"""

start = time.perf_counter()
tokenized = tokenizer.tokenize(text)
tagged = tagger.tag(tokenized)
print("Tagging time:", time.perf_counter() - start, "secs")

start = time.perf_counter()
# Sort by tag; unknown words are tagged None, which Python 3 cannot compare with str
tagged.sort(key=lambda pair: pair[1] or "")
print("Sorting time:", time.perf_counter() - start, "secs")

#l = list(set(tagged))
#pprint.pprint(l)
print()
print("Total time:", time.perf_counter() - startstart, "secs")
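The same phase-by-phase measurement can be written with a small context manager, which avoids repeating the start/stop bookkeeping. This is only a sketch: `timed` and `timings` are hypothetical names, and the dict build and lookups are stand-ins for training and tagging so that it runs without NLTK.

```python
import time
from contextlib import contextmanager

timings = {}  # phase label -> elapsed seconds


@contextmanager
def timed(label):
    """Record the wall-clock duration of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = time.perf_counter() - start


with timed("setup"):
    # Stand-in for the expensive one-off step (e.g. training a tagger).
    table = {n: n * n for n in range(200000)}

with timed("lookup"):
    # Stand-in for the cheap per-request step (e.g. tagging one text).
    hits = [table[n] for n in range(1000)]

for label, secs in timings.items():
    print("%s time: %.4f secs" % (label, secs))
```

Timing each phase separately is what shows where the seconds go. A single `time python ./test.py` only reports the total.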

1
Great to have actual figures and code to reproduce them! - Titou
