Practical examples of NLTK use

78
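I'm playing around with the Natural Language Toolkit (NLTK).
Its documentation (Book, HOWTO) is quite bulky, and the examples are sometimes slightly advanced.
Are there any good, basic examples of uses/applications of NLTK? I'm thinking of something like the NLTK articles on the Stream Hacker blog.
3 Answers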

28

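Here's my own practical example, for the benefit of anyone else looking up this question (excuse the sample text, it was the first thing I found on Wikipedia):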

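import nltk
import pprint

# Built lazily on first use, because training the tagger is slow.
tokenizer = None
tagger = None

def init_nltk():
    global tokenizer
    global tagger
    # Tokens are runs of word characters or runs of punctuation.
    tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+|[^\w\s]+')
    # Requires the Brown corpus: run nltk.download('brown') once.
    tagger = nltk.UnigramTagger(nltk.corpus.brown.tagged_sents())

def tag(text):
    global tokenizer
    global tagger
    if not tokenizer:
        init_nltk()
    tokenized = tokenizer.tokenize(text)
    tagged = tagger.tag(tokenized)
    # Sort by tag; untagged words (tag None) sort first, as they did in Python 2.
    tagged.sort(key=lambda pair: pair[1] or '')
    return tagged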

def main():
    text = """Mr Blobby is a fictional character who featured on Noel
    Edmonds' Saturday night entertainment show Noel's House Party,
    which was often a ratings winner in the 1990s. Mr Blobby also
    appeared on the Jamie Rose show of 1997. He was designed as an
    outrageously over the top parody of a one-dimensional, mute novelty
    character, which ironically made him distinctive, absurd and popular.
    He was a large pink humanoid, covered with yellow spots, sporting a
    permanent toothy grin and jiggling eyes. He communicated by saying
    the word "blobby" in an electronically-altered voice, expressing
    his moods through tone of voice and repetition.

    There was a Mrs. Blobby, seen briefly in the video, and sold as a
    doll.

    However Mr Blobby actually started out as part of the 'Gotcha'
    feature during the show's second series (originally called 'Gotcha
    Oscars' until the threat of legal action from the Academy of Motion
    Picture Arts and Sciences[citation needed]), in which celebrities
    were caught out in a Candid Camera style prank. Celebrities such as
    dancer Wayne Sleep and rugby union player Will Carling would be
    enticed to take part in a fictitious children's programme based around
    their profession. Mr Blobby would clumsily take part in the activity,
    knocking over the set, causing mayhem and saying "blobby blobby
    blobby", until finally when the prank was revealed, the Blobby
    costume would be opened - revealing Noel inside. This was all the more
    surprising for the "victim" as during rehearsals Blobby would be
    played by an actor wearing only the arms and legs of the costume and
    speaking in a normal manner.[citation needed]"""
    tagged = tag(text)
    # Drop duplicate (word, tag) pairs, then sort by tag again.
    l = list(set(tagged))
    l.sort(key=lambda pair: pair[1] or '')
    pprint.pprint(l)

if __name__ == '__main__':
    main()

Output:

[('rugby', None),
 ('Oscars', None),
 ('1990s', None),
 ('",', None),
 ('Candid', None),
 ('"', None),
 ('blobby', None),
 ('Edmonds', None),
 ('Mr', None),
 ('outrageously', None),
 ('.[', None),
 ('toothy', None),
 ('Celebrities', None),
 ('Gotcha', None),
 (']),', None),
 ('Jamie', None),
 ('humanoid', None),
 ('Blobby', None),
 ('Carling', None),
 ('enticed', None),
 ('programme', None),
 ('1997', None),
 ('s', None),
 ("'", "'"),
 ('[', '('),
 ('(', '('),
 (']', ')'),
 (',', ','),
 ('.', '.'),
 ('all', 'ABN'),
 ('the', 'AT'),
 ('an', 'AT'),
 ('a', 'AT'),
 ('be', 'BE'),
 ('were', 'BED'),
 ('was', 'BEDZ'),
 ('is', 'BEZ'),
 ('and', 'CC'),
 ('one', 'CD'),
 ('until', 'CS'),
 ('as', 'CS'),
 ('This', 'DT'),
 ('There', 'EX'),
 ('of', 'IN'),
 ('inside', 'IN'),
 ('from', 'IN'),
 ('around', 'IN'),
 ('with', 'IN'),
 ('through', 'IN'),
 ('-', 'IN'),
 ('on', 'IN'),
 ('in', 'IN'),
 ('by', 'IN'),
 ('during', 'IN'),
 ('over', 'IN'),
 ('for', 'IN'),
 ('distinctive', 'JJ'),
 ('permanent', 'JJ'),
 ('mute', 'JJ'),
 ('popular', 'JJ'),
 ('such', 'JJ'),
 ('fictional', 'JJ'),
 ('yellow', 'JJ'),
 ('pink', 'JJ'),
 ('fictitious', 'JJ'),
 ('normal', 'JJ'),
 ('dimensional', 'JJ'),
 ('legal', 'JJ'),
 ('large', 'JJ'),
 ('surprising', 'JJ'),
 ('absurd', 'JJ'),
 ('Will', 'MD'),
 ('would', 'MD'),
 ('style', 'NN'),
 ('threat', 'NN'),
 ('novelty', 'NN'),
 ('union', 'NN'),
 ('prank', 'NN'),
 ('winner', 'NN'),
 ('parody', 'NN'),
 ('player', 'NN'),
 ('actor', 'NN'),
 ('character', 'NN'),
 ('victim', 'NN'),
 ('costume', 'NN'),
 ('action', 'NN'),
 ('activity', 'NN'),
 ('dancer', 'NN'),
 ('grin', 'NN'),
 ('doll', 'NN'),
 ('top', 'NN'),
 ('mayhem', 'NN'),
 ('citation', 'NN'),
 ('part', 'NN'),
 ('repetition', 'NN'),
 ('manner', 'NN'),
 ('tone', 'NN'),
 ('Picture', 'NN'),
 ('entertainment', 'NN'),
 ('night', 'NN'),
 ('series', 'NN'),
 ('voice', 'NN'),
 ('Mrs', 'NN'),
 ('video', 'NN'),
 ('Motion', 'NN'),
 ('profession', 'NN'),
 ('feature', 'NN'),
 ('word', 'NN'),
 ('Academy', 'NN-TL'),
 ('Camera', 'NN-TL'),
 ('Party', 'NN-TL'),
 ('House', 'NN-TL'),
 ('eyes', 'NNS'),
 ('spots', 'NNS'),
 ('rehearsals', 'NNS'),
 ('ratings', 'NNS'),
 ('arms', 'NNS'),
 ('celebrities', 'NNS'),
 ('children', 'NNS'),
 ('moods', 'NNS'),
 ('legs', 'NNS'),
 ('Sciences', 'NNS-TL'),
 ('Arts', 'NNS-TL'),
 ('Wayne', 'NP'),
 ('Rose', 'NP'),
 ('Noel', 'NP'),
 ('Saturday', 'NR'),
 ('second', 'OD'),
 ('his', 'PP$'),
 ('their', 'PP$'),
 ('him', 'PPO'),
 ('He', 'PPS'),
 ('more', 'QL'),
 ('However', 'RB'),
 ('actually', 'RB'),
 ('also', 'RB'),
 ('clumsily', 'RB'),
 ('originally', 'RB'),
 ('only', 'RB'),
 ('often', 'RB'),
 ('ironically', 'RB'),
 ('briefly', 'RB'),
 ('finally', 'RB'),
 ('electronically', 'RB-HL'),
 ('out', 'RP'),
 ('to', 'TO'),
 ('show', 'VB'),
 ('Sleep', 'VB'),
 ('take', 'VB'),
 ('opened', 'VBD'),
 ('played', 'VBD'),
 ('caught', 'VBD'),
 ('appeared', 'VBD'),
 ('revealed', 'VBD'),
 ('started', 'VBD'),
 ('saying', 'VBG'),
 ('causing', 'VBG'),
 ('expressing', 'VBG'),
 ('knocking', 'VBG'),
 ('wearing', 'VBG'),
 ('speaking', 'VBG'),
 ('sporting', 'VBG'),
 ('revealing', 'VBG'),
 ('jiggling', 'VBG'),
 ('sold', 'VBN'),
 ('called', 'VBN'),
 ('made', 'VBN'),
 ('altered', 'VBN'),
 ('based', 'VBN'),
 ('designed', 'VBN'),
 ('covered', 'VBN'),
 ('communicated', 'VBN'),
 ('needed', 'VBN'),
 ('seen', 'VBN'),
 ('set', 'VBN'),
 ('featured', 'VBN'),
 ('which', 'WDT'),
 ('who', 'WPS'),
 ('when', 'WRB')]
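
Many words in the output above are tagged None because a UnigramTagger with no backoff can only tag words it saw during training. Here is a minimal sketch of one common remedy, chaining a DefaultTagger as the backoff. So that it runs without downloading the Brown corpus, it trains on a tiny hand-made sample (train_sents is made up for illustration). Tagging every unknown word as 'NN' is an assumption for the sketch, not something the answer above does:

```python
import nltk

# Hypothetical toy training data standing in for nltk.corpus.brown.tagged_sents().
train_sents = [
    [('the', 'AT'), ('actor', 'NN'), ('was', 'BEDZ'), ('popular', 'JJ')],
]

tokens = ['the', 'humanoid', 'was', 'popular']

# Without a backoff, a word never seen in training comes back tagged None.
plain = nltk.UnigramTagger(train_sents)
print(plain.tag(tokens))

# With a backoff, unseen words fall through to the default tag instead.
with_backoff = nltk.UnigramTagger(train_sents, backoff=nltk.DefaultTagger('NN'))
print(with_backoff.tag(tokens))
```

With the real Brown corpus as training data the same backoff argument applies. A blanket 'NN' guess will still be wrong for unknown verbs and adjectives, so treat it as a starting point only.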

7
What does this do? Could you add some description? Also, why use global variables? You could have used them directly, right? - avi
1
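@avi It generates part-of-speech tags for the words (scroll down to see the full list). For example, ('called', 'VBN') means that "called" is a past participle verb. It looks like global was used so that the variables can be changed from within the functions' scope (so they don't have to be passed in on every call). - e h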
1
+1 for Mr Blobby - Aphire
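
On the question of globals raised in the comments above: one possible alternative, sketched here under my own assumptions, is to cache the expensive objects behind small getter functions. The names get_tokenizer, get_tagger and tokenize are hypothetical and do not come from the answer:

```python
from functools import lru_cache

import nltk


@lru_cache(maxsize=None)
def get_tokenizer():
    # Built on the first call, then served from the cache on later calls.
    return nltk.tokenize.RegexpTokenizer(r'\w+|[^\w\s]+')


@lru_cache(maxsize=None)
def get_tagger():
    # Needs the Brown corpus: run nltk.download('brown') once beforehand.
    return nltk.UnigramTagger(nltk.corpus.brown.tagged_sents())


def tokenize(text):
    # No global statements needed; the getter hands back the cached object.
    return get_tokenizer().tokenize(text)


print(tokenize('Mr Blobby is a fictional character.'))
```

The behaviour matches the original lazy initialisation (nothing is built until first use), but the state lives in the cache instead of in module-level variables.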

18

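NLP in general is very useful, so you might want to broaden your search to general applications of text analytics. I used NLTK to aid MOSS 2010 by generating a file taxonomy from extracted concept maps. It worked really well. It doesn't take long before files start to cluster in useful ways.

Often, to understand text analytics you have to think at a tangent to the ways you are used to thinking. For example, text analytics is extremely useful for discovery. Most people, though, don't even know the difference between search and discovery. If you read up on those subjects you will likely "discover" ways you might want to put NLTK to work.

Also, consider your world view of text files without NLTK. You have a bunch of random-length strings separated by whitespace and punctuation. Some punctuation changes meaning depending on how it is used, such as the period (which is also a decimal point and the marker that ends an abbreviation). With NLTK you get words and, more to the point, you get parts of speech. Now you have a handle on the content. Use NLTK to discover the concepts and actions in the document. Use NLTK to get at the "meaning" of the document. Meaning in this case refers to the essential relationships in the document.

It is a good thing to be curious about NLTK. Text analytics is set to grow in a big way over the next few years. Those who understand it will be better placed to take advantage of the new opportunities.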
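
To make the point about the period concrete, here is a small sketch comparing plain whitespace splitting with NLTK's RegexpTokenizer. The sample sentence and the second, decimal-aware pattern are my own assumptions for illustration. The first pattern is the one used in the tagging example earlier on this page:

```python
from nltk.tokenize import RegexpTokenizer

# Made-up sample with an abbreviation, a decimal number and a sentence-final period.
text = "Mr. Blobby paid 3.50 pounds."

# Whitespace splitting leaves punctuation glued to the words.
whitespace_tokens = text.split()

# Word runs or punctuation runs: separates the periods, but also breaks up "3.50".
simple = RegexpTokenizer(r'\w+|[^\w\s]+').tokenize(text)

# Trying a decimal-number alternative first keeps "3.50" in one piece.
decimal_aware = RegexpTokenizer(r'\d+\.\d+|\w+|[^\w\s]+').tokenize(text)

print(whitespace_tokens)  # ['Mr.', 'Blobby', 'paid', '3.50', 'pounds.']
print(simple)             # ['Mr', '.', 'Blobby', 'paid', '3', '.', '50', 'pounds', '.']
print(decimal_aware)      # ['Mr', '.', 'Blobby', 'paid', '3.50', 'pounds', '.']
```

Neither pattern knows that the period after "Mr" marks an abbreviation. That kind of ambiguity is the reason for reaching for a toolkit instead of string splitting.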


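Could you provide a reference link for MOSS 2010? - alvas
The best link I have is a paper I wrote several years ago. I'm rebuilding my web page this year to focus on my work data mining radio telescopes, but the paper should stay up for a while yet: http://www.nectarineimp.com/automated-folksonomy-whitepaper/ - Pete Mancini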

14
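I'm the author of streamhacker.com (and thanks for the mention; this question has brought a fair amount of click traffic to my site). What specifically are you trying to do? NLTK has a lot of tools for doing various things, but it lacks clear information on what to use the tools for and how best to use them. It is also oriented towards academic problems, so translating the pedagogical examples into practical solutions can be hard work.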
