nltk wordpunct_tokenize vs word_tokenize

Question

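Does anyone know the difference between nltk's wordpunct_tokenize and word_tokenize? I'm using nltk=3.2.4, and nothing in the docstring of wordpunct_tokenize explains the difference. I couldn't find this information in the nltk documentation either (perhaps I didn't search in the right place!). I would have expected the first one to strip punctuation tokens or something similar, but it doesn't.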

- tsando

1 Answer

Content provided by Stack Overflow.

- xdurch0 · Accepted Answer

wordpunct_tokenize is based on simple regexp tokenization. It is defined as:

wordpunct_tokenize = WordPunctTokenizer().tokenize

WordPunctTokenizer is defined in the NLTK source (module nltk.tokenize.regexp). Basically, it uses the regular expression \w+|[^\w\s]+ to split the input.
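To see what that pattern does on its own, here is a minimal sketch that applies it with the standard-library re module. It assumes re.findall with the same pattern is a fair stand-in for the tokenizer; the name wordpunct_like is made up for illustration and is not part of NLTK.

```python
import re

# Pattern quoted above: a run of word characters, or a run of
# characters that are neither word characters nor whitespace.
WORDPUNCT_PATTERN = re.compile(r"\w+|[^\w\s]+")


def wordpunct_like(text):
    """Hypothetical stand-in for wordpunct_tokenize, built on re.findall."""
    return WORDPUNCT_PATTERN.findall(text)


if __name__ == "__main__":
    print(wordpunct_like("Don't tell her, you'll regret it!"))
    # ['Don', "'", 't', 'tell', 'her', ',', 'you', "'", 'll', 'regret', 'it', '!']
```

Because the second alternative matches a run of symbols, adjacent punctuation such as a quote followed by a comma comes out as a single token.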

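word_tokenize, on the other hand, is based on a TreebankWordTokenizer (see the NLTK documentation). It basically tokenizes text the way the Penn Treebank does. Here is a silly example that should show how the two differ.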

>>> from nltk.tokenize import word_tokenize, wordpunct_tokenize
>>> sent = "I'm a dog and it's great! You're cool and Sandy's book is big. Don't tell her, you'll regret it! 'Hey', she'll say!"
>>> word_tokenize(sent)
['I', "'m", 'a', 'dog', 'and', 'it', "'s", 'great', '!', 'You', "'re", 
 'cool', 'and', 'Sandy', "'s", 'book', 'is', 'big', '.', 'Do', "n't", 'tell',
 'her', ',', 'you', "'ll", 'regret', 'it', '!', "'Hey", "'", ',', 'she', "'ll", 'say', '!']
>>> wordpunct_tokenize(sent)
['I', "'", 'm', 'a', 'dog', 'and', 'it', "'", 's', 'great', '!', 'You', "'",
 're', 'cool', 'and', 'Sandy', "'", 's', 'book', 'is', 'big', '.', 'Don',
 "'", 't', 'tell', 'her', ',', 'you', "'", 'll', 'regret', 'it', '!', "'", 
 'Hey', "',", 'she', "'", 'll', 'say', '!']

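We can see that wordpunct_tokenize splits at pretty much every special symbol and treats the symbols as separate units; a run of adjacent symbols, such as the quote and comma after Hey, stays together as one token. word_tokenize, on the other hand, keeps things like 're together. It doesn't seem to be all that smart, though, since it fails to separate the initial single quote from 'Hey'.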
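The way word_tokenize keeps suffixes like 're and n't intact can be approximated with a single regular expression. This is a rough sketch for illustration only, not NLTK's actual rule set: the name split_clitics and the short suffix list are assumptions, it handles lowercase suffixes only, and it does not split off other punctuation.

```python
import re

# Assumed, simplified suffix list; the real tokenizer covers more cases.
# The lookbehind requires a word character directly before the suffix.
CLITIC = re.compile(r"(?<=\w)(n't|'(?:s|m|re|ll|ve|d))\b")


def split_clitics(text):
    """Insert a space before each contraction suffix, then split on whitespace."""
    return CLITIC.sub(r" \1", text).split()


if __name__ == "__main__":
    print(split_clitics("Don't tell her"))
    # ['Do', "n't", 'tell', 'her']
```

Requiring a word character before the suffix is what leaves a leading quote, as in 'Hey', untouched by this rule.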

Interestingly, if we write the sentence like this instead (single quotes as the string delimiter and double quotes around "Hey"):

>>> sent = 'I\'m a dog and it\'s great! You\'re cool and Sandy\'s book is big. Don\'t tell her, you\'ll regret it! "Hey", she\'ll say!'

we get

>>> word_tokenize(sent)
['I', "'m", 'a', 'dog', 'and', 'it', "'s", 'great', '!', 'You', "'re", 
 'cool', 'and', 'Sandy', "'s", 'book', 'is', 'big', '.', 'Do', "n't", 
 'tell', 'her', ',', 'you', "'ll", 'regret', 'it', '!', '``', 'Hey', "''", 
 ',', 'she', "'ll", 'say', '!']

So word_tokenize does split off double quotes, but it also converts them to `` and ''. wordpunct_tokenize doesn't do this:

>>> wordpunct_tokenize(sent)
['I', "'", 'm', 'a', 'dog', 'and', 'it', "'", 's', 'great', '!', 'You', "'", 
 're', 'cool', 'and', 'Sandy', "'", 's', 'book', 'is', 'big', '.', 'Don', 
 "'", 't', 'tell', 'her', ',', 'you', "'", 'll', 'regret', 'it', '!', '"', 
 'Hey', '",', 'she', "'", 'll', 'say', '!']
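
The double-quote conversion seen in the word_tokenize output can be sketched with two substitutions. This is an approximation under stated assumptions, not NLTK's own code: the name treebank_style_quotes is hypothetical, and the heuristic treats a double quote at the start of the text or after whitespace as an opening quote and every other double quote as a closing one.

```python
import re


def treebank_style_quotes(text):
    """Rewrite straight double quotes as `` (opening) and '' (closing) tokens."""
    # Assumed heuristic: a quote not preceded by a non-space character opens.
    text = re.sub(r'(?<!\S)"', "`` ", text)
    # Every remaining double quote is treated as a closing quote.
    text = re.sub(r'"', " '' ", text)
    return text.split()


if __name__ == "__main__":
    print(treebank_style_quotes('"Hey", she said'))
    # ['``', 'Hey', "''", ',', 'she', 'said']
```

Padding the replacement with spaces is what turns each quote into its own token once the text is split on whitespace.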