nltk word_tokenize: why is sentence tokenization done before word tokenization?


As stated in the source code, word_tokenize runs a sentence tokenizer (Punkt) before running the word tokenizer (Treebank):

# Excerpt from nltk/tokenize/__init__.py (sent_tokenize is defined
# earlier in the same module).
from nltk.tokenize.treebank import TreebankWordTokenizer

# Standard word tokenizer.
_treebank_word_tokenizer = TreebankWordTokenizer()

def word_tokenize(text, language='english', preserve_line=False):
    """
    Return a tokenized copy of *text*,
    using NLTK's recommended word tokenizer
    (currently an improved :class:`.TreebankWordTokenizer`
    along with :class:`.PunktSentenceTokenizer`
    for the specified language).
    :param text: text to split into words
    :type text: str
    :param language: the model name in the Punkt corpus
    :type language: str
    :param preserve_line: An option to keep the line as one "sentence" and not sentence tokenize it.
    :type preserve_line: bool
    """
    sentences = [text] if preserve_line else sent_tokenize(text, language)
    return [token for sent in sentences
            for token in _treebank_word_tokenizer.tokenize(sent)]
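
As a quick sanity check, here is what that pipeline does end to end (the example sentence is mine; the outputs in the comments are what recent NLTK versions produce):

from nltk.tokenize import sent_tokenize, word_tokenize

text = "I like it. Dr. Smith agrees."
# Punkt splits the text into sentences first (it knows "Dr." is an
# abbreviation, so it does not break there), then Treebank tokenizes each.
print(sent_tokenize(text))  # ['I like it.', 'Dr. Smith agrees.']
print(word_tokenize(text))  # ['I', 'like', 'it', '.', 'Dr.', 'Smith', 'agrees', '.']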

What is the advantage of doing sentence tokenization before word tokenization?

Good question!! - alvas

1 Answer

The default tokenizer used in NLTK (nltk.word_tokenize) is the TreebankWordTokenizer, which originally came from Michael Heilman's tokenizer.sed. We can see that tokenizer.sed states:
# Assume sentence tokenization has been done first, so split FINAL periods only. 
s=\([^.]\)\([.]\)\([])}>"']*\)[     ]*$=\1 \2\3 =g

That is, the regex always splits the final period, on the assumption that sentence tokenization has already been done. Matching tokenizer.sed, nltk.tokenize.treebank.TreebankWordTokenizer performs the same regex operation and records this assumption in its class docstring:
class TreebankWordTokenizer(TokenizerI):
    """
    The Treebank tokenizer uses regular expressions to tokenize text as in Penn Treebank.
    This is the method that is invoked by ``word_tokenize()``.  It assumes that the
    text has already been segmented into sentences, e.g. using ``sent_tokenize()``.
    This tokenizer performs the following steps:
    - split standard contractions, e.g. ``don't`` -> ``do n't`` and ``they'll`` -> ``they 'll``
    - treat most punctuation characters as separate tokens
    - split off commas and single quotes, when followed by whitespace
    - separate periods that appear at the end of line
    """

More specifically, the "separate periods that appear at the end of line" step refers to this particular regex:
# Handles the final period.
# NOTE: the second regex is the replacement during re.sub()
re.compile(r'([^\.])(\.)([\]\)}>"\']*)\s*$'), r'\1 \2\3 ')
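
You can apply that regex in isolation to see why the one-sentence-at-a-time assumption matters (a minimal sketch; final_period is my own name for the compiled pattern):

import re

# The Treebank final-period rule, applied on its own.
final_period = re.compile(r'([^\.])(\.)([\]\)}>"\']*)\s*$')

# Only the period that ends the whole string is split off; the
# sentence-internal "it." is left untouched.
print(final_period.sub(r'\1 \2\3 ', 'I like it. I really do.'))
# 'I like it. I really do . '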

Is it common to do sentence tokenization before word tokenization?

Maybe, maybe not; it depends on your task and how you evaluate it. If we look at other word tokenizers, we find that they perform the same final-period split, e.g. in the Moses (SMT) tokenizer:

# Assume sentence tokenization has been done first, so split FINAL periods only.
$text =~ s=([^.])([.])([\]\)}>"']*) ?$=$1 $2$3 =g;

Likewise, in the NLTK port of the Moses tokenizer:

# Splits final period at end of string.
FINAL_PERIOD = r"""([^.])([.])([\]\)}>"']*) ?$""", r'\1 \2\3'

Additionally, there is toktok.pl and its NLTK port; and for users who do not want their text sentence-tokenized first, the preserve_line option has been available since v3.4. For the reasons why and more information, see https://github.com/nltk/nltk/issues/1699.
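
The effect of skipping sentence tokenization is easy to reproduce with preserve_line (my example; outputs as produced by recent NLTK versions):

from nltk.tokenize import word_tokenize

text = "I like it. I really do."
# Without sentence splitting, only the string-final period is separated,
# so the sentence-internal "it." stays glued to its word.
print(word_tokenize(text, preserve_line=True))
# ['I', 'like', 'it.', 'I', 'really', 'do', '.']
print(word_tokenize(text))
# ['I', 'like', 'it', '.', 'I', 'really', 'do', '.']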

Why is this assumption necessary for the development of these word tokenizers? Is there a paper explaining the reasons and motivation? - jkarimi
It looks like a treebank is itself a corpus "annotating syntactic or semantic sentence structure... often built on top of a corpus that has already been annotated with part-of-speech tags". So training such models does not strictly require this assumption, but at least for treebank-based models, the models themselves are built around sentences. - jkarimi
Good question! I keep asking myself the same thing: why is the sentence the default unit in #nlproc rather than paragraphs or documents? And if we flip the question, why not morphemes? Or sentence pieces, e.g. https://github.com/google/sentencepiece - alvas
