Lemmatizing Italian sentences for frequency counting

Question

10

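I would like to lemmatize some Italian text in order to count word frequencies and to carry out further analysis on the lemmatized output.

I prefer lemmatization to stemming because it lets me use the sentence context to work out what a word means (for example, to tell a verb from a noun), and because it yields words that exist in the language rather than roots, which usually have no meaning on their own.

I found a library called pattern (pip2 install pattern) that is supposed to complement nltk for lemmatizing Italian. However, I am not sure the approach below is correct, because each word is lemmatized on its own rather than in the context of its sentence.

Perhaps I should let pattern tokenize the sentence (so that it also annotates each word with metadata such as verb/noun/adjective) and then retrieve the lemmatized words, but I have not managed to do that and I am no longer sure it is even possible.

Also: in Italian some articles are written with an apostrophe, so for example "l'appartamento" ("the flat" in English) is actually two words: "lo" and "appartamento". So far I have not found a way to split these two words with a combination of nltk and pattern, so I cannot count word frequencies correctly.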

import nltk
import string
import pattern.it  # import the submodule explicitly so pattern.it.parse resolves

# dictionary of Italian stop-words
it_stop_words = nltk.corpus.stopwords.words('italian')
# Snowball stemmer with rules for the Italian language
ita_stemmer = nltk.stem.snowball.ItalianStemmer()

# the following function just gets the lemma
# of the original input word (but right now
# it may be losing the context of the sentence
# the word comes from, i.e. the same word
# could be a noun, a verb or an adjective
# depending on the context)
def lemmatize_word(input_word):
    in_word = input_word#.decode('utf-8')
    # print('Something: {}'.format(in_word))
    word_it = pattern.it.parse(
        in_word, 
        tokenize=False,  
        tag=False,  
        chunk=False,  
        lemmata=True 
    )
    # print("Input: {} Output: {}".format(in_word, word_it))
    the_lemmatized_word = word_it.split()[0][0][4]
    # print("Returning: {}".format(the_lemmatized_word))
    return the_lemmatized_word

it_string = "Ieri sono andato in due supermercati. Oggi volevo andare all'ippodromo. Stasera mangio la pizza con le verdure."

# 1st tokenize the sentence(s)
word_tokenized_list = nltk.tokenize.word_tokenize(it_string)
print("1) NLTK tokenizer, num words: {} for list: {}".format(len(word_tokenized_list), word_tokenized_list))

# 2nd remove punctuation and lower-case everything
# (str.lower() works on both Python 2 and 3; string.lower() is Python 2 only)
word_tokenized_no_punct = [x.lower() for x in word_tokenized_list if x not in string.punctuation]
print("2) Clean punctuation, num words: {} for list: {}".format(len(word_tokenized_no_punct), word_tokenized_no_punct))

# 3rd remove stop words (for the Italian language)
word_tokenized_no_punct_no_sw = [x for x in word_tokenized_no_punct if x not in it_stop_words]
print("3) Clean stop-words, num words: {} for list: {}".format(len(word_tokenized_no_punct_no_sw), word_tokenized_no_punct_no_sw))

# 4.1 lemmatize the words
word_tokenize_list_no_punct_lc_no_stowords_lemmatized = [lemmatize_word(x) for x in word_tokenized_no_punct_no_sw]
print("4.1) lemmatizer, num words: {} for list: {}".format(len(word_tokenize_list_no_punct_lc_no_stowords_lemmatized), word_tokenize_list_no_punct_lc_no_stowords_lemmatized))

# 4.2 snowball stemmer for Italian
word_tokenize_list_no_punct_lc_no_stowords_stem = [ita_stemmer.stem(i) for i in word_tokenized_no_punct_no_sw]
print("4.2) stemmer, num words: {} for list: {}".format(len(word_tokenize_list_no_punct_lc_no_stowords_stem), word_tokenize_list_no_punct_lc_no_stowords_stem))

# difference between stemmer and lemmatizer
print(
    "For original word(s) '{}' and '{}' the stemmer: '{}' '{}' (count 1 each), the lemmatizer: '{}' '{}' (count 2)"
    .format(
        word_tokenized_no_punct_no_sw[1],
        word_tokenized_no_punct_no_sw[6],
        word_tokenize_list_no_punct_lc_no_stowords_stem[1],
        word_tokenize_list_no_punct_lc_no_stowords_stem[6],
        word_tokenize_list_no_punct_lc_no_stowords_lemmatized[1],
        word_tokenize_list_no_punct_lc_no_stowords_lemmatized[1]
    )
)

This gives the following output:

1) NLTK tokenizer, num words: 20 for list: ['Ieri', 'sono', 'andato', 'in', 'due', 'supermercati', '.', 'Oggi', 'volevo', 'andare', "all'ippodromo", '.', 'Stasera', 'mangio', 'la', 'pizza', 'con', 'le', 'verdure', '.']
2) Clean punctuation, num words: 17 for list: ['ieri', 'sono', 'andato', 'in', 'due', 'supermercati', 'oggi', 'volevo', 'andare', "all'ippodromo", 'stasera', 'mangio', 'la', 'pizza', 'con', 'le', 'verdure']
3) Clean stop-words, num words: 12 for list: ['ieri', 'andato', 'due', 'supermercati', 'oggi', 'volevo', 'andare', "all'ippodromo", 'stasera', 'mangio', 'pizza', 'verdure']
4.1) lemmatizer, num words: 12 for list: [u'ieri', u'andarsene', u'due', u'supermercato', u'oggi', u'volere', u'andare', u"all'ippodromo", u'stasera', u'mangiare', u'pizza', u'verdura']
4.2) stemmer, num words: 12 for list: [u'ier', u'andat', u'due', u'supermerc', u'oggi', u'vol', u'andar', u"all'ippodrom", u'staser', u'mang', u'pizz', u'verdur']
For original word(s) 'andato' and 'andare' the stemmer: 'andat' 'andar' (count 1 each), the lemmatizer: 'andarsene' 'andarsene' (count 2)

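How can I lemmatize sentences effectively using pattern's tokenizer? (Assuming lemmas are recognized as nouns/verbs/adjectives, etc.)
Is there a Python alternative to pattern that can be used with nltk for Italian lemmatization?
How can I split articles that are bound to the following word by an apostrophe?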

- TPPZ

3 Answers

3

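I know this question was settled a few years ago, but I ran into the same problem with nltk tokenization and Python 3 when parsing words such as all'ippodromo or dall'Italia. So I want to share my experience and give a partial answer, late as it is.

The first action/rule that NLP has to take into account is preparing the corpus. What I found is that replacing the ' character with the proper accent ’ while parsing the text, using an accurate regex replacement (or simply a careful replace-all in a basic text editor), and then calling just nltk.tokenize.word_tokenize(text), tokenizes correctly and gives the right split.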
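A minimal sketch of the replacement step described above, using only the standard library. The function name `normalize_apostrophes` is hypothetical, and the pattern assumes that only apostrophes sitting between two word characters should be replaced. The later call to `nltk.tokenize.word_tokenize` is left out, and its behaviour on the normalized text is not verified here.

```python
import re

# Matches a straight apostrophe that sits between two word characters,
# as in "all'ippodromo" or "dall'Italia" (assumption: other apostrophes,
# e.g. quotation marks, should be left untouched).
_ELISION_APOSTROPHE = re.compile(r"(?<=\w)'(?=\w)")


def normalize_apostrophes(text):
    """Replace elision apostrophes with the typographic character U+2019."""
    return _ELISION_APOSTROPHE.sub("\u2019", text)


print(normalize_apostrophes("Oggi volevo andare all'ippodromo."))
```

Doing the replacement in code rather than in a text editor keeps the corpus preparation repeatable when new texts are added.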

- Leonardo

1

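For anyone who runs into this problem in the future: I used an approach that works for lemmatizing non-English text (Italian, in my case). The spaCy library offers more complete and more sophisticated text analysis than NLTK.

The result of this snippet is included in the comment at the end of it.

First, you need to install the library together with support for the language you want.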

!pip install spacy
!python -m spacy download it_core_news_sm

After that, you can lemmatize whatever text you like.

import spacy

# Load the italian model
nlp = spacy.load("it_core_news_sm")

# Sample text
testo = "Mi piaceva tanto programmare in Python. Tant'è che ho ripreso a farlo guardando dei video su YouTube"

# Analyze the text
doc = nlp(testo)

# Extract lemmas from the analysed words
lemmi = [token.lemma_ for token in doc]

# Print lemmas
print(lemmi)

# output: ['mi', 'piacere', 'tanto', 'programmare', 'in', 'Python', '.', 'Tanta', 'essere', 'che', 'avere', 'riprendere', 'a', 'fare lo', 'guardare', 'di il', 'video', 'su', 'YouTube']

- Federico Scaltriti

If all you need is sentence processing and lemmatization, spaCy is far too slow compared with nltk.

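@AntonioSesto, it is indeed slower while the model loads, but that happens only at startup, and you only need to do it once. Besides, it actually performs lemmatization! With nltk, the sentence provided was not lemmatized; it could not find the lemmatized form of the words. With my suggestion, by contrast, the output would be:

['ieri', 'essere', 'andare', 'due', 'supermercato', '.', 'oggi', 'volere', 'andare', 'ippodromo', '.', 'stasera', 'mangiare', 'pizza', 'con', 'verdura']

which is the correct lemmatization.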

- Adonis · Accepted Answer

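I will try to answer your question, although I do not know Italian!

1) As far as I know, handling the apostrophe is mainly the tokenizer's job, so the nltk Italian tokenizer seems to have failed here.

3) A quick thing you can do is split each token on the apostrophe with the string split method (though you may need the re package for more complicated patterns), for example: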

word_tokenized_no_punct_no_sw_no_apostrophe = [x.split("'") for x in word_tokenized_no_punct_no_sw]
word_tokenized_no_punct_no_sw_no_apostrophe = [y for x in word_tokenized_no_punct_no_sw_no_apostrophe for y in x]

It produces:

['ieri', 'andato', 'due', 'supermercati', 'oggi', 'volevo', 'andare', 'all', 'ippodromo', 'stasera', 'mangio', 'pizza', 'verdure']
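For the more complicated patterns mentioned above, a regex-based variant might look like the following sketch. The helper name `split_elisions` is hypothetical. Unlike the plain split, it keeps the apostrophe attached to the elided word, on the assumption that "all'" is easier to recognize as a preposition plus article than a bare "all".

```python
import re

# One chunk is either a run of non-apostrophe characters followed by an
# optional apostrophe, or a lone apostrophe.
_CHUNK = re.compile(r"[^']+'?|'")


def split_elisions(tokens):
    """Split tokens such as "all'ippodromo" into "all'" and "ippodromo"."""
    result = []
    for token in tokens:
        result.extend(_CHUNK.findall(token))
    return result


tokens = ['volevo', 'andare', "all'ippodromo", 'stasera']
print(split_elisions(tokens))
```

Tokens without an apostrophe pass through unchanged, and truncated forms such as "po'" stay in one piece.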

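2) As an alternative to pattern you can use treetagger. It is not the easiest tool to install (you need both the Python package and the tool itself), but once installed it works on Windows and Linux.

Here is a simple example: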

import treetaggerwrapper 
from pprint import pprint

it_string = "Ieri sono andato in due supermercati. Oggi volevo andare all'ippodromo. Stasera mangio la pizza con le verdure."
tagger = treetaggerwrapper.TreeTagger(TAGLANG="it")
tags = tagger.tag_text(it_string)
pprint(treetaggerwrapper.make_tags(tags))

pprint prints the following:

[Tag(word=u'Ieri', pos=u'ADV', lemma=u'ieri'),
 Tag(word=u'sono', pos=u'VER:pres', lemma=u'essere'),
 Tag(word=u'andato', pos=u'VER:pper', lemma=u'andare'),
 Tag(word=u'in', pos=u'PRE', lemma=u'in'),
 Tag(word=u'due', pos=u'ADJ', lemma=u'due'),
 Tag(word=u'supermercati', pos=u'NOM', lemma=u'supermercato'),
 Tag(word=u'.', pos=u'SENT', lemma=u'.'),
 Tag(word=u'Oggi', pos=u'ADV', lemma=u'oggi'),
 Tag(word=u'volevo', pos=u'VER:impf', lemma=u'volere'),
 Tag(word=u'andare', pos=u'VER:infi', lemma=u'andare'),
 Tag(word=u"all'", pos=u'PRE:det', lemma=u'al'),
 Tag(word=u'ippodromo', pos=u'NOM', lemma=u'ippodromo'),
 Tag(word=u'.', pos=u'SENT', lemma=u'.'),
 Tag(word=u'Stasera', pos=u'ADV', lemma=u'stasera'),
 Tag(word=u'mangio', pos=u'VER:pres', lemma=u'mangiare'),
 Tag(word=u'la', pos=u'DET:def', lemma=u'il'),
 Tag(word=u'pizza', pos=u'NOM', lemma=u'pizza'),
 Tag(word=u'con', pos=u'PRE', lemma=u'con'),
 Tag(word=u'le', pos=u'DET:def', lemma=u'il'),
 Tag(word=u'verdure', pos=u'NOM', lemma=u'verdura'),
 Tag(word=u'.', pos=u'SENT', lemma=u'.')]

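Before lemmatizing, it also tokenized all'ippodromo quite nicely into al and ippodromo (which is hopefully correct). Now we only need to remove the stop words and punctuation and we are done.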
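The remaining filtering and counting step could be sketched as follows. To keep the example runnable without TreeTagger, `Tag` is a stand-in namedtuple with the same fields as the tuples printed above, `STOP_WORDS` is a small illustrative subset (an assumption; in practice the list from nltk.corpus.stopwords.words('italian') could be used), and `lemma_frequencies` is a hypothetical helper name.

```python
from collections import Counter, namedtuple
import string

# Stand-in for the Tag tuples shown above (assumption: same field names).
Tag = namedtuple("Tag", ["word", "pos", "lemma"])

# Illustrative subset of Italian stop words, matched against surface forms.
STOP_WORDS = {"sono", "in", "all", "la", "con", "le"}


def lemma_frequencies(tags, stop_words=STOP_WORDS):
    """Count lemmas, skipping punctuation and stop words."""
    counts = Counter()
    for tag in tags:
        # Drop a trailing apostrophe so "all'" is compared as "all".
        word = tag.word.lower().rstrip("'")
        if all(ch in string.punctuation for ch in tag.word):
            continue  # token consists only of punctuation
        if word in stop_words:
            continue
        counts[tag.lemma.lower()] += 1
    return counts


tags = [
    Tag("sono", "VER:pres", "essere"),
    Tag("andato", "VER:pper", "andare"),
    Tag("volevo", "VER:impf", "volere"),
    Tag("andare", "VER:infi", "andare"),
    Tag("all'", "PRE:det", "al"),
    Tag("ippodromo", "NOM", "ippodromo"),
    Tag(".", "SENT", "."),
]
print(lemma_frequencies(tags).most_common())
```

Stop words are checked against the surface form and the count is taken on the lemma, so "andato" and "andare" end up in the same bucket.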

Python documentation for installing the TreeTaggerWrapper library