You can use the new nltk.lm module. Here is an example; first, fetch some data and tokenize it:
import os
import requests
import io
from nltk import word_tokenize, sent_tokenize

# Reuse the local copy if we've already downloaded the text.
if os.path.isfile('language-never-random.txt'):
    with io.open('language-never-random.txt', encoding='utf8') as fin:
        text = fin.read()
else:
    url = "https://gist.githubusercontent.com/alvations/53b01e4076573fea47c6057120bb017a/raw/b01ff96a5f76848450e648f35da6497ca9454e4a/language-never-random.txt"
    text = requests.get(url).content.decode('utf8')
    with io.open('language-never-random.txt', 'w', encoding='utf8') as fout:
        fout.write(text)

# One list of lowercased tokens per sentence.
tokenized_text = [list(map(str.lower, word_tokenize(sent)))
                  for sent in sent_tokenize(text)]
Next, the language modeling:
from nltk.lm.preprocessing import padded_everygram_pipeline
from nltk.lm import MLE
n = 3
train_data, padded_sents = padded_everygram_pipeline(n, tokenized_text)
model = MLE(n)
model.fit(train_data, padded_sents)
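Once fitted, the model can also sample text with `generate`. A minimal sketch, using a toy corpus that stands in for the tokenized text above so it runs without the download:

```python
from nltk.lm.preprocessing import padded_everygram_pipeline
from nltk.lm import MLE

# Toy corpus standing in for the tokenized_text built above (assumption:
# any list of token lists trains the same way).
tokenized_text = [['language', 'is', 'never', 'random'],
                  ['language', 'is', 'structured']]
n = 3
train_data, padded_sents = padded_everygram_pipeline(n, tokenized_text)
model = MLE(n)
model.fit(train_data, padded_sents)

# Sample 5 tokens; random_seed makes the output reproducible.
print(model.generate(5, random_seed=42))
```

With `num_words > 1`, `generate` returns a list of sampled tokens, conditioning each draw on the previously generated context.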
To get the counts:
model.counts['language']                   # unigram count of 'language'
model.counts[['language']]['is']           # count of the bigram ('language', 'is')
model.counts[['language', 'is']]['never']  # count of the trigram ('language', 'is', 'never')
To get the probabilities:
model.score('is', 'language'.split())        # P('is' | 'language')
model.score('never', 'language is'.split())  # P('never' | 'language', 'is')
Loading notebooks on the Kaggle platform is somewhat flaky, but this notebook should give a good overview of the nltk.lm module: https://www.kaggle.com/alvations/n-gram-language-model-with-nltk
How do you pip install nltk.lm? The module doesn't seem to be there when I install nltk. - Ahmad
pip install -U nltk>=3.4 - alvas
What about a model.score(sentence) function? Computing the score of a whole sentence is not entirely straightforward if, for example, we use backoff. - Simone
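For a plain MLE model (no backoff), a whole-sentence score can be sketched by summing per-token log probabilities. `sentence_logscore` below is a hypothetical helper, not part of nltk, and the single-sentence toy corpus stands in for the data above:

```python
from nltk.lm.preprocessing import padded_everygram_pipeline, pad_both_ends
from nltk.lm import MLE
from nltk.util import ngrams

def sentence_logscore(model, tokens, n=3):
    """Sum of per-word log2 probabilities under the model.

    Pads the sentence the same way the training data was padded, then
    scores each word given its (n-1)-gram context. Under MLE any unseen
    ngram has probability 0, so the result can be -inf.
    """
    padded = list(pad_both_ends(tokens, n=n))
    return sum(model.logscore(word, tuple(context))
               for *context, word in ngrams(padded, n))

# Toy corpus standing in for the tokenized text above (assumption).
train_sents = [['language', 'is', 'never', 'random']]
n = 3
train_data, padded_sents = padded_everygram_pipeline(n, train_sents)
model = MLE(n)
model.fit(train_data, padded_sents)

score = sentence_logscore(model, ['language', 'is', 'never', 'random'], n=3)
# With this one-sentence corpus every trigram is deterministic, so the
# total log2 probability is 0.0.
print(score)
```

For smoothed models (e.g. nltk.lm.Laplace or KneserNeyInterpolated) the same summation works but no longer hits -inf on unseen ngrams.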