How to relate the language model score of a whole sentence to the scores of its constituents

Question


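I trained a KENLM language model on about 5000 English sentences/paragraphs. I want to query this ARPA model with two or more fragments and see whether they can be joined to form a longer sentence, hopefully a more grammatical one. Below is the Python code I used to get the log scores, and the corresponding powers of ten, for the fragments and for the "sentence". I give two examples. Obviously, the sentence in the first example is more grammatical than the one in the second. My question, however, is not about that, but about how to relate the language model score of the whole sentence to the scores of its constituents. That is, how can I tell whether the sentence is grammatically better than its constituents?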

import math
import kenlm as kl

# Load the trained model
model = kl.LanguageModel(r'D:\seg.arpa.bin')

def report(sentence):
    # Print the sentence, its log10 score, and the corresponding probability
    print(sentence)
    log_score = model.score(sentence)
    print(log_score)
    print(math.pow(10, log_score))

print('************')
report('Mr . Yamada was elected Chairperson of')
report('the Drafting Committee by acclamation .')
report('Mr . Yamada was elected Chairperson of the Drafting Committee by acclamation .')
print('-------------')
report('Cases cited in the present volume ix')
report('Multilateral instruments cited in the present volume xiii')
report('Cases cited in the present volume ix Multilateral instruments cited in the present volume xiii')

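************
Mr . Yamada was elected Chairperson of
-34.0706558228
8.49853715087e-35
the Drafting Committee by acclamation .
-28.3745193481
4.22163470933e-29
Cases cited in the present volume ix
-27.7353248596
1.83939558773e-28
Multilateral instruments cited in the present volume xiii
-34.4523620605
3.52888852435e-35
Cases cited in the present volume ix Multilateral instruments cited in the present volume xiii
-60.7075233459
1.9609957573e-61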
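One way to relate these numbers: log10 scores add when probabilities multiply, so the sum of two fragment scores is the score the joined text would get if the model treated the fragments as independent. The sketch below uses only the three figures reported for the second example; the helper name `join_gain` is hypothetical and not part of kenlm.

```python
import math


def join_gain(score_a, score_b, score_joined):
    """Return log10 of P(joined) / (P(a) * P(b)).

    A positive value means the model prefers the joined text over the two
    fragments scored independently; a negative value means the opposite.
    """
    return score_joined - (score_a + score_b)


# log10 scores reported in the question for the second example
cases = -27.7353248596
instruments = -34.4523620605
joined = -60.7075233459

gain = join_gain(cases, instruments, joined)
print(gain)                # about 1.48
print(math.pow(10, gain))  # about 30, the ratio P(joined) / (P(a) * P(b))
```

A caveat on reading the gain: by default `model.score(sentence, bos=True, eos=True)` starts from the `<s>` context and also scores a closing `</s>`, so each fragment score includes a sentence start and a sentence end that the joined text does not have at the junction. Part of any positive gain may come from that alone. Scoring the left fragment with `eos=False` and the right fragment with `bos=False` should give a fairer comparison.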

- Wei JIANG

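Maybe you need a parser. An n-gram language model, at least, cannot capture grammatical correctness. Recurrent neural language models, however, have shown some interesting properties in this direction. - user3639557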

1 Answer


- SilentFlame · Accepted Answer

Use:

list(model.full_scores(sent))

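This returns the details of the sentence's constituents, i.e. its words. The result is a list that you iterate over to access the details of each word. For each word in the sentence, the list item contains the log (base 10) probability, the n-gram length, and whether the word is OOV (out of vocabulary).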
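As a hedged sketch of how this per-word output ties back to the sentence score: the log10 probabilities yielded by `full_scores` sum to `model.score` for the same sentence, so the cost of joining two fragments can be located at the first word after the junction. The helper names `word_table` and `boundary_cost` are hypothetical, and `model` is assumed to be any object exposing kenlm's `full_scores(sentence, bos=True, eos=True)`. No real model file is loaded here.

```python
def word_table(model, sentence):
    """Pair each token (plus the closing </s>) with its full_scores entry."""
    tokens = sentence.split() + ["</s>"]
    rows = []
    for token, (logprob, ngram_len, oov) in zip(tokens, model.full_scores(sentence)):
        rows.append((token, logprob, ngram_len, oov))
    return rows


def boundary_cost(model, left, right):
    """Score of the first word of `right` when it follows `left`.

    Returns (token, log10 probability, matched n-gram length) for the word
    that sits right after the junction in the joined sentence.
    """
    rows = word_table(model, left + " " + right)
    token, logprob, ngram_len, _oov = rows[len(left.split())]
    return token, logprob, ngram_len


# Assumed usage with a real model (path taken from the question):
#   import kenlm
#   model = kenlm.LanguageModel(r'D:\seg.arpa.bin')
#   left = 'Mr . Yamada was elected Chairperson of'
#   right = 'the Drafting Committee by acclamation .'
#   rows = word_table(model, left + ' ' + right)
#   total = sum(logprob for _, logprob, _, _ in rows)  # should match model.score up to rounding
#   print(boundary_cost(model, left, right))
```

A short matched n-gram length at the junction word suggests the model backed off there, meaning it did not see those words together in training. That points to where a join is weak, rather than giving only one number for the whole sentence.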