根据 OP 提供的论文,以下字符串:
Insurgents killed in ongoing fighting
生成如下结果:
2-skip-bi-grams = {insurgents killed, insurgents in, insurgents
ongoing, killed in, killed ongoing, killed fighting, in ongoing, in
fighting, ongoing fighting}
2-skip-tri-grams = {insurgents killed in, insurgents killed ongoing,
insurgents killed fighting, insurgents in ongoing, insurgents in
fighting, insurgents ongoing fighting, killed in ongoing, killed in
fighting, killed ongoing fighting, in ongoing fighting}.
稍微修改了 NLTK 的ngrams
代码(https://github.com/nltk/nltk/blob/develop/nltk/util.py#L383):
from itertools import chain, combinations
import copy
from nltk.util import ngrams
def pad_sequence(sequence, n, pad_left=False, pad_right=False, pad_symbol=None):
if pad_left:
sequence = chain((pad_symbol,) * (n-1), sequence)
if pad_right:
sequence = chain(sequence, (pad_symbol,) * (n-1))
return sequence
def skipgrams(sequence, n, k, pad_left=False, pad_right=False, pad_symbol=None):
sequence_length = len(sequence)
sequence = iter(sequence)
sequence = pad_sequence(sequence, n, pad_left, pad_right, pad_symbol)
if sequence_length + pad_left + pad_right < k:
raise Exception("The length of sentence + padding(s) < skip")
if n < k:
raise Exception("Degree of Ngrams (n) needs to be bigger than skip (k)")
history = []
nk = n+k
if nk < 1:
return
elif nk > sequence_length:
for ng in skipgrams(list(sequence), n, k-1):
yield ng
while nk > 1:
history.append(next(sequence))
nk -= 1
for item in sequence:
history.append(item)
current_token = history.pop(0)
for idx in list(combinations(range(len(history)), n-1)):
ng = [current_token]
for _id in idx:
ng.append(history[_id])
yield tuple(ng)
for ng in list(skipgrams(history, n, k-1)):
yield ng
让我们进行一些doctest,以匹配论文中的例子:
>>> two_skip_bigrams = list(skipgrams(text, n=2, k=2))
[('Insurgents', 'killed'), ('Insurgents', 'in'), ('Insurgents', 'ongoing'), ('killed', 'in'), ('killed', 'ongoing'), ('killed', 'fighting'), ('in', 'ongoing'), ('in', 'fighting'), ('ongoing', 'fighting')]
>>> two_skip_trigrams = list(skipgrams(text, n=3, k=2))
[('Insurgents', 'killed', 'in'), ('Insurgents', 'killed', 'ongoing'), ('Insurgents', 'killed', 'fighting'), ('Insurgents', 'in', 'ongoing'), ('Insurgents', 'in', 'fighting'), ('Insurgents', 'ongoing', 'fighting'), ('killed', 'in', 'ongoing'), ('killed', 'in', 'fighting'), ('killed', 'ongoing', 'fighting'), ('in', 'ongoing', 'fighting')]
但请注意,如果n+k>len(sequence)
,它将产生与skipgrams(sequence,n,k-1)
相同的效果(这不是一个错误,而是一项故障安全功能),例如。
>>> three_skip_trigrams = list(skipgrams(text, n=3, k=3))
>>> three_skip_fourgrams = list(skipgrams(text, n=4, k=3))
>>> four_skip_fourgrams = list(skipgrams(text, n=4, k=4))
>>> four_skip_fivegrams = list(skipgrams(text, n=5, k=4))
>>>
>>> print len(three_skip_trigrams), three_skip_trigrams
10 [('Insurgents', 'killed', 'in'), ('Insurgents', 'killed', 'ongoing'), ('Insurgents', 'killed', 'fighting'), ('Insurgents', 'in', 'ongoing'), ('Insurgents', 'in', 'fighting'), ('Insurgents', 'ongoing', 'fighting'), ('killed', 'in', 'ongoing'), ('killed', 'in', 'fighting'), ('killed', 'ongoing', 'fighting'), ('in', 'ongoing', 'fighting')]
>>> print len(three_skip_fourgrams), three_skip_fourgrams
5 [('Insurgents', 'killed', 'in', 'ongoing'), ('Insurgents', 'killed', 'in', 'fighting'), ('Insurgents', 'killed', 'ongoing', 'fighting'), ('Insurgents', 'in', 'ongoing', 'fighting'), ('killed', 'in', 'ongoing', 'fighting')]
>>> print len(four_skip_fourgrams), four_skip_fourgrams
5 [('Insurgents', 'killed', 'in', 'ongoing'), ('Insurgents', 'killed', 'in', 'fighting'), ('Insurgents', 'killed', 'ongoing', 'fighting'), ('Insurgents', 'in', 'ongoing', 'fighting'), ('killed', 'in', 'ongoing', 'fighting')]
>>> print len(four_skip_fivegrams), four_skip_fivegrams
1 [('Insurgents', 'killed', 'in', 'ongoing', 'fighting')]
这使得 n == k
是允许的,但是 n > k
是不允许的,就像以下代码所示:
```代码部分保持不变```
if n < k:
raise Exception("Degree of Ngrams (n) needs to be bigger than skip (k)")
为了便于理解,让我们尝试理解这个“神秘”的行:
for idx in list(combinations(range(len(history)), n-1)):
pass
给定一个唯一项列表,组合生成如下内容:
>>> from itertools import combinations
>>> x = [0,1,2,3,4,5]
>>> list(combinations(x,2))
[(0, 1), (0, 2), (0, 3), (0, 4), (0, 5), (1, 2), (1, 3), (1, 4), (1, 5), (2, 3), (2, 4), (2, 5), (3, 4), (3, 5), (4, 5)]
由于标记列表的索引始终是唯一的,例如
>>> sent = ['this', 'is', 'a', 'foo', 'bar']
>>> current_token = sent.pop(0)
>>> range(len(sent))
[0,1,2,3]
可以计算范围内所有可能的组合(无重复)
>>> n = 3
>>> list(combinations(range(len(sent)), n-1))
[(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
如果我们将索引映射回令牌列表:
>>> [tuple(sent[id] for id in idx) for idx in combinations(range(len(sent)), 2)
[('is', 'a'), ('is', 'foo'), ('is', 'bar'), ('a', 'foo'), ('a', 'bar'), ('foo', 'bar')]
然后我们将其与current_token
连接,得到当前标记和上下文+跳跃窗口的skipgrams:
>>> [tuple([current_token]) + tuple(sent[id] for id in idx) for idx in combinations(range(len(sent)), 2)]
[('this', 'is', 'a'), ('this', 'is', 'foo'), ('this', 'is', 'bar'), ('this', 'a', 'foo'), ('this', 'a', 'bar'), ('this', 'foo', 'bar')]
那么在此之后,我们继续进行下一个单词。