The following word2ngrams function extracts character 3-grams from a word:
>>> x = 'foobar'
>>> n = 3
>>> [x[i:i+n] for i in range(len(x)-n+1)]
['foo', 'oob', 'oba', 'bar']
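This post shows character n-gram extraction for a single word: Quick implementation of character n-grams using python. But if I have sentences and want to extract their character n-grams, is there a faster method than calling word2ngrams() over and over?
How can I get the same word2ngrams and sent2ngrams output using regular expressions? Would that be faster?
I've tried: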
import string, random, time
from itertools import chain

def word2ngrams(text, n=3):
    """ Convert word into character ngrams. """
    return [text[i:i+n] for i in range(len(text)-n+1)]

def sent2ngrams(text, n=3):
    """ Split on whitespace, then collect the ngrams of every word. """
    return list(chain(*[word2ngrams(i,n) for i in text.lower().split()]))

def sent2ngrams_simple(text, n=3):
    """ Slide over the whole sentence and drop windows that contain a space. """
    text = text.lower()
    return [text[i:i+n] for i in range(len(text)-n+1) if " " not in text[i:i+n]]

# Generate 100 random sentences, each made of 100 random 10-letter words.
sents = [" ".join([''.join(random.choice(string.ascii_uppercase) for j in range(10)) for i in range(100)]) for k in range(100)]

start = time.time()
x = [sent2ngrams(i) for i in sents]
print(time.time() - start)

start = time.time()
y = [sent2ngrams_simple(i) for i in sents]
print(time.time() - start)

print(x==y)
[out]:
0.0205280780792
0.0271739959717
True
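The two timings above come from a single run each and differ by only a few milliseconds, so the comparison may be noisy. A minimal sketch of a more repeatable measurement with the standard timeit module follows. It assumes Python 3, and the helper name best_time, the fixed seed, and the repeat and number values are illustrative choices that are not part of the original post.

```python
import random
import string
import timeit
from itertools import chain

def word2ngrams(text, n=3):
    """Convert a word into character n-grams."""
    return [text[i:i+n] for i in range(len(text) - n + 1)]

def sent2ngrams(text, n=3):
    return list(chain(*[word2ngrams(w, n) for w in text.lower().split()]))

def sent2ngrams_simple(text, n=3):
    text = text.lower()
    return [text[i:i+n] for i in range(len(text) - n + 1) if " " not in text[i:i+n]]

# Same test data shape as in the question: 100 sentences of 100 ten-letter words.
random.seed(0)  # assumption: fixed seed so that runs are comparable
sents = [" ".join("".join(random.choice(string.ascii_uppercase) for _ in range(10))
                  for _ in range(100))
         for _ in range(100)]

def best_time(func, repeat=3, number=5):
    """Hypothetical helper: best per-call time (seconds) over several repeats."""
    timer = timeit.Timer(lambda: [func(s) for s in sents])
    return min(timer.repeat(repeat=repeat, number=number)) / number

t_words = best_time(sent2ngrams)
t_simple = best_time(sent2ngrams_simple)
print(t_words, t_simple)
```

Taking the minimum over several repeats reduces the influence of other processes on the measurement. The actual numbers depend on the machine and the Python version.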
EDITED
The regex method looks elegant, but it is slower than iteratively calling word2ngrams():
import string, random, time, re
from itertools import chain

def word2ngrams(text, n=3):
    """ Convert word into character ngrams. """
    return [text[i:i+n] for i in range(len(text)-n+1)]

def sent2ngrams(text, n=3):
    return list(chain(*[word2ngrams(i,n) for i in text.lower().split()]))

def sent2ngrams_simple(text, n=3):
    text = text.lower()
    return [text[i:i+n] for i in range(len(text)-n+1) if " " not in text[i:i+n]]

def sent2ngrams_regex(text, n=3):
    # Lookahead with a capture group of n non-whitespace characters.
    rgx = '(?=(' + r'\S'*n + '))'
    # Lowercase first, like the other two functions, so that x == y == z can hold.
    return re.findall(rgx, text.lower())

# Generate 100 random sentences, each made of 100 random 10-letter words.
sents = [" ".join([''.join(random.choice(string.ascii_uppercase) for j in range(10)) for i in range(100)]) for k in range(100)]

start = time.time()
x = [sent2ngrams(i) for i in sents]
print(time.time() - start)

start = time.time()
y = [sent2ngrams_simple(i) for i in sents]
print(time.time() - start)

start = time.time()
z = [sent2ngrams_regex(i) for i in sents]
print(time.time() - start)

print(x==y==z)
[out]:
0.0211708545685
0.0284190177917
0.0303599834442
True
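To make the pattern built in sent2ngrams_regex easier to follow, here is a small sketch of what the lookahead does on short inputs. The variable names are illustrative only. A lookahead (?=...) matches without consuming characters, so re.findall tries every position in turn, and the capture group inside it returns the overlapping window found at that position.

```python
import re

# '.' matches any character except a newline, so on a single word the
# lookahead yields every overlapping 3-character window.
any_char = re.findall(r'(?=(...))', 'foobar')
print(any_char)    # ['foo', 'oob', 'oba', 'bar']

# Because '.' also matches a space, windows can straddle a word boundary.
with_space = re.findall(r'(?=(...))', 'foo bar')
print(with_space)  # ['foo', 'oo ', 'o b', ' ba', 'bar']

# '\S' matches only non-whitespace, so each 3-gram stays inside one word.
non_space = re.findall(r'(?=(\S\S\S))', 'foobar like')
print(non_space)   # ['foo', 'oob', 'oba', 'bar', 'lik', 'ike']
```

The last call gives the same n-grams as splitting the sentence into words and slicing each word separately.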
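What is (?=(...))? Can you give a working example? I've tried (?=('foobar')) but it gave a syntax error. - alvas

[i for i in re.findall(r'(?=(...))','foobar like') if not " " in i] - alvas

(?=(\S\S\S)). - user557597

rgx = should be compiled only once, not once per sentence; precompile it before iterating. You can also speed the regex up by 10-15% if you actively move the match position yourself, for example /(?=(\S\S\S))./ with the Dot-All modifier added (the same as /(?=(\S\S\S))[\S\s]/ or /(?s)(?=(\S\S\S))./). - user557597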
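The suggestion in the last comment could be sketched in Python roughly as follows. This is only an illustration under assumptions: n is fixed at 3, and the names RGX_LOOKAHEAD, RGX_CONSUMING and sent2ngrams_precompiled are hypothetical. The 10-15% figure is the commenter's and is not verified here. Python's re module also keeps an internal cache of recently compiled patterns, so the gain from precompiling may be modest.

```python
import re

# Compile once, outside the loop over sentences.
RGX_LOOKAHEAD = re.compile(r'(?=(\S\S\S))')

# Variant from the comment: consume one character after each lookahead so the
# pattern itself advances the match position. re.DOTALL lets '.' match
# newlines as well.
RGX_CONSUMING = re.compile(r'(?=(\S\S\S)).', re.DOTALL)

def sent2ngrams_precompiled(text, rgx=RGX_CONSUMING):
    """Return the character 3-grams of a sentence using a precompiled regex."""
    return rgx.findall(text.lower())

sents = ['Foobar like', 'character ngrams']
ngrams = [sent2ngrams_precompiled(s) for s in sents]
print(ngrams[0])  # ['foo', 'oob', 'oba', 'bar', 'lik', 'ike']
```

Both patterns return the same n-grams. Whether the consuming variant is actually faster is best checked with a benchmark on the target Python version.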