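I have a list of 50,000 strings (city names), and I need the smallest list of character trigrams (preferably n-grams) such that every string is hit by at least one trigram. Consider the following list: ['amsterdam', 'rotterdam', 'haarlem', 'utrecht', 'groningen']
The list of identifying trigrams has length 4 and should be the following (other choices are possible):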
['ter', 'haa', 'utr', 'gro']
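To make the requirement concrete, here is a minimal sketch of a checker for the "every string is hit by at least one trigram" condition. The name `covers_all` is hypothetical and not part of the original code; it only verifies a candidate list and says nothing about whether that list is the smallest one.

```python
def covers_all(strings, grams):
    """Return True if every string contains at least one of the given n-grams.

    `covers_all` is an illustrative helper name, not from the original post.
    """
    return all(any(gram in s for gram in grams) for s in strings)


cities = ['amsterdam', 'rotterdam', 'haarlem', 'utrecht', 'groningen']
print(covers_all(cities, ['ter', 'haa', 'utr', 'gro']))
print(covers_all(cities, ['ter']))
```

The first call checks the four-trigram answer above; the second shows that 'ter' alone leaves 'haarlem', 'utrecht' and 'groningen' uncovered.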
I thought my solution found the correct answer, but it gives wrong answers on some other lists.

    from collections import Counter

    def identifying_grams(list, n=3):

        def f7(seq):
            seen = set()
            seen_add = seen.add
            return [x for x in seq if not (x in seen or seen_add(x))]

        def ngrams(text, n=3):
            return [text[i:i + n] for i in range(len(text) - n + 1)]

        hits = []
        trigrams = []
        for item in list:
            # trigrams += ngrams(item)
            trigrams += f7(ngrams(item))

        counts = Counter(trigrams).most_common()

        for trigram, count in counts:
            items = []
            for item in list:
                if trigram in item:
                    hits.append(trigram)
                    items.append(item)
            for i in items:
                list.remove(i)

        return f7(hits)

    list1 = ['amsterdam','rotterdam','haarlem','utrecht','groningen']
    print(identifying_grams(list1))
    # Good, we get: ['ter', 'haa', 'utr', 'gro']

    list2 = ['amsterdam','schiedam']
    print(identifying_grams(list2))
    # Good, we get: ['dam']

    list3 = ['amsterdam','schiedam','terwolde','wolstad']
    print(identifying_grams(list3))
    # Ouch, we get: ['ter', 'dam', 'wol']
    # this should be ['dam', 'wol'] as this is only 2 trigrams that identify the list...

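
The task is an instance of minimum set cover, which is NP-hard, so a greedy pass is a heuristic rather than a guaranteed answer. The sketch below is one possible variant, not the asker's method and not either of the posted answers: it recomputes how many still-uncovered strings each n-gram hits after every pick, and it breaks ties lexicographically so repeated runs return the same list. The name `greedy_cover` is hypothetical. On list3 it returns two trigrams only because 'dam' happens to sort before 'ter'; a different tie-break could still produce three.

```python
from collections import defaultdict


def greedy_cover(strings, n=3):
    """Greedy set-cover heuristic over character n-grams (not guaranteed minimal)."""
    # Inverted index: n-gram -> indices of the strings that contain it.
    index = defaultdict(set)
    for i, s in enumerate(strings):
        for j in range(len(s) - n + 1):
            index[s[j:j + n]].add(i)

    # Strings shorter than n have no n-grams and cannot be covered; skip them.
    uncovered = {i for i, s in enumerate(strings) if len(s) >= n}

    chosen = []
    while uncovered:
        # Most uncovered strings first; ties resolved alphabetically for determinism.
        best = min(index, key=lambda g: (-len(index[g] & uncovered), g))
        chosen.append(best)
        uncovered -= index[best]
    return chosen


print(greedy_cover(['amsterdam', 'schiedam', 'terwolde', 'wolstad']))
```

Unlike the code above, this version does not modify the input list. Rescanning every n-gram on each round is the simplest formulation; for 50,000 strings a priority queue with lazy updates would likely be needed to keep it fast.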
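So far I have received two answers, but both have flaws. The one from Rupesh works well for lists of fewer than 10 items. My list has more than 50,000 items. The one from mujjiga does come up with a solution, although not a perfect one.
A bounty goes to the Python ninja who provides a perfect solution that scales. Bonus kudos if it performs well and gives the same solution every time it runs!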
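A provably minimal answer can be obtained by exhaustive search, but only for small lists, since the number of candidate subsets grows exponentially. The sketch below is an assumption-laden illustration (the name `exact_cover` is hypothetical, and it is not the code from either posted answer): it tries subsets of increasing size in sorted order, so the first hit is minimal and the result is the same on every run. It can serve as a reference for judging a heuristic on small samples of the data.

```python
from itertools import combinations


def exact_cover(strings, n=3):
    """Smallest list of n-grams hitting every string, by brute force.

    Exponential time: only practical for small inputs.
    Returns None if some string cannot be covered (e.g. it is shorter than n).
    """
    # Sorted candidates make the search order, and so the result, deterministic.
    grams = sorted({s[i:i + n] for s in strings for i in range(len(s) - n + 1)})
    for size in range(len(grams) + 1):
        for combo in combinations(grams, size):
            if all(any(g in s for g in combo) for s in strings):
                return list(combo)
    return None


print(exact_cover(['amsterdam', 'schiedam', 'terwolde', 'wolstad']))
```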
"I need the smallest list of character tri-grams (preferably n-grams) so that every string is at least once hit by every tri-gram": how is 'ter' a solution if it is not there in haarlem? - mujjiga
identifying_grams(list3) outputs ['dam', 'wol']. - גלעד ברקן