Calculating a similarity "score" between multiple dictionaries

Question

4

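I have a reference dictionary, "dictA", which I need to compare against n dictionaries that are generated on the spot (computing the similarity between their keys and values). Every dictionary has the same length. For the sake of discussion, assume the n dictionaries to compare against are dictB, dictC and dictD.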

Here is what dictA looks like:

dictA={'1':"U", '2':"D", '3':"D", '4':"U", '5':"U",'6':"U"}

Here is what dictB, dictC and dictD look like:

dictB={'1':"U", '2':"U", '3':"D", '4':"D", '5':"U",'6':"D"}
dictC={'1':"U", '2':"U", '3':"U", '4':"D", '5':"U",'6':"D"}
dictD={'1':"D", '2':"U", '3':"U", '4':"U", '5':"D",'6':"D"}

I have a solution, but it only works for two dictionaries:

sharedValue = set(dictA.items()) & set(dictD.items())
dictLength = len(dictA)
scoreOfSimilarity = len(sharedValue)
similarity = scoreOfSimilarity/dictLength

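My question is: how can I iterate over n dictionaries, with dictA being the primary dictionary that I compare all the others against? The goal is to get a "similarity" value between the primary dictionary and each dictionary I iterate over.

Thanks for your help.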

- lechiffre

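Do these dictionaries live in some kind of list?
How do you want to compute the similarity score across multiple iterations (an average, for example)?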

- UltraInstinct

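Why not simply loop over a list of the dictionaries B through D? Are there specific performance or data-structure constraints you have to meet in solving this? - Rahul Murmuria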

1

Just so you know, in Python 3 dict.items() already supports & and the other set operators. It is not a list but a view of the dictionary's items, which behaves like a set. - juanpa.arrivillaga
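
A minimal sketch of what this comment describes, using the dictionaries from the question. The helper name `similarity_to_reference` is made up for illustration. The item views are intersected directly, without wrapping them in set(); this assumes every value is hashable, as the strings here are.

```python
dict_a = {'1': "U", '2': "D", '3': "D", '4': "U", '5': "U", '6': "U"}
dict_b = {'1': "U", '2': "U", '3': "D", '4': "D", '5': "U", '6': "D"}
dict_c = {'1': "U", '2': "U", '3': "U", '4': "D", '5': "U", '6': "D"}
dict_d = {'1': "D", '2': "U", '3': "U", '4': "U", '5': "D", '6': "D"}


def similarity_to_reference(reference, other):
    """Fraction of reference's (key, value) pairs that also appear in other.

    Hypothetical helper, not part of the original thread.
    """
    # dict.items() returns a set-like view, so & works on it directly
    shared_items = reference.items() & other.items()
    return len(shared_items) / len(reference)


# Compare the reference against each of the other dictionaries in turn
scores = [similarity_to_reference(dict_a, d) for d in (dict_b, dict_c, dict_d)]
print(scores)
```

The three scores correspond to 3, 2 and 1 shared items out of 6.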

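@SuperSaiyan - 1) Yes, the input will always be a list. The number of dictionaries can vary. Sometimes it may be 3, as in the example, and in other cases there may be 100 dictionaries to compare. 2) Not sure I understand correctly :/ - lechiffre

@RahulMurmuria - I am looking for something that executes faster, because I expect to have thousands of dictionaries in the future. Maybe dictionaries are not the best-performing choice. What would you suggest? - lechiffre

@lechiffre, I have posted an answer. Please pay attention to the variable naming conventions. The ones you are using come from Java; Python's naming conventions are somewhat different. - Rahul Murmuria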

4 Answers

1

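Given your problem setup, there appears to be no alternative to iterating over the input list of dictionaries. However, there is a multiprocessing trick that can be applied here.

Here is your input: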

dict_a = {'1': "U", '2': "D", '3': "D", '4': "U", '5': "U", '6': "U"}
dict_b = {'1': "U", '2': "U", '3': "D", '4': "D", '5': "U", '6': "D"}
dict_c = {'1': "U", '2': "U", '3': "U", '4': "D", '5': "U", '6': "D"}
dict_d = {'1': "D", '2': "U", '3': "U", '4': "U", '5': "D", '6': "D"}
other_dicts = [dict_b, dict_c, dict_d]

I have included @gary_fixler's map technique as similarity1. In addition there is a similarity2 function, which I will use for the loop technique.

def similarity1(a):
    def _(b):
        shared_value = set(a.items()) & set(b.items())
        dict_length = len(a)
        score_of_similarity = len(shared_value)
        return score_of_similarity / dict_length
    return _

def similarity2(c):
    a, b = c
    shared_value = set(a.items()) & set(b.items())
    dict_length = len(a)
    score_of_similarity = len(shared_value)
    return score_of_similarity / dict_length

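We evaluate 3 techniques here:
(1) @gary_fixler's map
(2) a simple loop over the list of dictionaries
(3) multiprocessing over the list of dictionaries

Here are the execution statements: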

import itertools
import multiprocessing

print(list(map(similarity1(dict_a), other_dicts)))
print([similarity2((dict_a, dict_v)) for dict_v in other_dicts])

# Use at least one worker: cpu_count()/2-1 is 0 on a machine with two logical CPUs
max_processes = max(1, multiprocessing.cpu_count() // 2 - 1)
pool = multiprocessing.Pool(processes=max_processes)
print([x for x in pool.map(similarity2, zip(itertools.repeat(dict_a), other_dicts))])

You will find that all three techniques produce the same result:

[0.5, 0.3333333333333333, 0.16666666666666666]
[0.5, 0.3333333333333333, 0.16666666666666666]
[0.5, 0.3333333333333333, 0.16666666666666666]

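Note that for multiprocessing you have multiprocessing.cpu_count()/2 cores, assuming each core is hyperthreaded so that cpu_count() reports two logical CPUs per core. Assuming nothing else is running on your system and your program has no I/O or synchronization needs (which is the case for this problem), you will usually get the best performance with multiprocessing.cpu_count()/2-1 processes, the -1 being for the parent process.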
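
The worker-count heuristic above can be sketched as a small function. The name `suggested_worker_count` is hypothetical, and the lower bound of 1 is an added assumption: the bare formula yields 0 on a machine with two logical CPUs, and multiprocessing.Pool rejects a process count below 1.

```python
import os


def suggested_worker_count(logical_cpus=None):
    """Half the logical CPUs, minus one for the parent, but never below 1.

    Hypothetical helper illustrating the heuristic; not a general tuning rule.
    """
    if logical_cpus is None:
        # os.cpu_count() can return None when the count is undetermined
        logical_cpus = os.cpu_count() or 1
    return max(1, logical_cpus // 2 - 1)


print(suggested_worker_count(8))  # 3
print(suggested_worker_count(2))  # 1 (the bare formula would give 0)
```

Whether this count is actually the fastest depends on the machine and the workload, so it is worth timing a few values.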

Now, let's time the three techniques:

import timeit

print(timeit.timeit("list(map(similarity1(dict_a), other_dicts))",
                    setup="from __main__ import similarity1, dict_a, other_dicts",
                    number=10000))

print(timeit.timeit("[similarity2((dict_a, dict_v)) for dict_v in other_dicts]",
                    setup="from __main__ import similarity2, dict_a, other_dicts",
                    number=10000))

# The timed statement uses itertools, so the setup has to import it as well
print(timeit.timeit("[x for x in pool.map(similarity2, zip(itertools.repeat(dict_a), other_dicts))]",
                    setup="import itertools; from __main__ import similarity2, dict_a, other_dicts, pool",
                    number=10000))

On my laptop this produced the following results (in seconds):

0.07092539698351175
0.06757041101809591
1.6528456939850003

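You can see that the basic loop technique performs best. Because of the overhead of creating processes and passing data between them, multiprocessing is clearly worse than the other two techniques. That does not mean multiprocessing is useless here. On the contrary, look at the results for a larger number of input dictionaries: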

for _ in range(7):
    other_dicts.extend(other_dicts)

This doubles the list seven times, growing it to 3 × 2^7 = 384 dictionaries. Here are the timing results for this input:

7.934810006991029
8.184540337068029
7.466550623998046

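For larger sets of input dictionaries, the multiprocessing technique becomes the fastest of the three, although in this measurement only by a small margin (about 7.47 s against 7.93 s and 8.18 s).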

- Rahul Murmuria

0

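If you put your solution in a function, you can call it by name for any two dictionaries. Furthermore, if you curry the function by splitting its arguments across nested functions, you can partially apply the first dictionary to get back a function that needs only the second (or you could use functools.partial), which makes mapping easy: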

def similarity (a):
    def _ (b):
        sharedValue = set(a.items()) & set(b.items())
        dictLength = len(a)
        scoreOfSimilarity = len(sharedValue)
        return scoreOfSimilarity/dictLength
    return _

As an aside, the above can also be written as a single expression using nested lambdas:

similarity = lambda a: lambda b: len(set(a.items()) & set(b.items())) / len(a)

Now you can get the similarity between dictA and each of the rest with a single map:

otherDicts = [dictB, dictC, dictD]
scores = map(similarity(dictA), otherDicts)

Now you can use min() (or max(), or whatever) to pick the result you want from the scores:

winner = min(scores)

Warning: I have not tested any of the above.
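
A minimal sketch of the functools.partial variant mentioned above, assuming a plain two-argument `similarity` function rather than the curried one. In Python 3, map returns a lazy iterator that can be consumed only once, so the scores are collected into a list before min() and max() are applied. The names `score_against_a` and `best_index` are made up for illustration.

```python
from functools import partial


def similarity(a, b):
    """Fraction of a's (key, value) pairs that also appear in b."""
    return len(set(a.items()) & set(b.items())) / len(a)


dict_a = {'1': "U", '2': "D", '3': "D", '4': "U", '5': "U", '6': "U"}
other_dicts = [
    {'1': "U", '2': "U", '3': "D", '4': "D", '5': "U", '6': "D"},
    {'1': "U", '2': "U", '3': "U", '4': "D", '5': "U", '6': "D"},
    {'1': "D", '2': "U", '3': "U", '4': "U", '5': "D", '6': "D"},
]

# Fix the first argument so the result takes only the second dictionary
score_against_a = partial(similarity, dict_a)

# Materialize the iterator so the scores can be inspected more than once
scores = list(map(score_against_a, other_dicts))

# Position of the most similar dictionary in other_dicts
best_index = max(range(len(scores)), key=scores.__getitem__)
print(scores, min(scores), max(scores), best_index)
```

Here the highest score is the most similar dictionary, so max() is the one to use when looking for the closest match.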

- Gary Fixler

Please do not use "_" as a function name, even for an inner function. https://dev59.com/AW025IYBdhLWcg3wpHh- - lejlot

0

Thanks to everyone who took part in answering the question. Here is the result I needed:

def compareTwoDictionaries(self, absolute, reference, listOfDictionaries):
    #look only for absolute fit, yes or no
    if (absolute == True):
        similarity = reference == listOfDictionaries
    else:
        #return items that are the same between two dictionaries
        shared_items = set(reference.items()) & set(listOfDictionaries.items())
        #return the length of the dictionary for further calculation of %
        dictLength = len(reference)
        #return the length of shared_items for further calculation of %
        scoreOfSimilarity = len(shared_items)
        #return final score: similarity
        similarity = scoreOfSimilarity/dictLength
    return similarity

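Here is the call to the function:

# "candidate" avoids shadowing the built-in dict
for candidate in victim_list:
    output = oandaConnectorCalls.compareTwoDictionaries(False, reference, candidate)

The "reference" dictionary and the "victim_list" dictionaries are used as described above.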

- lechiffre

- Prune · Accepted Answer

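Here is a general structure, assuming that you can generate the dictionaries individually, using each one before generating the next. This sounds like what you might want. calculate_similarity would be a function containing your "I have a solution" code above.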

reference = {'1':"U", '2':"D", '3':"D", '4':"U", '5':"U",'6':"U"}
while True:
    on_the_spot = generate_dictionary()
    if on_the_spot is None:
        break
    calculate_similarity(reference, on_the_spot)

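If you need to iterate over dictionaries that have already been generated, you have to store them in an iterable Python structure. As you generate them, build a list of dictionaries:

victim_list = [
    {'1':"U", '2':"U", '3':"D", '4':"D", '5':"U",'6':"D"},
    {'1':"U", '2':"U", '3':"U", '4':"D", '5':"U",'6':"D"},
    {'1':"D", '2':"U", '3':"U", '4':"U", '5':"D",'6':"D"}
]
for on_the_spot in victim_list:
    # Proceed as above
    calculate_similarity(reference, on_the_spot)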

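Are you familiar with Python's generator construct? It is like a function that returns its values with yield rather than return. If so, use one in place of the list above.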
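
A minimal sketch of the generator approach, under stated assumptions: `generate_dictionaries` is a hypothetical stand-in for whatever produces the dictionaries on the spot (here it just yields the three from the question), and `calculate_similarity` wraps the asker's two-dictionary code.

```python
def generate_dictionaries():
    """Hypothetical source of dictionaries; yields them one at a time."""
    yield {'1': "U", '2': "U", '3': "D", '4': "D", '5': "U", '6': "D"}
    yield {'1': "U", '2': "U", '3': "U", '4': "D", '5': "U", '6': "D"}
    yield {'1': "D", '2': "U", '3': "U", '4': "U", '5': "D", '6': "D"}


def calculate_similarity(reference, other):
    """Fraction of reference's (key, value) pairs that also appear in other."""
    shared_items = set(reference.items()) & set(other.items())
    return len(shared_items) / len(reference)


reference = {'1': "U", '2': "D", '3': "D", '4': "U", '5': "U", '6': "U"}

# Each dictionary is produced, scored and then discarded, so the full set
# never has to be held in memory at once
scores = [calculate_similarity(reference, on_the_spot)
          for on_the_spot in generate_dictionaries()]
print(scores)
```

With thousands of dictionaries, this keeps only one candidate in memory at a time, though the list of scores still grows with the input.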