Calculating the similarity between lists of words

5
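I want to calculate the similarity between two lists of words. For example, this list:
['email', 'user', 'this', 'email', 'address', 'customer']
is similar to this list:
['email', 'mail', 'address', 'netmail']
I want it to get a higher similarity percentage than another list, for example ['address','ip','network'], even though 'address' is present in that list.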

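What is your expected output for this? - DirtyBit
I am not able to do this at the moment. - Youness Drissi Slimani
Have you looked into cosine similarity? - DirtyBit
For example, if two words match exactly, one word matches at roughly 80-90%, and the rest do not match, what should the output be? - DirtyBit
You could also use cosine similarity to compare the words. Again, what should the output be in your case, and why? - DirtyBit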
3 Answers

13

Since you have not really been able to show a clear expected output, here is my best attempt:

list_A = ['email','user','this','email','address','customer']
list_B = ['email','mail','address','netmail']
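For the two lists above, we find the cosine similarity between each element of one list and every element of the other, i.e. `email` from list_B against every element of list_A: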
from collections import Counter
from math import sqrt

def word2vec(word):
    # count the characters in the word
    cw = Counter(word)
    # precompute the set of distinct characters
    sw = set(cw)
    # precompute the "length" (Euclidean norm) of the word vector
    lw = sqrt(sum(c*c for c in cw.values()))

    # return a tuple
    return cw, sw, lw

def cosdis(v1, v2):
    # which characters are common to the two words?
    common = v1[1].intersection(v2[1])
    # by definition of cosine similarity we have
    return sum(v1[0][ch]*v2[0][ch] for ch in common)/v1[2]/v2[2]


list_A = ['email','user','this','email','address','customer']
list_B = ['email','mail','address','netmail']

threshold = 0.80     # if needed
for key in list_A:
    for word in list_B:
        try:
            # print(key)
            # print(word)
            res = cosdis(word2vec(word), word2vec(key))
            # print(res)
            print("The cosine similarity between : {} and : {} is: {}".format(word, key, res*100))
            # if res > threshold:
            #     print("Found a word with cosine distance > 80 : {} with original word: {}".format(word, key))
        except IndexError:
            pass

Output:

The cosine similarity between : email and : email is: 100.0
The cosine similarity between : mail and : email is: 89.44271909999159
The cosine similarity between : address and : email is: 26.967994498529684
The cosine similarity between : netmail and : email is: 84.51542547285166
The cosine similarity between : email and : user is: 22.360679774997898
The cosine similarity between : mail and : user is: 0.0
The cosine similarity between : address and : user is: 60.30226891555272
The cosine similarity between : netmail and : user is: 18.89822365046136
The cosine similarity between : email and : this is: 22.360679774997898
The cosine similarity between : mail and : this is: 25.0
The cosine similarity between : address and : this is: 30.15113445777636
The cosine similarity between : netmail and : this is: 37.79644730092272
The cosine similarity between : email and : email is: 100.0
The cosine similarity between : mail and : email is: 89.44271909999159
The cosine similarity between : address and : email is: 26.967994498529684
The cosine similarity between : netmail and : email is: 84.51542547285166
The cosine similarity between : email and : address is: 26.967994498529684
The cosine similarity between : mail and : address is: 15.07556722888818
The cosine similarity between : address and : address is: 100.0
The cosine similarity between : netmail and : address is: 22.79211529192759
The cosine similarity between : email and : customer is: 31.62277660168379
The cosine similarity between : mail and : customer is: 17.677669529663685
The cosine similarity between : address and : customer is: 42.640143271122085
The cosine similarity between : netmail and : customer is: 40.08918628686365
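Note: I have also left the threshold part commented out in the code, in case you only want the words whose similarity exceeds a certain threshold (e.g. 80%).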
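The commented-out threshold filter could look something like the sketch below. It repeats the character-count helpers so that it runs on its own. The function name `similar_pairs` and the de-duplication of both lists are assumptions made for illustration, not part of the original answer.

```python
from collections import Counter
from math import sqrt

def word2vec(word):
    # character counts, distinct characters, and vector norm of the word
    cw = Counter(word)
    return cw, set(cw), sqrt(sum(c * c for c in cw.values()))

def cosdis(v1, v2):
    # cosine similarity between two character-count vectors
    common = v1[1] & v2[1]
    return sum(v1[0][ch] * v2[0][ch] for ch in common) / v1[2] / v2[2]

def similar_pairs(list_a, list_b, threshold=0.80):
    # hypothetical helper: unique (key, word) pairs scoring above the threshold
    pairs = set()
    for key in set(list_a):
        for word in set(list_b):
            if cosdis(word2vec(key), word2vec(word)) > threshold:
                pairs.add((key, word))
    return sorted(pairs)

list_A = ['email', 'user', 'this', 'email', 'address', 'customer']
list_B = ['email', 'mail', 'address', 'netmail']
print(similar_pairs(list_A, list_B))
```

Note that an empty string would cause a division by zero here, so real input may need a guard.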
EDIT:
OP: But what I want is to compare the lists as a whole, not word by word.
Using Counter and math:
from collections import Counter
import math

counterA = Counter(list_A)
counterB = Counter(list_B)


def counter_cosine_similarity(c1, c2):
    terms = set(c1).union(c2)
    dotprod = sum(c1.get(k, 0) * c2.get(k, 0) for k in terms)
    magA = math.sqrt(sum(c1.get(k, 0)**2 for k in terms))
    magB = math.sqrt(sum(c2.get(k, 0)**2 for k in terms))
    return dotprod / (magA * magB)

print(counter_cosine_similarity(counterA, counterB) * 100)

Output:

53.03300858899106
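To see how this behaves on the question's third list, the same function can be applied to ['address','ip','network']. The sketch below restates `counter_cosine_similarity` so that it is self-contained; `list_C`, `sim_BA` and `sim_BC` are names chosen here for illustration.

```python
from collections import Counter
import math

def counter_cosine_similarity(c1, c2):
    # cosine similarity between two word-count vectors
    terms = set(c1).union(c2)
    dotprod = sum(c1.get(k, 0) * c2.get(k, 0) for k in terms)
    magA = math.sqrt(sum(c1.get(k, 0) ** 2 for k in terms))
    magB = math.sqrt(sum(c2.get(k, 0) ** 2 for k in terms))
    return dotprod / (magA * magB)

list_A = ['email', 'user', 'this', 'email', 'address', 'customer']
list_B = ['email', 'mail', 'address', 'netmail']
list_C = ['address', 'ip', 'network']

sim_BA = counter_cosine_similarity(Counter(list_B), Counter(list_A))
sim_BC = counter_cosine_similarity(Counter(list_B), Counter(list_C))
print(sim_BA * 100, sim_BC * 100)
```

list_B scores higher against list_A than against list_C, which is the ordering the question asks for. Both scores stay well below 90%, though, because only exact word matches contribute.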

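Thanks for your solution, but what I want is not a word-by-word comparison but a comparison between lists: ['email','mail','address','netmail'] should be more similar to ['email','user','this','email','address','customer'] (a very high percentage; the output should be 90% or more, because most of the words in the first list also appear in the second). On the other hand, ['email','mail','address','netmail'] compared with ['address','ip','network'] should give a low percentage (relative to the other list), even though 'address' is present in the second list. - Youness Drissi Slimani
@YounessDrissiSlimani, by "high match" do you mean counting only the words that match 100%? If so, we can count how many words match exactly across the two lists and give an estimated percentage. - DirtyBit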
1
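@YounessDrissiSlimani Great. If this answer helped, you can accept it: https://meta.stackexchange.com/questions/5234/how-does-accepting-an-answer-work. Cheers! - DirtyBit
Hi @DirtyBit, I have a question. I am trying to compare vocab=['address','ip'] with two lists, list_1="identifiant adresse ip address fixe horadatee cookie mac".split() and list_2="address ville".split(). The scores do not look quite right to me. What I want is for the cosine similarity between list_1 and vocab to be the higher one (100%), because every item in vocab equals some item in list_1. - Youness Drissi Slimani
@YounessDrissiSlimani I can't help you much this way. Please ask a new question and describe in detail what you already have and what you have tried. - DirtyBit
Is there a solution for getting a similarity score between lists of words? - Youness Drissi Slimani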
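One way to get a list-level score that also rewards near matches such as 'mail' and 'netmail', as requested in the comments above, is to take each word of one list, find its best character-level cosine match in the other list, and average those best scores. This is a sketch of a different technique from the list-level cosine in the answer, and `char_cosine` and `soft_list_similarity` are hypothetical names. The score is asymmetric: it measures how well the first list is covered by the second.

```python
from collections import Counter
from math import sqrt

def char_cosine(w1, w2):
    # cosine similarity between the character-count vectors of two words
    c1, c2 = Counter(w1), Counter(w2)
    dot = sum(c1[ch] * c2[ch] for ch in c1.keys() & c2.keys())
    norm = sqrt(sum(v * v for v in c1.values())) * sqrt(sum(v * v for v in c2.values()))
    return dot / norm if norm else 0.0

def soft_list_similarity(query, reference):
    # hypothetical helper: average, over the words in `query`,
    # of the best character-level match found in `reference`
    if not query or not reference:
        return 0.0
    return sum(max(char_cosine(q, r) for r in reference) for q in query) / len(query)

list_A = ['email', 'user', 'this', 'email', 'address', 'customer']
list_B = ['email', 'mail', 'address', 'netmail']
list_C = ['address', 'ip', 'network']

print(soft_list_similarity(list_B, list_A) * 100)
print(soft_list_similarity(list_B, list_C) * 100)
```

With these lists the first score lands above 90% and the second well below it. Keep in mind that character counts ignore letter order, so anagrams are treated as identical words.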

5

You can use Scikit-Learn (or another NLP library) for this. The example below uses CountVectorizer, but for more sophisticated document analysis a TF-IDF vectorizer is preferable.

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def vect_cos(vect, test_list):
    """ Vectorise text and compute the cosine similarity """
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    query_0 = vect.transform([' '.join(vect.get_feature_names_out())])
    query_1 = vect.transform(test_list)
    # cosine_similarity accepts sparse matrices directly, so no dense conversion is needed
    cos_sim = cosine_similarity(query_0, query_1)
    return query_1, np.round(cos_sim.squeeze(), 3)

# Train the vectorizer
vocab=['email','user','this','email','address','customer']
vectoriser = CountVectorizer().fit(vocab)
vectoriser.vocabulary_ # show the word-matrix position pairs

# Analyse  list_1
list_1 = ['email','mail','address','netmail']
list_1_vect, list_1_cos = vect_cos(vectoriser, [' '.join(list_1)])

# Analyse list_2
list_2 = ['address','ip','network']
list_2_vect, list_2_cos = vect_cos(vectoriser, [' '.join(list_2)])

print('\nThe cosine similarity for the first list is {}.'.format(list_1_cos))
print('\nThe cosine similarity for the second list is {}.'.format(list_2_cos))

Output

The cosine similarity for the first list is 0.632.

The cosine similarity for the second list is 0.447.

Edit

If you want the cosine similarity between 'email' and any other list of strings, train the vectorizer with 'email' and then analyse the other documents.

# Train the vectorizer
vocab=['email']
vectoriser = CountVectorizer().fit(vocab)

# Analyse  list_1
list_1 =['email','mail','address','netmail']
list_1_vect, list_1_cos = vect_cos(vectoriser, [' '.join(list_1)])
print('\nThe cosine similarity for the first list is {}.'.format(list_1_cos))

Output

The cosine similarity for the first list is 1.0.

Nice solution, but if I train on ['email','mail','address','netmail'] and analyse ['email'], the output is 0.5. For me the correct answer should be 0.99 or 1.0, because 'email' carries a very high weight. - Youness Drissi Slimani
1
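You have clearly misunderstood how the code works. You need to train the vectorizer with the vocabulary ['email'] and then use the vectorizer to analyse ['email','mail','address','netmail'] to get a cosine similarity of 1. Please see the updated code. - KRKirov
Hi @KRKirov, can I train on vocab=['address'] and analyse list_1 = "ip address fixe mac cookie".split() and list_2 = "code postal address city ville".split()? I want list_2 to get a higher similarity percentage than list_1. How can I do that? I tried duplicating the word address in list_2 to get a higher score, but it did not work. - Youness Drissi Slimani
When you train the vectorizer with a single 'word', analysing any subsequent list of strings reduces to a binary outcome: "the list contains the word" gives 1, and "the list does not contain the word" gives 0. So no matter how long the list is, the cosine similarity between the list and the 'word' is 1 whenever the word is present. In essence, you are projecting the list onto the 'word' vector, and if the word is in the list you are comparing the 'word' with itself. - KRKirov
On the other hand, if you train your vectorizer with list_1 and then analyse list_1 and list_2, the cosine similarities are 1 and 0.45 respectively. - KRKirov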
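The behaviour described in these comments can be reproduced without scikit-learn. The sketch below mimics what `vect_cos` computes under simplifying assumptions: words are taken as given, with none of the lowercasing or token filtering that `CountVectorizer` applies. `vocab_cosine` is a hypothetical name.

```python
from collections import Counter
from math import sqrt

def vocab_cosine(vocab, words):
    # hypothetical helper: cosine between an all-ones vector over the
    # distinct vocabulary terms and the counts of those terms in `words`
    terms = sorted(set(vocab))
    counts = Counter(words)
    vec = [counts[t] for t in terms]   # words outside the vocabulary are ignored
    dot = sum(vec)                     # dot product with the all-ones query
    norm = sqrt(len(terms)) * sqrt(sum(v * v for v in vec))
    return dot / norm if norm else 0.0

list_1 = "ip address fixe mac cookie".split()
list_2 = "code postal address city ville".split()

# Single-word vocabulary: the score only reflects presence or absence
print(vocab_cosine(['address'], list_1))
print(vocab_cosine(['address'], list_2 + ['address']))  # duplicating the word changes nothing
# Vocabulary taken from list_1: list_2 now scores lower than list_1
print(vocab_cosine(list_1, list_1), vocab_cosine(list_1, list_2))
```

With a one-term vocabulary every list collapses to a one-dimensional vector, and any non-zero one-dimensional vector has cosine 1 with the query, which is why repeating 'address' has no effect.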

0
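I am offering this answer because the question's title may attract people looking to solve a related but different problem. If you only care about whether words are present or absent, one approach is Jaccard similarity. It is available in many toolkits, but it is also very easy to compute directly in Python.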
def jaccard(list1, list2):
    s1 = set(list1)
    s2 = set(list2)
    u = s1 | s2
    if u:
        return float(len(s1 & s2))/float(len(u))
    else:
        return 0.0

list_A = ['email','user','this','email','address','customer']
list_B = ['email','mail','address','netmail']
list_C = ['address','ip','network']
jaccard(list_A, list_B)
jaccard(list_A, list_C)

Output

0.2857142857142857
0.14285714285714285

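If both sets are empty, Jaccard similarity has no well-defined result, so this check treats them as not similar, although you could equally argue that they are completely similar (1.0). You can decide whether to convert these values (0-1) to percentages when printing.
Neither result comes close to 80%, but that is because this approach only matches words exactly rather than looking for "near matches" such as "email", "mail" and "netmail". For that you would need a tool like nltk, e.g. nltk.corpus.reader.wordnet. It is also insensitive to the fact that 'email' appears twice in 'list_A', but it is not clear from the question how that should be handled: when a word appears twice in A and only once in B, should that increase the similarity (because there are multiple matching pairs) or decrease it (because you want word frequencies to be similar across the two sets)?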
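If repeated words should count, one option is a multiset (weighted) variant of Jaccard built on `collections.Counter`, where the intersection takes the minimum count of each word and the union takes the maximum. This is only one possible reading of the open question above, the one under which a frequency mismatch lowers the score. `multiset_jaccard` is a name chosen here for illustration.

```python
from collections import Counter

def multiset_jaccard(list1, list2):
    # Counter & Counter keeps per-word minimum counts,
    # Counter | Counter keeps per-word maximum counts
    c1, c2 = Counter(list1), Counter(list2)
    inter = sum((c1 & c2).values())
    union = sum((c1 | c2).values())
    # same convention as above: two empty lists are treated as not similar
    return inter / union if union else 0.0

list_A = ['email', 'user', 'this', 'email', 'address', 'customer']
list_B = ['email', 'mail', 'address', 'netmail']
list_C = ['address', 'ip', 'network']
print(multiset_jaccard(list_A, list_B))
print(multiset_jaccard(list_A, list_C))
```

With these lists the values are 0.25 and 0.125, slightly lower than the set-based results because the second 'email' in list_A has no partner in the other list.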
