How to compute the cosine similarity of two lists of strings with sklearn?

Question

I have two lists of strings, like this:

a_file = ['a', 'b', 'c']
b_file = ['b', 'x', 'y', 'z']

I want to compute the cosine similarity of these two lists, and I know how to implement it by hand:

from collections import Counter

# count word occurrences
a_vals = Counter(a_file)
b_vals = Counter(b_file)

# convert to word-vectors
words  = list(a_vals.keys() | b_vals.keys())
a_vect = [a_vals.get(word, 0) for word in words]       
b_vect = [b_vals.get(word, 0) for word in words]        

# find cosine
len_a  = sum(av*av for av in a_vect) ** 0.5             
len_b  = sum(bv*bv for bv in b_vect) ** 0.5             
dot    = sum(av*bv for av,bv in zip(a_vect, b_vect))   
cosine = dot / (len_a * len_b) 

print(cosine)

However, when I try to use cosine_similarity from sklearn, it fails with: could not convert string to float: 'a'. How can I fix this?

from sklearn.metrics.pairwise import cosine_similarity

a_file = ['a', 'b', 'c']
b_file = ['b', 'x', 'y', 'z']
print(cosine_similarity(a_file, b_file))

- 4daJKong

How do you want to define cosine similarity for characters? - BrokenBenchmark

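Always put the full error message (starting at the word "Traceback") in the question as text (not a screenshot, not a link to an external portal). It contains other useful information. - furas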

Maybe first check the documentation to see whether it calculates what you need. It seems to need numbers, not strings/chars. - furas

This works for me: cosine_similarity([a_vect], [b_vect]). First: it needs word vectors. Second: it needs two-dimensional data, like a DataFrame with many rows. - furas

1 Answer

- furas · Accepted Answer

It seems it needs:

word vectors,
two-dimensional data (a list containing one or more word vectors)

print(cosine_similarity( [a_vect], [b_vect] ))
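To illustrate the two-dimensional requirement, here is a minimal sketch using NumPy arrays. The vocabulary order a, b, c, x, y, z is an assumption made for the example; the order does not change the cosine value.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Count vectors over an assumed vocabulary order: a, b, c, x, y, z
a_vect = np.array([1, 1, 1, 0, 0, 0])
b_vect = np.array([0, 1, 0, 1, 1, 1])

# reshape(1, -1) turns a 1-D vector into a 2-D array with a single row,
# which matches the (n_samples, n_features) layout the function expects
sim = cosine_similarity(a_vect.reshape(1, -1), b_vect.reshape(1, -1))

print(sim.shape)
print(sim[0, 0])
```

Wrapping a plain list in another list, as in [a_vect], has the same effect as the reshape.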

Full working code:

from collections import Counter
from sklearn.metrics.pairwise import cosine_similarity

a_file = ['a', 'b', 'c']
b_file = ['b', 'x', 'y', 'z']

# count word occurrences
a_vals = Counter(a_file)
b_vals = Counter(b_file)

# convert to word-vectors
words  = list(a_vals.keys() | b_vals.keys())
a_vect = [a_vals.get(word, 0) for word in words]       
b_vect = [b_vals.get(word, 0) for word in words]        

# find cosine
len_a  = sum(av*av for av in a_vect) ** 0.5             
len_b  = sum(bv*bv for bv in b_vect) ** 0.5             
dot    = sum(av*bv for av,bv in zip(a_vect, b_vect))   
cosine = dot / (len_a * len_b) 

print(cosine)
print(cosine_similarity([a_vect], [b_vect]))

Result:

0.2886751345948129
[[0.28867513]]
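As an alternative to building the count vectors by hand, sklearn's CountVectorizer can probably do that step. This is a sketch and not part of the original answer; it assumes each list is already tokenized, so the analyzer is a pass-through function.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

a_file = ['a', 'b', 'c']
b_file = ['b', 'x', 'y', 'z']

# The inputs are already lists of tokens, so the analyzer returns them unchanged
vectorizer = CountVectorizer(analyzer=lambda tokens: tokens)

# One row per input list, one column per distinct token (sparse matrix)
counts = vectorizer.fit_transform([a_file, b_file])

# cosine_similarity accepts sparse input and compares every row with every row
similarity = cosine_similarity(counts)
print(similarity)
```

The off-diagonal entries of the printed matrix should match the hand-computed cosine above.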

EDIT:

You can also pass a single list with all the data (so the second argument will be None), and it will compare all pairs (a,a), (a,b), (b,a), (b,b).

print(cosine_similarity( [a_vect, b_vect] ))

Result:

[[1.         0.28867513]
 [0.28867513 1.        ]]

You can use a longer list [a, b, c, ...] and it will check all possible pairs.
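A sketch of the longer-list case might look like this. The helper to_vectors and the third list c_file are hypothetical names introduced only for illustration.

```python
from collections import Counter
from sklearn.metrics.pairwise import cosine_similarity

def to_vectors(token_lists):
    """Hypothetical helper: turn token lists into count vectors over a shared vocabulary."""
    counters = [Counter(tokens) for tokens in token_lists]
    vocabulary = sorted(set().union(*counters))  # every distinct token, in a fixed order
    return [[counter.get(word, 0) for word in vocabulary] for counter in counters]

a_file = ['a', 'b', 'c']
b_file = ['b', 'x', 'y', 'z']
c_file = ['a', 'b', 'c', 'x']  # made-up third list for the example

vectors = to_vectors([a_file, b_file, c_file])

# One argument only: every row is compared with every row, giving a 3x3 matrix
matrix = cosine_similarity(vectors)
print(matrix)
```

Entry matrix[i][j] is the similarity between list i and list j, so the matrix is symmetric with ones on the diagonal.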

Doc: cosine_similarity