Python: Jaccard distance using word intersection instead of character intersection

Question

10

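I didn't realize that Python's set function actually splits a string into individual characters. I wrote a Python function for Jaccard and used Python's intersection method. Before passing the two sets to my jaccard function, I called set on each string.

For example: say I have the string NEW Fujifilm 16MP 5x Optical Zoom Point and Shoot CAMERA 2 7 screen.jpg. I would call set("NEW Fujifilm 16MP 5x Optical Zoom Point and Shoot CAMERA 2 7 screen.jpg"), which breaks the string into characters. So when I pass the result to the jaccard function, the intersection is actually a character intersection, not a word-to-word intersection. How can I do a word-to-word intersection?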

#implementing jaccard
def jaccard(a, b):
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))

If I don't call set on the string NEW Fujifilm 16MP 5x Optical Zoom Point and Shoot CAMERA 2 7 screen.jpg, I get the following error:

    c = a.intersection(b)
AttributeError: 'str' object has no attribute 'intersection'

I want to do a word-level intersection and get the Jaccard similarity, not a character-level intersection.

- add-semi-colons

4 Answers

8

My function for computing the Jaccard measure (despite the name, it returns the similarity, not the distance):

def DistJaccard(str1, str2):
    str1 = set(str1.split())
    str2 = set(str2.split())
    return float(len(str1 & str2)) / len(str1 | str2)

>>> DistJaccard("hola amigo", "chao amigo")
0.333333333333
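
If an actual distance is needed, it is commonly taken as one minus the similarity. The following is a minimal sketch of that, not a definitive implementation: the names jaccard_similarity and jaccard_distance are hypothetical, and treating two strings with no words as identical (similarity 1.0) is an assumption made here to avoid dividing by zero.

```python
def jaccard_similarity(text_a, text_b):
    """Word-level Jaccard similarity of two whitespace-separated strings."""
    words_a = set(text_a.split())
    words_b = set(text_b.split())
    union = words_a | words_b
    if not union:
        # Assumption: two strings with no words are treated as identical.
        return 1.0
    # float() keeps this a true division on Python 2 as well.
    return float(len(words_a & words_b)) / len(union)


def jaccard_distance(text_a, text_b):
    """Jaccard distance, defined here as 1 - similarity."""
    return 1.0 - jaccard_similarity(text_a, text_b)
```

With the inputs used above, "hola amigo" and "chao amigo" share one word out of three distinct words, so the similarity is 1/3 and the distance is 2/3.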

- JBrain

3

This behavior is not specific to sets:

>>> list('NEW Fujifilm')
['N', 'E', 'W', ' ', 'F', 'u', 'j', 'i', 'f', 'i', 'l', 'm']

What happens here is that the string is treated as an iterable sequence and processed character by character.

The same thing happens with set:

>>> set('string')
set(['g', 'i', 'n', 's', 'r', 't'])

To fix this, use .add() on an existing set, since .add() takes a single element and does not iterate over its argument:

>>> se=set()
>>> se.add('NEW Fujifilm 16MP 5x Optical Zoom Point and Shoot CAMERA 2 7 screen.jpg')
>>> se
set(['NEW Fujifilm 16MP 5x Optical Zoom Point and Shoot CAMERA 2 7 screen.jpg'])

Or use split(), a tuple, a list, or some other iterable, so that the string itself is not the iterable being consumed:

>>> set('something'.split())
set(['something'])
>>> set(('something',))
set(['something'])
>>> set(['something'])
set(['something'])

Add more elements word by word based on your string:

>>> se=set(('Something',)) | set('NEW Fujifilm 16MP 5x Optical Zoom Point and Shoot CAMERA 2 7 screen.jpg'.split())

Or, if you need some logic to decide what gets added to the set, use a set comprehension:

>>> se={w for w in 'NEW Fujifilm 16MP 5x Optical Zoom Point and Shoot CAMERA 2 7 screen.jpg'.split() 
         if len(w)>3}
>>> se
set(['Shoot', 'CAMERA', 'Point', 'screen.jpg', 'Zoom', 'Fujifilm', '16MP', 'Optical'])

Now it works the way you expect:

>>> 'Zoom' in se
True
>>> s1=set('NEW Fujifilm 16MP 5x Optical Zoom Point and Shoot CAMERA 2 7 screen.jpg'.split())
>>> s2=set('Fujifilm Optical Zoom CAMERA NONE'.split())
>>> s1.intersection(s2)
set(['Optical', 'CAMERA', 'Zoom', 'Fujifilm'])

- dawg

1

By "word-to-word intersection" I think what the OP really wants is set(a.split()).intersection(b.split()) (ignoring case and punctuation details). - DSM
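
The case and punctuation details mentioned in this comment could be handled by normalizing each string before building the set. What follows is only a sketch under stated assumptions: normalized_words is a hypothetical helper, and the choice to lowercase everything and keep only runs of word characters (via the standard re module) is one possible normalization, not something the thread prescribes.

```python
import re


def normalized_words(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    # \w+ matches runs of letters, digits and underscores, so punctuation
    # such as '.' or ',' acts as a separator and is dropped.
    return set(re.findall(r"\w+", text.lower()))


a = "NEW Fujifilm 16MP 5x Optical Zoom, Point and Shoot CAMERA"
b = "fujifilm optical zoom camera"
common = normalized_words(a).intersection(normalized_words(b))
```

Here common contains the four lowercase words fujifilm, optical, zoom and camera, which a plain split() would have missed because of the differing case and the trailing comma. One side effect of this particular pattern is that a token like screen.jpg is split into screen and jpg.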

2

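Here is the code I wrote based on set functions -

def jaccard(a,b):
    # split both strings into lists of words
    a=a.split()
    b=b.split()
    # union: every distinct word from either string
    union = list(set(a+b))
    # intersection: words of a that are also in b
    intersection = list(set(a) - (set(a)-set(b)))
    print("Union - %s" % union)
    print("Intersection - %s" % intersection)
    jaccard_coeff = float(len(intersection))/len(union)
    print("Jaccard Coefficient is = %f " % jaccard_coeff)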

- medakeshav

- Amber · Accepted Answer

Try splitting your string into words first:

word_set = set(your_string.split())

Example:

>>> word_set = set("NEW Fujifilm 16MP 5x".split())
>>> character_set = set("NEW Fujifilm 16MP 5x")
>>> word_set
set(['NEW', '16MP', '5x', 'Fujifilm'])
>>> character_set
set([' ', 'f', 'E', 'F', 'i', 'M', 'j', 'm', 'l', 'N', '1', 'P', 'u', 'x', 'W', '6', '5'])
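
Putting this together with the jaccard function from the question, the only change needed is in how the sets are built before the call. The sketch below illustrates that; the variable names title_a, title_b, word_score and char_score are hypothetical, and the second title is borrowed from another answer in this thread purely as sample input. Like the original function, it would divide by zero if both sets were empty.

```python
def jaccard(a, b):
    # a and b are sets; same formula as in the question
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))


title_a = "NEW Fujifilm 16MP 5x Optical Zoom Point and Shoot CAMERA 2 7 screen.jpg"
title_b = "Fujifilm Optical Zoom CAMERA NONE"

# Word-level: split on whitespace first, then build the sets.
word_score = jaccard(set(title_a.split()), set(title_b.split()))

# Character-level: what set() on the raw strings computes instead.
char_score = jaccard(set(title_a), set(title_b))
```

The two titles share 4 words out of 14 distinct words, so word_score is 4/14, roughly 0.286. char_score is computed over individual characters and is therefore a different quantity.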