Scipy negative distance? What does it mean?

Question

10

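I have an input file containing floating-point numbers with four decimal places:

i.e. 13359    0.0000    0.0000    0.0001    0.0001    0.0002    0.0003    0.0007    ...

(The first field is the ID.)

My class uses the loadVectorsFromFile method, which multiplies each value by 10000 and then truncates it with int(). On top of that, I loop through each vector to make sure it contains no negative values. However, when I run _hclustering, I keep seeing the error "Linkage contains negative values".

I seriously think this is a bug, because:

I checked my values,
the values are nowhere near the limits of floating-point numbers,
the formula I used to derive the values in the file uses absolute values (my input is definitely right).

Can someone tell me why I am seeing this weird error? What is causing this negative distance error?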

=====

def loadVectorsFromFile(self, limit, loc, assertAllPositive=True, inflate=True):
    """Inflate to prevent "negative" distance, we use 4 decimal points, so *10000
    """
    vectors = {}
    self.winfo("Each vector is set to have %d limit in length" % limit)
    with open( loc ) as inf:
        for line in filter(None, inf.read().split('\n')):
            l = line.split('\t')
            if limit:
                scores = map(float, l[1:limit+1])
            else:
                scores = map(float, l[1:])

            if inflate:        
                vectors[ l[0]] = map( lambda x: int(x*10000), scores)     #int might save space
            else:
                vectors[ l[0]] = scores                           

    if assertAllPositive:
        #Assert that it has no negative value
        for dirID, l in vectors.iteritems():
            if reduce(operator.or_, map( lambda x: x < 0, l)):
                self.werror( "Vector %s has negative values!" % dirID)
    return vectors

def main( self, inputDir, outputDir, limit=0,
        inFname="data.vectors.all", mappingFname='all.id.features.group.intermediate'):
    """
    Loads vector from a file and start clustering
    INPUT
        vectors is { featureID: tfidfVector (list), }
    """
    IDFeatureDic = loadIdFeatureGroupDicFromIntermediate( pjoin(self.configDir, mappingFname))
    if not os.path.exists(outputDir):
        os.makedirs(outputDir)

    vectors = self.loadVectorsFromFile( limit, pjoin( inputDir, inFname))
    for threshold in map( lambda x:float(x)/30, range(20,30)):
        clusters = self._hclustering(threshold, vectors)
        if clusters:
            outputLoc = pjoin(outputDir, "threshold.%s.result" % str(threshold))
            with open(outputLoc, 'w') as outf:
                for clusterNo, cluster in clusters.iteritems():
                    outf.write('%s\n' % str(clusterNo))
                    for featureID in cluster:
                        feature, group = IDFeatureDic[featureID]
                        outline = "%s\t%s\n" % (feature, group)
                        outf.write(outline.encode('utf-8'))
                    outf.write("\n")
        else:
            continue

def _hclustering(self, threshold, vectors):
    """function which you should call to vary the threshold
    vectors:    { featureID:    [ tfidf scores, tfidf score, .. ]
    """
    clusters = defaultdict(list)
    if len(vectors) > 1:
        try:
            results = hierarchy.fclusterdata( vectors.values(), threshold, metric='cosine')
        except ValueError, e:
            self.werror("_hclustering: %s" % str(e))
            return False

        for i, featureID in enumerate( vectors.keys()):

- disappearedng

1

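I ran into unexpected negative values in Scipy as well. In my case, the problem was that I didn't know Scipy's trigonometric functions expect radians by default. - doug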

5 Answers

5

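I'm pretty sure this is because you are using the cosine metric when calling fclusterdata. Try using euclidean instead and see whether the error goes away.

The cosine metric can go negative if the dot product of two vectors in your set is greater than 1. Since you are using very large numbers and normalizing them, I'm pretty sure the dot products are greater than 1 much of the time in your data set. If you want to use the cosine metric, you will need to normalize the data so that the dot product of two vectors is never greater than 1. See the formula on this page for how the cosine metric is defined in Scipy.

Edit: From looking at the source code, I think the formula listed on that page isn't actually the one Scipy uses (which is good, because the source code appears to use the normal, proper cosine distance formula). However, by the time the linkage is created, there are evidently some negative values in it for whatever reason. Try computing the distances between your vectors with scipy.spatial.distance.pdist() using metric='cosine' and check for negative values. If there aren't any, then the problem lies in how the linkage is formed from the distance values.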
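A minimal sketch of the pdist check described above. The helper name negative_cosine_distances and the sample matrix X are made up for illustration; the real input would be the list of vectors passed to fclusterdata.

```python
import numpy as np
from scipy.spatial.distance import pdist

def negative_cosine_distances(X):
    """Return the pairwise cosine distances of the rows of X that are below zero."""
    d = pdist(np.asarray(X, dtype=float), metric="cosine")
    return d[d < 0]

# Hypothetical data: rows 0 and 1 are parallel, so their distance should be 0,
# which is exactly where rounding can push the result slightly negative.
X = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0],
              [0.0, 1.0, 0.0]])
neg = negative_cosine_distances(X)
print("negative distances found:", neg.size)
```

If the count is zero on the real data, the distances themselves are fine and the problem is more likely in the linkage step.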

- Justin Peel

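Great answer. Regarding "normalizing the data": what are my options for normalizing my data so that I can still use the cosine distance in scipy? I have also tried it without any normalization (using just the native tfidf values for the calculation). Needless to say, the problem still occurs, because floating-point inaccuracies add up over vectors this long. What would you recommend? - disappearedng

First of all, you should check where the problem arises. Is it right after the distance calculation? If the cosine method is already implemented correctly (even though the documentation suggests otherwise), then normalization shouldn't be needed. By the way, try using 'old_cosine' as your metric and see whether you still get the error. - Justin Peel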
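To illustrate why normalization should not matter for a correctly implemented cosine distance, here is a small sketch with made-up vectors u and v. It only relies on scipy.spatial.distance.cosine; the values are hypothetical and not taken from the asker's file.

```python
import numpy as np
from scipy.spatial.distance import cosine

# Hypothetical tf-idf-like vectors with four decimal places.
u = np.array([0.0001, 0.0003, 0.0007])
v = np.array([0.0002, 0.0001, 0.0005])

d_raw = cosine(u, v)                       # distance on the raw values
d_scaled = cosine(u * 10000, v * 10000)    # distance after the "inflate" step

# Cosine distance depends only on the angle between the vectors, so both
# numbers agree up to floating-point rounding.
print(d_raw, d_scaled)
```

Multiplying every vector by the same positive factor changes the norms and the dot product by matching amounts, so it cannot by itself introduce negative distances.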

1

I ran into the same problem. You can rewrite the cosine function. For example:

from sklearn.metrics.pairwise import cosine_similarity
def mycosine(x1, x2):
    x1 = x1.reshape(1,-1)
    x2 = x2.reshape(1,-1)
    ans = 1 - cosine_similarity(x1, x2)
    return max(ans[0][0], 0)

...

clusters = hierarchy.fclusterdata(data, threshold, criterion='distance', metric=mycosine, method='average')

- Indira Kurmantayeva

1

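"Linkage Z contains negative values". This error also occurs in hierarchical clustering when any linkage cluster index in the linkage matrix is assigned -1.

From what I have observed, a linkage cluster index is assigned -1 during the merging process when the distance between all the clusters or points to be merged is minus infinity. The linkage function therefore merges the clusters even though the linkage distance between them is minus infinity, and assigns a negative index to one of the clusters or points.

To summarize: if you use cosine distance as the metric and the norm (magnitude) of any data point is zero, this error will occur.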
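A short sketch of how one might look for such zero-norm points before clustering. The function name split_zero_norm_rows and the sample data are hypothetical. Note that a row whose scores are all 0.0000 in the input file would be such a point.

```python
import numpy as np

def split_zero_norm_rows(X):
    """Return (rows with nonzero norm, indices of the all-zero rows)."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1)
    keep = norms > 0
    return X[keep], np.flatnonzero(~keep)

# Hypothetical data: the first row is all zeros, so its cosine distance
# to any other row is undefined (division by a zero norm).
X = [[0, 0, 0],
     [1, 0, 2],
     [0, 3, 0]]
clean, dropped = split_zero_norm_rows(X)
print("dropped row indices:", dropped.tolist())
```

Whether to drop such rows or handle them separately is a design choice that depends on the data; the sketch only shows how to find them.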

- Alok Nayak

0

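I can't improve on Justin's answer, but another point to note is your data handling.

You say you do something like int(float("0.0003") * 10000) to read the data. But if you do that, the product is not 3 but 2.9999999999999996, which int() then truncates to 2. That is because the floating-point inaccuracy gets multiplied along with the value.

A better, or at least more accurate, way is to do the multiplication on the string. That is, use string manipulation to get from 0.0003 to 3.0, and so on.

Perhaps there is even a Python data type extension that can read this kind of data without loss of precision, so that the multiplication can be done before the conversion. I am not at home in SciPy/numerics, so I don't know.

Edit

Justin commented that Python has a decimal type built in. It can interpret strings, be multiplied by integers, and be converted to float (I tested that). That being the case, I would recommend updating your logic like this: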
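A minimal sketch of the string-based approach, assuming every field is a plain decimal string such as "0.0003" (no exponent notation). The function name scale_decimal_string is made up for illustration.

```python
def scale_decimal_string(s, places=4):
    """Shift the decimal point of s right by `places` digits and return an int.

    Works purely on the string, so no binary floating-point rounding is involved.
    Digits beyond `places` decimal places are truncated.
    """
    s = s.strip()
    sign = -1 if s.startswith("-") else 1
    s = s.lstrip("+-")
    whole, _, frac = s.partition(".")
    frac = (frac + "0" * places)[:places]   # pad or cut to exactly `places` digits
    return sign * int((whole or "0") + frac)

print(scale_decimal_string("0.0003"))   # 3
print(scale_decimal_string("13.5"))     # 135000
```

Because the result is already an integer, no int() truncation of a slightly-too-small float is needed afterwards.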

import decimal

factor = 1
if inflate:
  factor = 10000
scores = map(lambda x: float(decimal.Decimal(x) * factor), l[1:])

That would reduce your rounding problems somewhat.

- extraneon

Yes, there is such a module. It's called decimal. http://docs.python.org/library/decimal.html - Justin Peel


- dkar · Accepted Answer

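This is caused by floating-point inaccuracy: some distances between your vectors, instead of being 0, come out as, for example, -0.000000000000000002. Use the clip() function to correct the problem. If your distance matrix is dmatr, use numpy.clip(dmatr,0,1,dmatr) and you should be fine.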
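A sketch of how the clipping step could be combined with the clustering, under a few assumptions that are not in the original posts: the distances are computed explicitly with pdist instead of inside fclusterdata, the vectors have no negative components (so cosine distances lie in [0, 1]), and average linkage with the 'distance' criterion is used. The function name cluster_with_clipped_cosine is hypothetical.

```python
import numpy as np
from scipy.cluster import hierarchy
from scipy.spatial.distance import pdist

def cluster_with_clipped_cosine(X, threshold, method="average"):
    """Cluster the rows of X on cosine distance, clipping rounding noise first."""
    d = pdist(np.asarray(X, dtype=float), metric="cosine")
    # Clip in place: tiny negative values caused by rounding become exactly 0.
    # The upper bound of 1 assumes nonnegative vectors (e.g. tf-idf scores).
    np.clip(d, 0.0, 1.0, out=d)
    Z = hierarchy.linkage(d, method=method)
    return hierarchy.fcluster(Z, t=threshold, criterion="distance")

# Hypothetical data: two pairs of parallel vectors.
X = [[1, 0], [2, 0], [0, 1], [0, 3]]
labels = cluster_with_clipped_cosine(X, threshold=0.5)
print(labels)
```

Splitting the work into pdist, linkage, and fcluster exposes the distance vector so it can be clipped; fclusterdata performs these steps internally and gives no opportunity to do so.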