Using the scipy pdist() function on a Pandas DataFrame

Question

I have a large DataFrame (e.g. 15k objects) where each row is an object and the columns are numeric object features. It has the following form:

import pandas as pd

df = pd.DataFrame({ 'A' : [0, 0, 1],
                    'B' : [2, 3, 4],
                    'C' : [5, 0, 1],
                    'D' : [1, 1, 0]},
                    columns= ['A','B', 'C', 'D'], index=['first', 'second', 'third'])

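I want to compute the pairwise distances between all objects (rows), and I have read that scipy's pdist() function is a good solution because of its computational efficiency. I can simply call: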

from scipy.spatial.distance import pdist
res = pdist(df, 'cityblock')
res
>> array([ 6.,  8.,  4.])

Note that the distances in the res array are ordered as follows: [first-second, first-third, second-third].
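One way to check this ordering is to pair the condensed vector with itertools.combinations over the index, which enumerates row pairs (i < j) in the same order that pdist uses. This is only a minimal sketch built on the example data above; the names pairs and labeled are illustrative, and df.to_numpy() is used here just to pass a plain array to pdist.

```python
from itertools import combinations

import pandas as pd
from scipy.spatial.distance import pdist

df = pd.DataFrame({'A': [0, 0, 1],
                   'B': [2, 3, 4],
                   'C': [5, 0, 1],
                   'D': [1, 1, 0]},
                  index=['first', 'second', 'third'])

# Condensed distance vector: one entry per unordered pair of rows.
res = pdist(df.to_numpy(), 'cityblock')

# combinations() yields (first, second), (first, third), (second, third),
# matching the order of the entries in the condensed vector.
pairs = list(combinations(df.index, 2))

# Illustrative dict keyed by the pair of row labels.
labeled = {pair: float(d) for pair, d in zip(pairs, res)}
print(labeled)
```

A dict like this answers "which pair does each value belong to", but it is less convenient than a labeled matrix for per-row ranking.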

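My question is: how can I get this in matrix, DataFrame, or (less preferably) dict format, so that I know exactly which pair each distance value belongs to? For example, something like the following: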

       first second third
first    0      -     -
second   6      0     -
third    8      4     0

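Ultimately, I think having the distance matrix as a pandas DataFrame would be convenient, since I could then apply ranking and sorting operations to each row (e.g. find the top N objects closest to the object first).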

- Zhubarb

1 Answer

- Zhubarb · Accepted Answer

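Oh, I found the answer on this web page. Apparently there is a dedicated function for this called squareform(). I will not delete my question for now, in case it helps someone else.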

from scipy.spatial.distance import pdist, squareform
res = pdist(df, 'cityblock')
squareform(res)  # square, symmetric n x n distance matrix as a NumPy array
pd.DataFrame(squareform(res), index=df.index, columns=df.index)
>>        first  second  third
>>first       0       6      8
>>second      6       0      4
>>third       8       4      0
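With the labeled distance matrix in hand, the "top N closest objects" lookup mentioned in the question could be sketched roughly as follows. The helper nearest() and the variable names are hypothetical, introduced only for illustration; the sketch assumes the same example df and uses Series.nsmallest for the ranking.

```python
import pandas as pd
from scipy.spatial.distance import pdist, squareform

df = pd.DataFrame({'A': [0, 0, 1],
                   'B': [2, 3, 4],
                   'C': [5, 0, 1],
                   'D': [1, 1, 0]},
                  index=['first', 'second', 'third'])

# Square distance matrix labeled with the row index on both axes.
dist = pd.DataFrame(squareform(pdist(df.to_numpy(), 'cityblock')),
                    index=df.index, columns=df.index)


def nearest(dist, label, n):
    """Hypothetical helper: the n smallest distances from `label`,
    excluding the zero distance of the object to itself."""
    return dist.loc[label].drop(label).nsmallest(n)


closest = nearest(dist, 'first', 2)
print(closest)
```

For the roughly 15k rows mentioned in the question, the square float64 matrix would hold 15,000 x 15,000 values, about 1.8 GB, so this approach is probably only practical when that much memory is available.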