Python dataframe - groupby and centroid calculation

Question

4

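I have a dataframe with two columns: one contains a category and the other contains a 300-dimensional vector. For each value in the Category column I have many 300-dimensional vectors. What I need is to group the dataframe by the Category column and, at the same time, obtain the centroid of all the vectors belonging to each category.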

Category        Vector   
Balance        [1,2,1,-5,....,9]  
Inquiry        [-5,3,1,5,...,10]  
Card           [-3,1,2,3,...,1]  
Balance        [1,3,-2,1,-5,...,7]  
Card           [3,1,3,4,...,2]

So in the above case, the desired output would be:

Category       Vector   
Balance        [1,2.5,-0.5,-2,....,8]  
Inquiry        [-5,3,1,5,...,10]  
Card           [0,1,2.5,3.5,...,1.5]

I have already written the following function, which takes an array of vectors and computes their centroid:

import numpy as np

def get_intent_centroid(array):
    centroid = np.zeros(len(array[0]))
    for vector in array:
        centroid = centroid + vector
    return centroid/len(array)

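So I only need a quick way to use the above function together with a groupby command on the dataframe.

Please excuse my formatting of the dataframes; I don't know how to format them properly.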

- user7831701

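Not sure how to operate on vectors in a column in pandas, but you could try converting the two columns to lists, doing the computation, and converting back to pandas! - Dreams

I think the whole computation would be faster without using lists. - user7831701

@Tarun, how would you do it using lists? - user7831701

I have posted an answer; you can try it if you can't find a way in pandas. - Dreams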

4 Answers

1

As the OP requested, here is a way to do it with lists:

vectorsList = list(df["Vector"])
catList = list(df["Category"])

# create a dict entry for each category and initialise it with a list of 300 zeros
dictOfCats = {}
for each in set(catList):
    dictOfCats[each] = [0] * 300

# loop through vectorsList and catList, adding each vector to its own category's running sum
for i in range(0, len(catList)):
    currentVec = dictOfCats[catList[i]]
    for j in range(0, len(vectorsList[i])):
        currentVec[j] = vectorsList[i][j] + currentVec[j]
    dictOfCats[catList[i]] = currentVec

# now each element in the dict holds a sum; divide it by the count of each category
# you could calculate the frequency with groupby; since I have used only lists, I show it with lists
catFreq = {}
for eachCat in catList:
    if eachCat in catFreq:
        catFreq[eachCat] = catFreq[eachCat] + 1
    else:
        catFreq[eachCat] = 1


for eachKey in dictOfCats:
    currentVec = dictOfCats[eachKey]
    newCurrentVec = [x / catFreq[eachKey] for x in currentVec]
    dictOfCats[eachKey] = newCurrentVec

#now change this dictOfCats to dataframe again

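Note that the code may contain errors, since I have not checked it against your data. It will be computationally expensive, but it should get the job done if you cannot find a solution with pandas. If you do find a solution in pandas, please post it as an answer.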
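
The last comment in the code leaves the conversion back to a dataframe open. One possible sketch of that step follows. It assumes `dictOfCats` maps each category to its centroid list; the 5-dimensional values and the name `centroids_df` are made up for illustration and are not the asker's data.

```python
import pandas as pd

# Hypothetical stand-in for the dictOfCats built by the list-based loops,
# shortened to 5 dimensions for illustration.
dictOfCats = {
    "Balance": [1.0, 2.5, -0.5, -2.0, 8.0],
    "Inquiry": [-5.0, 3.0, 1.0, 5.0, 10.0],
    "Card": [0.0, 1.0, 2.5, 3.5, 1.5],
}

# One row per category; each centroid list stays in a single cell,
# mirroring the Category / Vector layout used in the question.
centroids_df = pd.DataFrame(
    {"Category": list(dictOfCats.keys()), "Vector": list(dictOfCats.values())}
)
print(centroids_df)
```

Keeping the whole list in one cell gives an object-dtype column, the same shape as the input dataframe described in the question.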

- Dreams

0

import pandas as pd
import numpy as np

df = pd.DataFrame(
    [
        {'category': 'Balance', 'vector':  [1,2,1,-5,9]},
        {'category': 'Inquiry', 'vector': [-5,3,1,5,10]},
        {'category': 'Card', 'vector': [-3,1,2,3,1]},
        {'category': 'Balance', 'vector':  [1,3,-2,1,7]},
        {'category': 'Card', 'vector':  [3,1,3,4,2]}
    ]
)


def get_intent_centroid(array):
    centroid = np.zeros(len(array[0]))
    for vector in array:
        centroid = centroid + vector
    return centroid/len(array)


df.groupby('category')['vector'].apply(lambda x: get_intent_centroid(x.tolist()))

Output:

category
Balance    [1.0, 2.5, -0.5, -2.0, 8.0]
Card         [0.0, 1.0, 2.5, 3.5, 1.5]
Inquiry    [-5.0, 3.0, 1.0, 5.0, 10.0]
Name: vector, dtype: object
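
The result above is a Series indexed by category. If a two-column dataframe like the desired output in the question is preferred, calling `reset_index()` on it is one way to get there. This is a small sketch using the same sample data; the names `centroids` and `result` are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "category": ["Balance", "Inquiry", "Card", "Balance", "Card"],
    "vector": [[1, 2, 1, -5, 9], [-5, 3, 1, 5, 10], [-3, 1, 2, 3, 1],
               [1, 3, -2, 1, 7], [3, 1, 3, 4, 2]],
})


def get_intent_centroid(array):
    # Sum the vectors one by one, then divide by how many there are.
    centroid = np.zeros(len(array[0]))
    for vector in array:
        centroid = centroid + vector
    return centroid / len(array)


# Series: index = category, values = centroid arrays.
centroids = df.groupby("category")["vector"].apply(
    lambda x: get_intent_centroid(x.tolist())
)

# Move the category index back into an ordinary column.
result = centroids.reset_index()
print(result)
```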

- mgcdanny

0

This should also be doable without using lists.

def get_intent_centroid(array):
    centroid = np.zeros(len(array.iloc[0]))
    for vector in array:
        centroid = centroid + vector
    # divide by the number of vectors in the group, not by the vector length
    return centroid/len(array)

df.groupby('Category')['Vector'].apply(get_intent_centroid)

- alhanaei

- Ken Syme · Accepted Answer

The centroid of a list of vectors is just the mean of each dimension of the vectors, so this can be greatly simplified to the following.

df.groupby('Category')['Vector'].apply(lambda x: np.mean(x.tolist(), axis=0))

It should also be faster than any of the looping / converting to list approaches.
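
As a quick sanity check, the sketch below runs the one-liner on the question's sample rows, shortened to 5 dimensions, and compares it with the loop-based function from the question. The shortened vectors and the names `by_mean` and `by_loop` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Category": ["Balance", "Inquiry", "Card", "Balance", "Card"],
    "Vector": [[1, 2, 1, -5, 9], [-5, 3, 1, 5, 10], [-3, 1, 2, 3, 1],
               [1, 3, -2, 1, 7], [3, 1, 3, 4, 2]],
})


def get_intent_centroid(array):
    # Loop-based centroid from the question, kept here for comparison.
    centroid = np.zeros(len(array[0]))
    for vector in array:
        centroid = centroid + vector
    return centroid / len(array)


# Mean over axis 0 averages each dimension across the vectors of a group.
by_mean = df.groupby("Category")["Vector"].apply(
    lambda x: np.mean(x.tolist(), axis=0)
)
by_loop = df.groupby("Category")["Vector"].apply(
    lambda x: get_intent_centroid(x.tolist())
)

# Both approaches should agree for every category.
for category in by_mean.index:
    assert np.allclose(by_mean[category], by_loop[category])

print(by_mean)
```

On this sample the Balance centroid works out to 1, 2.5, -0.5, -2, 8, which matches the desired output in the question.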