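I can create a multi-level dictionary from the categorical columns of a pandas dataframe, up to three levels deep (see the code below).
But my solution is too hard-coded. What if I want to 'split' by 10 categorical columns?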
I am looking for something that can do what the following pseudocode describes:
d = {'A': ['a1','a1','a2'], 'B': ['b1','b2','b3'], 'C': ['c1','c2','c2'], 'v': [0,5,1]}
df = pd.DataFrame(data=d)
dA = tree(df=df, cols=['A'])
# gives a dictionary of two dataframes
# "tree" should be some standard implementation
# a1
# a2
dB = tree(df=df, cols=['A', 'B'])
# gives a dictionary with three dataframes at the lowest level
# a1_b1
# a1_b2
# a2_b3
# "tree" should be ready for any number of cols
# access operations
dA['a1'], dB['a1'], dB['a1']['b1'], ...
# iteration operation (transpose is just an example)
dA = dA.iter.T  # transposes every dataframe
dB = dB.iter.T  # transposes every dataframe on the lowest level, i.e. dB['a1']['b1'].T, dB['a1']['b2'].T, ...
# some operations will require access to the dictionary keys to make sense or to have enough flexibility:
dA.iter.to_csv(str(key) + '.csv')
# produces a1.csv, a2.csv
dB.iter.to_csv(str(key) + '.csv')
# produces a1_b1.csv, a1_b2.csv, a2_b3.csv
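Basically: I want to easily create a nested dictionary of arbitrary depth from a dataframe, have "key-level" functions that work at any depth, and iterate over the whole dictionary without writing code for each level.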
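To make the requirement concrete, here is a minimal sketch of what such a function could look like, assuming plain recursion over `DataFrame.groupby`. The name `tree` is taken from the pseudocode; the implementation is only an illustration and not a standard pandas API. Note that `groupby` drops rows whose key is NaN by default.

```python
import pandas as pd

def tree(df, cols):
    """Hypothetical helper: nest df into dicts, one level per column in cols."""
    if not cols:
        return df  # no columns left to split on, so this is a leaf
    first, rest = cols[0], cols[1:]
    # sort=False keeps the keys in order of first appearance
    return {key: tree(group, rest) for key, group in df.groupby(first, sort=False)}

d = {'A': ['a1', 'a1', 'a2'], 'B': ['b1', 'b2', 'b3'], 'C': ['c1', 'c2', 'c2'], 'v': [0, 5, 1]}
df = pd.DataFrame(data=d)

dA = tree(df, ['A'])       # one level: keys a1, a2
dB = tree(df, ['A', 'B'])  # two levels: a1 -> b1, b2 and a2 -> b3
```

Because the recursion consumes one column per call, the same function should work unchanged for 10 columns or for 1.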
My code:
import pandas as pd
from collections import defaultdict
# sample dataframe
d = {'A': ['a1','a1','a2'], 'B': ['b1','b2','b3'], 'C': ['c1','c2','c2'], 'v': [0,5,1]}
df = pd.DataFrame(data=d)
# make a dictionary of dataframes based on a categorical column; every category is a key to a dataframe
def dict_dfs_based_on_cat(df, col):
    # one sub-dataframe per unique value of `col`, in order of first appearance
    return {key: df[df[col] == key] for key in df[col].unique()}
# 1st level
di_A = dict_dfs_based_on_cat(df, 'A')
# 2nd level
di_A_B = {}
for a in di_A:
    di_A_B[a] = dict_dfs_based_on_cat(di_A[a], 'B')
# 3rd level
di_A_B_C = defaultdict(dict)
for a in di_A:
    for b in di_A_B[a]:
        di_A_B_C[a][b] = dict_dfs_based_on_cat(di_A_B[a][b], 'C')
# operations on 3rd level
def iter_di(msg, func, di):
    print(msg)
    for a in di:
        for b in di[a]:
            for c in di[a][b]:
                func(a, b, c, di)

def save(a, b, c, di):
    di[a][b][c].to_csv(str(a) + '_' + str(b) + '_' + str(c) + '.csv', index=False)

# sample operation
iter_di('saving', save, di_A_B_C)
# a1_b1_c1.csv
# a1_b2_c2.csv
# a2_b3_c2.csv
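The three nested loops in `iter_di` fix the depth at three. One way to lift that restriction, sketched here as an assumption rather than an established recipe, is a recursive walk that hands every leaf its full key path. `iter_leaves` and `map_leaves` are hypothetical names, not part of pandas, and the small `nested` dictionary is built by hand only for illustration.

```python
import pandas as pd

def iter_leaves(node, path=()):
    """Yield (key_path, leaf) for every non-dict leaf of a nested dict."""
    if isinstance(node, dict):
        for key, child in node.items():
            yield from iter_leaves(child, path + (key,))
    else:
        yield path, node

def map_leaves(node, func, path=()):
    """Return a new nested dict with func(key_path, leaf) applied to each leaf."""
    if isinstance(node, dict):
        return {k: map_leaves(v, func, path + (k,)) for k, v in node.items()}
    return func(path, node)

# tiny nested dict of dataframes, two levels deep
nested = {
    'a1': {'b1': pd.DataFrame({'v': [0]}), 'b2': pd.DataFrame({'v': [5]})},
    'a2': {'b3': pd.DataFrame({'v': [1]})},
}

# file names a save step would use, one per leaf
names = ['_'.join(map(str, path)) + '.csv' for path, _ in iter_leaves(nested)]

# transpose every dataframe at the lowest level, keeping the nesting
transposed = map_leaves(nested, lambda path, leaf: leaf.T)
```

Saving then becomes a single loop regardless of depth: `for path, leaf in iter_leaves(nested): leaf.to_csv('_'.join(map(str, path)) + '.csv', index=False)`.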