Efficient way to restructure a dictionary of DataFrames

Question

python pandas performance dictionary structure

3

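I have a dictionary filled with several DataFrames. I am looking for an efficient way to change its key structure, but the solution I found becomes quite slow once more DataFrames, or larger ones, are involved. So I would like to ask whether anyone knows a more convenient, efficient, or faster approach than mine. First, here is an example that shows my starting point: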

import pandas as pd
import numpy as np

# assign keys to dic
teams = ["Arsenal", "Chelsea", "Manchester United"]
dic_teams = {}

# fill dic with random entries
for t1 in teams:

    dic_teams[t1] = pd.DataFrame({'date': pd.date_range("20180101", periods=30), 
                                  'Goals': pd.Series(np.random.randint(0,5, size = 30)),
                                  'Chances': pd.Series(np.random.randint(0,15, size = 30)),
                                  'Fouls': pd.Series(np.random.randint(0, 20, size = 30)),
                                  'Offside': pd.Series(np.random.randint(0, 10, size = 30))})

    dic_teams[t1] = dic_teams[t1].set_index('date')
    dic_teams[t1].index.name = None

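I now have a dictionary in which each key is a team, so for each team there is a DataFrame with information about its performance over time. I would like to restructure this dictionary so that the keys are dates instead of teams, meaning that for each date there is a DataFrame holding every team's performance on that date. I achieved this with the code below. It works, but it becomes very slow once I add more teams and performance factors: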

# prepare lists for looping
dates = dic_teams["Arsenal"].index.to_list()
perf = dic_teams["Arsenal"].columns.to_list()
dic_dates = {}

# new structure where key = date
for d in dates:
    dic_dates[d] = pd.DataFrame(index = teams, columns = perf)

    for t2 in teams:
        dic_dates[d].loc[t2] = dic_teams[t2].loc[d]

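Restructuring the dictionary is slow because I am using a nested loop. Does anyone have an idea how to improve the second piece of code? I am not only looking for a solution, but also for better logic or a better approach.

Thanks a lot for your help!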

- Sanoj

1 Answer

- Jérôme Richard · Accepted Answer

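Creating Pandas DataFrames is generally slow, and so is direct indexing. Copying a DataFrame, however, is surprisingly fast. You can therefore build one empty reference DataFrame and copy it once per date. Here is the code: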

dates = dic_teams["Arsenal"].index.to_list()
perf = dic_teams["Arsenal"].columns.to_list()
zygote = pd.DataFrame(index = teams, columns = perf)
dic_dates = {}

# new structure where key = date
for d in dates:
    dic_dates[d] = zygote.copy()

    for t2 in teams:
        dic_dates[d].loc[t2] = dic_teams[t2].loc[d]

This is about twice as fast as the reference on my machine.
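To reproduce this kind of comparison yourself, here is a minimal timing sketch. It is not part of the original answer: the function names `rebuild_fresh` and `rebuild_copy`, the smaller 10-day data set, and the repeat count are all assumptions chosen for illustration, and the ratio you measure will depend on your machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical small data set, shaped like the one in the question
teams = ["Arsenal", "Chelsea", "Manchester United"]
perf = ["Goals", "Chances", "Fouls", "Offside"]
index = pd.date_range("20180101", periods=10)
rng = np.random.default_rng(0)
dic_teams = {
    t: pd.DataFrame(rng.integers(0, 5, size=(len(index), len(perf))),
                    index=index, columns=perf)
    for t in teams
}
dates = index.to_list()


def rebuild_fresh():
    """Reference approach: construct a new empty DataFrame for every date."""
    out = {}
    for d in dates:
        out[d] = pd.DataFrame(index=teams, columns=perf)
        for t in teams:
            out[d].loc[t] = dic_teams[t].loc[d]
    return out


def rebuild_copy():
    """Copy approach: build one empty DataFrame and copy it for every date."""
    zygote = pd.DataFrame(index=teams, columns=perf)
    out = {}
    for d in dates:
        out[d] = zygote.copy()
        for t in teams:
            out[d].loc[t] = dic_teams[t].loc[d]
    return out


# Time a few repetitions of each variant
t_fresh = timeit.timeit(rebuild_fresh, number=3)
t_copy = timeit.timeit(rebuild_copy, number=3)
print(f"fresh: {t_fresh:.4f}s, copy: {t_copy:.4f}s")
```

Both functions should produce the same dictionary, so the only thing being compared is how the empty per-date DataFrame is obtained.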

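Overcoming the slow direct indexing of DataFrames is tricky. We can use numpy to work around it: convert the DataFrames into a 3D numpy array, perform the transposition with numpy, and finally convert the slices back into DataFrames. Note that this approach assumes that all values are integers and that the input DataFrames are well-formed.

Here is the final implementation: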
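Since the numpy approach relies on those assumptions, it may be worth checking them before building the array. The helper below is a hypothetical sketch, not part of the original answer: `check_frames_aligned` is an illustrative name, and it interprets "well-formed" as "same index, same columns in the same order, and integer dtypes everywhere".

```python
import pandas as pd


def check_frames_aligned(frames):
    """Raise ValueError unless all DataFrames share index, columns and integer dtypes."""
    first = next(iter(frames.values()))
    for name, df in frames.items():
        # Index.equals requires the same labels in the same order
        if not df.index.equals(first.index):
            raise ValueError(f"{name}: index differs from the first frame")
        if not df.columns.equals(first.columns):
            raise ValueError(f"{name}: columns differ from the first frame")
        if not all(pd.api.types.is_integer_dtype(t) for t in df.dtypes):
            raise ValueError(f"{name}: non-integer column found")


# Small example data (made up for illustration)
idx = pd.date_range("20180101", periods=3)
good = {
    "Arsenal": pd.DataFrame({"Goals": [1, 2, 0], "Fouls": [3, 4, 5]}, index=idx),
    "Chelsea": pd.DataFrame({"Goals": [0, 1, 1], "Fouls": [7, 2, 9]}, index=idx),
}
check_frames_aligned(good)  # passes silently when the frames line up
```

If the check fails, reindexing each DataFrame to a common index and column order first is one possible remedy.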

dates = dic_teams["Arsenal"].index.to_list()
perf = dic_teams["Arsenal"].columns.to_list()
dic_dates = {}

# Create a numpy array from Pandas dataframes
# Assumes the `dates` index and the `perf` columns are identical, and in the same order, in all dataframes
full = np.empty(shape=(len(teams), len(dates), len(perf)), dtype=int)
for tId,tName in enumerate(teams):
    full[tId,:,:] = dic_teams[tName].to_numpy()

# New structure where key = date, created from the numpy array
for dId,dName in enumerate(dates):
    dic_dates[dName] = pd.DataFrame({pName: full[:,dId,pId] for pId,pName in enumerate(perf)}, index = teams)

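This implementation is 6.4 times faster than the reference implementation on my machine. Note that about 75% of the time is spent in the pd.DataFrame calls. So if you want even faster code, use a plain 3D numpy array!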
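As a rough illustration of that last suggestion, the sketch below keeps everything in one 3D array and uses small lookup dictionaries in place of DataFrame labels. The names `date_pos`, `perf_pos` and `stats_on` are hypothetical helpers introduced here, and the sample data is random; the same assumptions as above (integer values, identically ordered DataFrames) still apply.

```python
import numpy as np
import pandas as pd

# Sample data shaped like the question's dictionary (random values)
teams = ["Arsenal", "Chelsea", "Manchester United"]
perf = ["Goals", "Chances", "Fouls", "Offside"]
dates = pd.date_range("20180101", periods=30)
rng = np.random.default_rng(0)
dic_teams = {
    t: pd.DataFrame(rng.integers(0, 5, size=(len(dates), len(perf))),
                    index=dates, columns=perf)
    for t in teams
}

# Stack the per-team tables into one array with axes (team, date, stat)
full = np.stack([dic_teams[t].to_numpy() for t in teams])

# Label -> position lookups replace the DataFrame index and columns
date_pos = {d: i for i, d in enumerate(dates)}
perf_pos = {p: i for i, p in enumerate(perf)}


def stats_on(date):
    """Return the (teams x stats) slice of `full` for one date, without copying."""
    return full[:, date_pos[pd.Timestamp(date)], :]


day = stats_on("2018-01-05")            # shape: (len(teams), len(perf))
goals_that_day = day[:, perf_pos["Goals"]]  # one value per team
```

Slicing with a single integer along the date axis returns a view, so no per-date object is created until you actually need a DataFrame.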