I have noticed that pandas is very slow when merging DataFrames on a MultiIndex. Assignment is sometimes slow as well.
import pandas as pd
import numpy as np
from pandas_datareader import data
import datetime
import string
import random
start = datetime.datetime(2002, 1, 1)
end = datetime.datetime(2018, 1, 1)
def id_generator(size=6, chars=string.ascii_uppercase + string.digits):
    # Build a random column name of `size` uppercase letters and digits.
    return ''.join(random.choice(chars) for _ in range(size))
columns = [id_generator() for i in range(1000)]
dateindex = pd.date_range(start, end)
df = pd.DataFrame(np.random.randint(1, 100, (len(dateindex), len(columns))), columns=columns, index=dateindex)
df.columns = df.columns.rename('Name')
df.index = df.index.rename('Date')
df1 = df.pct_change(1).stack().rename('change1').to_frame()
df2 = df.pct_change(2).stack().rename('change2').to_frame()
df3 = df1.reset_index()
df4 = df2.reset_index()
%timeit pd.merge(df1, df2, left_index=True, right_index=True)
In [11]: 46.7 s ± 656 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit pd.merge(df3, df4, on=['Date', 'Name'])
In [12]: 3.17 s ± 168 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
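The merge on the MultiIndex is more than 10 times slower than the merge on columns. Does anyone know what is going on? Is it always better to reset the index and join on columns rather than use a MultiIndex?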
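To make the workaround I am asking about concrete, here is a minimal sketch on a tiny frame. The helper name `merge_via_columns` and the small random test data are my own, for illustration only. It resets the index, merges on the key columns, restores the MultiIndex, and then checks that the result matches the direct index merge. It says nothing about why the index merge is slow.

```python
import numpy as np
import pandas as pd


def merge_via_columns(left, right, keys):
    """Merge two MultiIndexed frames by going through plain columns.

    Hypothetical helper: reset both indexes, merge on the key columns,
    then set the keys back as the index of the result.
    """
    merged = pd.merge(left.reset_index(), right.reset_index(), on=keys)
    return merged.set_index(keys)


# Small stand-in for the frames in the question (15 rows instead of millions).
dates = pd.date_range("2002-01-01", periods=5)
names = ["AAA", "BBB", "CCC"]
idx = pd.MultiIndex.from_product([dates, names], names=["Date", "Name"])

rng = np.random.default_rng(0)
left = pd.DataFrame({"change1": rng.random(len(idx))}, index=idx)
right = pd.DataFrame({"change2": rng.random(len(idx))}, index=idx)

by_index = pd.merge(left, right, left_index=True, right_index=True)
by_columns = merge_via_columns(left, right, ["Date", "Name"])

# Both routes should produce the same rows once sorted by the index.
print(by_index.sort_index().equals(by_columns.sort_index()))
```

If the two results agree on the real data too, the column route could stand in for the index merge, at the cost of two `reset_index` calls and one `set_index` call.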