我正在使用Pandas处理大约300MB的金融数据,这些数据对应于一场拍卖中的限价单。这是多维数据,看起来像这样:
bid ask
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity price quantity
2014-05-13 08:47:16.180000 102.298 1000000 102.297 1500000 102.296 6500000 102.295 8000000 102.294 3000000 102.293 24300000 102.292 6000000 102.291 1000000 102.290 1000000 102.289 2500000 102.288 11000000 102.287 4000000 102.286 10100000 102.284 5000000 102.280 1500000 102.276 3000000 102.275 8100000 102.265 9500000 NaN NaN NaN NaN 102.302 2000000 102.303 6100000 102.304 14700000 102.305 3500000 102.307 9800000 102.308 15500000 102.310 5000000 102.312 7000000 102.313 1000000 102.315 8000000 102.316 4500000 102.320 4000000 102.321 1000000 102.324 4000000 102.325 9500000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2014-05-13 08:47:17.003000 102.298 1000000 102.297 2500000 102.296 6500000 102.295 7000000 102.294 3000000 102.293 24300000 102.292 6000000 102.291 1000000 102.290 1000000 102.289 2500000 102.288 11000000 102.287 4000000 102.286 10100000 102.284 5000000 102.280 1500000 102.276 3000000 102.275 8100000 102.265 9500000 NaN NaN NaN NaN 102.302 2000000 102.303 5100000 102.304 14700000 102.305 4500000 102.307 9800000 102.308 15500000 102.310 5000000 102.312 7000000 102.313 1000000 102.315 8000000 102.316 4500000 102.320 4000000 102.321 1000000 102.324 4000000 102.325 9500000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2014-05-13 08:47:17.005000 102.298 3000000 102.297 3500000 102.296 6000000 102.295 9300000 102.294 4000000 102.293 17500000 102.292 2000000 102.291 4000000 102.290 1000000 102.289 2500000 102.288 6000000 102.287 4000000 102.286 10100000 102.284 5000000 102.280 1500000 102.276 3000000 102.275 8100000 102.265 9500000 NaN NaN NaN NaN 102.302 2000000 102.303 5100000 102.304 14700000 102.305 4500000 102.307 9000000 102.308 16300000 102.310 5000000 102.312 7000000 102.313 1000000 102.315 8000000 102.316 4500000 102.320 4000000 102.321 1000000 102.324 4000000 102.325 9500000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2014-05-13 08:47:17.006000 102.299 1000000 102.298 3000000 102.297 6500000 102.296 5000000 102.295 5300000 102.294 4000000 102.293 15500000 102.292 2000000 102.291 4000000 102.290 1000000 102.289 2500000 102.288 6000000 102.287 4000000 102.286 10100000 102.284 5000000 102.280 1500000 102.276 3000000 102.275 8100000 102.265 9500000 NaN NaN 102.302 2000000 102.303 5100000 102.304 11700000 102.305 7500000 102.307 9000000 102.308 11300000 102.309 5000000 102.310 5000000 102.312 7000000 102.313 1000000 102.315 8000000 102.316 4500000 102.320 4000000 102.321 1000000 102.324 4000000 102.325 9500000 NaN NaN NaN NaN NaN NaN NaN NaN
2014-05-13 08:47:17.007000 102.299 1000000 102.298 3000000 102.297 8500000 102.296 4000000 102.295 4300000 102.294 5000000 102.293 14500000 102.292 2000000 102.291 4000000 102.290 1000000 102.289 2500000 102.288 6000000 102.287 4000000 102.286 10100000 102.284 5000000 102.280 1500000 102.276 3000000 102.275 8100000 102.265 9500000 NaN NaN 102.302 2000000 102.303 4100000 102.304 13700000 102.305 7500000 102.307 8000000 102.308 12300000 102.309 5000000 102.310 5000000 102.312 7000000 102.313 1000000 102.315 8000000 102.316 4500000 102.320 4000000 102.321 1000000 102.324 4000000 102.325 9500000 NaN NaN NaN NaN NaN NaN NaN NaN
请注意,当您达到20时,第一级别会发生变化。对于表格的长格式,我们表示歉意...
我需要进行一些数据透视操作来处理数据。例如,不是使用0、1、2、3...(队列中订单的相对位置),而是使用102.297、102.296等,即订单价格作为索引。以下是这种操作的示例:
x.stack([0,0]).reset_index(drop=True,level=2).set_index("price",append=True).unstack([1,2]).fillna(0).diff().stack([1,1])
yielding:
quantity
side price
2014-05-13 08:47:17.003000 ask 102.300 0
102.301 0
102.302 0
102.303 -1000000
102.304 0
这可以通过
stack/unstack/reset_index
的组合实现,但似乎效率非常低。我没有查看代码,但我猜想每个stack
/unstack
都会复制表格,导致我的8GB系统耗尽内存并开始使用页面文件。我认为在这种情况下也不能使用pivot
,因为所需列位于多重索引中。您有什么建议可以加速吗?
这是一个示例输入CSV文件,如评论所述:
side,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,bid,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask,ask
level,0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8,9,9,10,10,11,11,12,12,13,13,14,14,15,15,16,16,17,17,18,18,19,19,0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8,9,9,10,10,11,11,12,12,13,13,14,14,15,15,16,16,17,17,18,18,19,19
value,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity,price,quantity
2014-05-13 08:47:16.18,102.298,1000000.0,102.297,1500000.0,102.296,6500000.0,102.295,8000000.0,102.294,3000000.0,102.293,2.43E7,102.292,6000000.0,102.291,1000000.0,102.29,1000000.0,102.289,2500000.0,102.288,1.1E7,102.287,4000000.0,102.286,1.01E7,102.284,5000000.0,102.28,1500000.0,102.276,3000000.0,102.275,8100000.0,102.265,9500000.0,N/A,N/A,N/A,N/A,102.302,2000000.0,102.303,6100000.0,102.304,1.47E7,102.305,3500000.0,102.307,9800000.0,102.308,1.55E7,102.31,5000000.0,102.312,7000000.0,102.313,1000000.0,102.315,8000000.0,102.316,4500000.0,102.32,4000000.0,102.321,1000000.0,102.324,4000000.0,102.325,9500000.0,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A
2014-05-13 08:47:17.003,102.298,1000000.0,102.297,2500000.0,102.296,6500000.0,102.295,7000000.0,102.294,3000000.0,102.293,2.43E7,102.292,6000000.0,102.291,1000000.0,102.29,1000000.0,102.289,2500000.0,102.288,1.1E7,102.287,4000000.0,102.286,1.01E7,102.284,5000000.0,102.28,1500000.0,102.276,3000000.0,102.275,8100000.0,102.265,9500000.0,N/A,N/A,N/A,N/A,102.302,2000000.0,102.303,5100000.0,102.304,1.47E7,102.305,4500000.0,102.307,9800000.0,102.308,1.55E7,102.31,5000000.0,102.312,7000000.0,102.313,1000000.0,102.315,8000000.0,102.316,4500000.0,102.32,4000000.0,102.321,1000000.0,102.324,4000000.0,102.325,9500000.0,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A
2014-05-13 08:47:17.005,102.298,3000000.0,102.297,3500000.0,102.296,6000000.0,102.295,9300000.0,102.294,4000000.0,102.293,1.75E7,102.292,2000000.0,102.291,4000000.0,102.29,1000000.0,102.289,2500000.0,102.288,6000000.0,102.287,4000000.0,102.286,1.01E7,102.284,5000000.0,102.28,1500000.0,102.276,3000000.0,102.275,8100000.0,102.265,9500000.0,N/A,N/A,N/A,N/A,102.302,2000000.0,102.303,5100000.0,102.304,1.47E7,102.305,4500000.0,102.307,9000000.0,102.308,1.63E7,102.31,5000000.0,102.312,7000000.0,102.313,1000000.0,102.315,8000000.0,102.316,4500000.0,102.32,4000000.0,102.321,1000000.0,102.324,4000000.0,102.325,9500000.0,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A
2014-05-13 08:47:17.006,102.299,1000000.0,102.298,3000000.0,102.297,6500000.0,102.296,5000000.0,102.295,5300000.0,102.294,4000000.0,102.293,1.55E7,102.292,2000000.0,102.291,4000000.0,102.29,1000000.0,102.289,2500000.0,102.288,6000000.0,102.287,4000000.0,102.286,1.01E7,102.284,5000000.0,102.28,1500000.0,102.276,3000000.0,102.275,8100000.0,102.265,9500000.0,N/A,N/A,102.302,2000000.0,102.303,5100000.0,102.304,1.17E7,102.305,7500000.0,102.307,9000000.0,102.308,1.13E7,102.309,5000000.0,102.31,5000000.0,102.312,7000000.0,102.313,1000000.0,102.315,8000000.0,102.316,4500000.0,102.32,4000000.0,102.321,1000000.0,102.324,4000000.0,102.325,9500000.0,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A
2014-05-13 08:47:17.007,102.299,1000000.0,102.298,3000000.0,102.297,8500000.0,102.296,4000000.0,102.295,4300000.0,102.294,5000000.0,102.293,1.45E7,102.292,2000000.0,102.291,4000000.0,102.29,1000000.0,102.289,2500000.0,102.288,6000000.0,102.287,4000000.0,102.286,1.01E7,102.284,5000000.0,102.28,1500000.0,102.276,3000000.0,102.275,8100000.0,102.265,9500000.0,N/A,N/A,102.302,2000000.0,102.303,4100000.0,102.304,1.37E7,102.305,7500000.0,102.307,8000000.0,102.308,1.23E7,102.309,5000000.0,102.31,5000000.0,102.312,7000000.0,102.313,1000000.0,102.315,8000000.0,102.316,4500000.0,102.32,4000000.0,102.321,1000000.0,102.324,4000000.0,102.325,9500000.0,N/A,N/A,N/A,N/A,N/A,N/A,N/A,N/A
y = y.reset_index(drop=True,level=2)
等)并在行之间添加一些gc.collect()
可能会解决此问题。https://github.com/pydata/pandas/issues/2659 - Raman Shah