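I am using Pandas 1.0 to write an efficient program that computes the running maximum for each group of items sharing the same ID in a dataset. My current program uses iterrows() and sets each high-water mark by index, which makes it very slow. Because the dataset is very large, this is not a viable solution.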
import pandas as pd
import sys

data = [[1, 10],
        [1, 15],
        [1, 10],
        [1, 0],
        [1, 5],
        [1, 20],
        [1, 0],
        [1, 10],
        [2, 5],
        [2, 15],
        [2, 10],
        [2, 20],
        [2, 25],
        [2, 20],
        [2, 30],
        [2, 10]]
df = pd.DataFrame(data, columns=['id', 'val'])

high_water_mark = -sys.maxsize
previous_row = None
for index, row in df.iterrows():
    current_val = row['val']
    if index == 0:
        # First row: it starts the first group
        df.loc[index, 'running_maximum'] = current_val
        high_water_mark = current_val
        previous_row = row
        continue
    if row['id'] == previous_row['id'].item():
        # Same id as the previous row: update or carry the high-water mark
        if current_val > high_water_mark:
            df.loc[index, 'running_maximum'] = current_val
            high_water_mark = current_val
        else:
            df.loc[index, 'running_maximum'] = high_water_mark
    else:
        # New id: reset the high-water mark
        df.loc[index, 'running_maximum'] = current_val
        high_water_mark = current_val
    previous_row = row
print(df)
Output:
    id  val  running_maximum
0    1   10             10.0
1    1   15             15.0
2    1   10             15.0
3    1    0             15.0
4    1    5             15.0
5    1   20             20.0
6    1    0             20.0
7    1   10             20.0
8    2    5              5.0
9    2   15             15.0
10   2   10             15.0
11   2   20             20.0
12   2   25             25.0
13   2   20             25.0
14   2   30             30.0
15   2   10             30.0
Are there any suggestions on how to speed this up?
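For reference, the per-ID running maximum described above can probably be expressed without a Python-level loop. The following is a minimal sketch, not a benchmarked solution: it assumes the rows are already in the order in which the maximum should accumulate, and it uses `groupby(...).cummax()`. The column name `running_maximum_runs` and the helper `run_id` are hypothetical names introduced only for this illustration.

```python
import pandas as pd

data = [[1, 10], [1, 15], [1, 10], [1, 0], [1, 5], [1, 20], [1, 0], [1, 10],
        [2, 5], [2, 15], [2, 10], [2, 20], [2, 25], [2, 20], [2, 30], [2, 10]]
df = pd.DataFrame(data, columns=['id', 'val'])

# Cumulative maximum of 'val' within each 'id' group, in row order.
# All rows sharing an id are pooled, even if they are not adjacent.
df['running_maximum'] = df.groupby('id')['val'].cummax()

# Variant (assumption: the maximum should reset whenever the id changes
# between consecutive rows, as the loop version does). 'run_id' labels
# each block of consecutive rows that share the same id.
run_id = df['id'].ne(df['id'].shift()).cumsum()
df['running_maximum_runs'] = df.groupby(run_id)['val'].cummax()

print(df)
```

On the sample data both columns hold the same values, because each id occupies one contiguous block. They would differ only if the same id reappeared after a different id. Note that the result keeps the integer dtype of `val`, whereas the loop version produces floats because the new column is first created with missing values.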