Pandas dataframe: drop duplicate index entries, keeping the max value based on a column value

Question

pandas dataframe jupyter-notebook time-series trading

3

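This is my current dataframe. I want to transform it in the following 3 steps. I need to remove duplicate timestamps, but keep the maximum or minimum price depending on the "Side" column. Please help :)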

I tried df = df[~df.index.duplicated(keep='first')], but it has no option to keep the maximum or minimum value.

The index is a datetime type, Price is float, Side is an integer, and the dataframe has 8000+ rows.

                          Price      Side  
2021-12-13 00:00:03.285   51700      4     
2021-12-13 00:00:03.315   51675      3    
2021-12-13 00:00:03.333   50123      4    
2021-12-13 00:00:03.333   50200      3    
2021-12-13 00:00:03.333   50225      3   
2021-12-13 00:00:03.333   50250      3    
2021-12-13 00:00:03.421   50123      4     
2021-12-13 00:00:03.421   50117      4     
2021-12-13 00:00:03.421   50110      4    
2021-12-13 00:00:03.671   50100      3

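If the timestamp is duplicated and the side is 3, keep the highest price; if the timestamp is duplicated and the side is 4, keep the lowest price.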

Desired Output:
                          Price      Side  
2021-12-13 00:00:03.285   51700      4     
2021-12-13 00:00:03.315   51675      3    
2021-12-13 00:00:03.333   50123      4 
2021-12-13 00:00:03.333   50250      3     
2021-12-13 00:00:03.421   50110      4     
2021-12-13 00:00:03.671   50100      3
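
A minimal sketch of this first step, assuming the sample data above. The names `tmp`, `ts`, `key`, `keep` and `step1` are made up for illustration. The idea is to flip the sign of the side-4 prices so that a single "largest key per (timestamp, side)" rule covers both cases; it is one possible approach, not necessarily the most efficient one.

```python
import pandas as pd

# Sample data copied from the table above.
stamps = ["03.285", "03.315", "03.333", "03.333", "03.333",
          "03.333", "03.421", "03.421", "03.421", "03.671"]
df = pd.DataFrame(
    {"Price": [51700, 51675, 50123, 50200, 50225, 50250, 50123, 50117, 50110, 50100],
     "Side": [4, 3, 4, 3, 3, 3, 4, 4, 4, 3]},
    index=pd.to_datetime(["2021-12-13 00:00:" + s for s in stamps]),
)

# Work on a copy with a unique integer index so rows can be picked by label.
tmp = df.rename_axis("ts").reset_index()

# Flip the sign of side-4 prices: the largest key is then the highest price
# for side 3 and the lowest price for side 4.
tmp["key"] = tmp["Price"].where(tmp["Side"] == 3, -tmp["Price"])

# Label of the winning row in each (timestamp, side) group.
keep = tmp.groupby(["ts", "Side"])["key"].idxmax()

# Select the winners, restore the original row order, and put the timestamps back.
step1 = tmp.loc[keep.to_numpy()].sort_index().set_index("ts")[["Price", "Side"]]
print(step1)
```

Like the answers below, this assumes Side only ever holds 3 or 4; any other value would be treated as if it were side 4.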

Create new columns "3" and "4" and fill them with the corresponding price.

Desired Output:
                          Price      3         4  
2021-12-13 00:00:03.285   51700      0         51700
2021-12-13 00:00:03.315   51675      51675     0  
2021-12-13 00:00:03.333   50123      0         50123
2021-12-13 00:00:03.333   50250      50250     0     
2021-12-13 00:00:03.421   50110      0         50110  
2021-12-13 00:00:03.671   50100      50100     0
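
This second step could be sketched as below, starting from the de-duplicated frame of the previous step (rebuilt by hand here so the block runs on its own). `step1` and `step2` are illustrative names, and the new columns use the integers 3 and 4 as labels, which is an assumption about what the question intends.

```python
import pandas as pd

# Step-1 result from the question, rebuilt by hand.
stamps = ["03.285", "03.315", "03.333", "03.333", "03.421", "03.671"]
step1 = pd.DataFrame(
    {"Price": [51700, 51675, 50123, 50250, 50110, 50100],
     "Side": [4, 3, 4, 3, 4, 3]},
    index=pd.to_datetime(["2021-12-13 00:00:" + s for s in stamps]),
)

step2 = step1[["Price"]].copy()
for side in (3, 4):
    # Price where the row belongs to this side, 0 elsewhere.
    # to_numpy() sidesteps index alignment, since the index still has duplicates.
    step2[side] = step1["Price"].where(step1["Side"] == side, 0).to_numpy()
print(step2)
```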

Fill the blanks with the previous value from the same column.

Desired Output:
                          Price      3         4  
2021-12-13 00:00:03.285   51700      0         51700  
2021-12-13 00:00:03.315   51675      51675     51700  
2021-12-13 00:00:03.333   50123      51675     50123
2021-12-13 00:00:03.333   50250      50250     50123     
2021-12-13 00:00:03.421   50110      50250     50110  
2021-12-13 00:00:03.671   50100      50100     50110
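
A sketch of this last step, assuming that a 0 in columns 3 and 4 always means "blank" (a genuine price of 0 would be overwritten too). The frame `step2` is the step-2 output rebuilt by hand, and `step3` is an illustrative name.

```python
import pandas as pd

# Step-2 result from the question, rebuilt by hand.
stamps = ["03.285", "03.315", "03.333", "03.333", "03.421", "03.671"]
step2 = pd.DataFrame(
    {"Price": [51700, 51675, 50123, 50250, 50110, 50100],
     3: [0, 51675, 0, 50250, 0, 50100],
     4: [51700, 0, 50123, 0, 50110, 0]},
    index=pd.to_datetime(["2021-12-13 00:00:" + s for s in stamps]),
)

step3 = step2.copy()
for side in (3, 4):
    filled = (
        step2[side]
        .mask(step2[side].eq(0))  # turn the 0 placeholders into NaN
        .ffill()                  # carry the previous value forward
        .fillna(0)                # leading blanks have no previous value
        .astype(int)
    )
    step3[side] = filled.to_numpy()
print(step3)
```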

- blueorchid

What does "if side is 3" mean? The group "2021-12-13 00:00:03.333" has four items, three with side 3 and one with side 4. How do you determine the side for that timestamp? - user17242583

1

Oh wait, you want to group by both time and side. - user17242583

2 Answers

0

Here is one option, a bit long:

(df.assign(temp = df.Side.map({4:'low', 3:'high'}))
.groupby([pd.Grouper(level=0), 'Side', 'temp'], sort = False)
.Price
.agg(['min', 'max'])
.unstack('Side')
.loc(axis=1)[[('max', 3), ('min', 4)]]
.droplevel(level = 0,axis = 1)
.droplevel(level = 'temp')
.assign(Price=lambda df: df[3].where(df[3].notna(), df[4]))
.ffill()
.fillna(0)
.astype(int)
.rename_axis(columns = None)
)

                             3      4  Price
2021-12-13 00:00:03.285      0  51700  51700
2021-12-13 00:00:03.315  51675  51700  51675
2021-12-13 00:00:03.333  51675  50123  50123
2021-12-13 00:00:03.333  50250  50123  50250
2021-12-13 00:00:03.421  50250  50110  50110
2021-12-13 00:00:03.671  50100  50110  50100

This assumes that the only values in Side are 3 and 4.
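
Because the chain depends on that assumption without checking it, it may be worth verifying before running it: a row with any other Side value maps to NaN in `temp` and would most likely be dropped by `groupby`, whose default is `dropna=True`. A small sketch of such a check; `check_sides` is a hypothetical helper, not part of pandas.

```python
import pandas as pd

def check_sides(frame, allowed=(3, 4)):
    """Raise if the Side column holds anything outside `allowed`."""
    unexpected = set(frame["Side"].unique().tolist()) - set(allowed)
    if unexpected:
        raise ValueError(f"unexpected Side values: {sorted(unexpected)}")

# Passes silently when only 3 and 4 are present.
check_sides(pd.DataFrame({"Price": [51700, 51675], "Side": [4, 3]}))
```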

- sammywemmy

- user17242583 · Accepted Answer

new_df = (df
    .groupby([pd.Grouper(level=0), 'Side'])
    .apply(lambda x: x['Price'].max() if x['Side'].mode()[0] == 3 else x['Price'].min())
    .reset_index()
)
new_df = (
    pd.concat([
        new_df,
        (new_df
            .pivot(columns='Side', values=0)
            .ffill()
            .fillna(0)
        )
    ], axis=1)
    .drop('Side', axis=1)
    .rename({0: 'Price'}, axis=1)
)

Output:

>>> new_df
                    index  Price        3        4
0 2021-12-13 00:00:03.285  51700      0.0  51700.0
1 2021-12-13 00:00:03.315  51675  51675.0  51700.0
2 2021-12-13 00:00:03.333  50250  50250.0  51700.0
3 2021-12-13 00:00:03.333  50123  50250.0  50123.0
4 2021-12-13 00:00:03.421  50110  50250.0  50110.0
5 2021-12-13 00:00:03.671  50100  50100.0  50110.0

Compact version:

new_df = df.groupby([pd.Grouper(level=0), 'Side']).apply(lambda x: x['Price'].max() if x['Side'].mode()[0] == 3 else x['Price'].min()).reset_index()
new_df = pd.concat([new_df, new_df.pivot(columns='Side', values=0).ffill().fillna(0)], axis=1).drop('Side', axis=1).rename({0:'Price'}, axis=1)