Remove values below the 5th percentile and above the 95th percentile within each group

Question

3

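I have a dataset with the columns order_code, city, and weight. Within each city, how can I keep only the parcels whose weight lies between the 5th and 95th percentiles of that city's weight distribution (similar to a SQL window function with over(partition by city))?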

import pandas as pd

df = pd.DataFrame({
    'city': ['LA', 'Berlin', 'Hamburg', 'LA', 'Berlin', 'Hamburg', 'Tokyo', 'Hamburg', 'Berlin', 'Hamburg', 'Hamburg', 'Hamburg', 'Berlin', 'Hamburg', 'Berlin', 'Tokyo', 'Tokyo', 'Tokyo'],
    'weight': [930,933,1577,1018,547,981,1672,598,995,1164,601,1429,1349,1000,618,539,880,1472]
    })

- bluekit46

Post a testable DataFrame. - RomanPerekhrest

2 Answers

0

Using a for loop:

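dflist = []

# Filter each city on its own, then stitch the pieces back together.
for city in df['city'].unique():
    df_city = df[df['city'] == city]
    # Per-city 5th and 95th percentile of weight
    lo, hi = df_city['weight'].quantile([0.05, 0.95])
    # Strict inequalities: rows exactly on a bound are dropped
    df_city = df_city[(df_city['weight'] > lo) & (df_city['weight'] < hi)]
    dflist.append(df_city)

dfe = pd.concat(dflist)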

- Elkhan

With many cities, more than 500, the problem becomes harder. - bluekit46

Use the for loop I posted. - Elkhan

2

You can use groupby instead of looping over the unique values: for city, df_city in df.groupby('city'): - oskros
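The suggestion in that comment could look roughly like the sketch below. It assumes the sample df from the question; the names kept and result are illustrative and do not come from the thread.

```python
import pandas as pd

df = pd.DataFrame({
    'city': ['LA', 'Berlin', 'Hamburg', 'LA', 'Berlin', 'Hamburg', 'Tokyo',
             'Hamburg', 'Berlin', 'Hamburg', 'Hamburg', 'Hamburg', 'Berlin',
             'Hamburg', 'Berlin', 'Tokyo', 'Tokyo', 'Tokyo'],
    'weight': [930, 933, 1577, 1018, 547, 981, 1672, 598, 995, 1164, 601,
               1429, 1349, 1000, 618, 539, 880, 1472],
})

kept = []  # hypothetical name: one filtered sub-frame per city
# groupby yields (key, sub-frame) pairs, so no boolean mask per city is needed
for city, df_city in df.groupby('city'):
    lo, hi = df_city['weight'].quantile([0.05, 0.95])
    # Same strict inequalities as the loop in the answer
    kept.append(df_city[(df_city['weight'] > lo) & (df_city['weight'] < hi)])

result = pd.concat(kept)
```

The filtering logic is unchanged; only the way the per-city sub-frames are obtained differs, which avoids rescanning the full frame once per city.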

Such a biased evaluation. - Elkhan

- RomanPerekhrest · Accepted Answer

Group by city and keep the parcels whose weight falls within the quantile limits:

df.groupby('city').apply(lambda x: x[(x.weight > x.weight.quantile(0.05)) 
                                     & (x.weight < x.weight.quantile(0.95))]).reset_index(drop=True)

     city  weight
0   Berlin     933
1   Berlin     995
2   Berlin     618
3  Hamburg     981
4  Hamburg    1164
5  Hamburg     601
6  Hamburg    1429
7  Hamburg    1000
8    Tokyo     880
9    Tokyo    1472
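For frames with many cities, a variant that may be worth trying is to broadcast each city's bounds back onto the rows with groupby(...).transform and filter once with a single mask. This is only a sketch under the same assumptions as above (the sample df, strict inequalities); the names lower, upper, and filtered are illustrative. It also sidesteps groupby.apply, whose handling of the grouping column appears to differ between pandas versions.

```python
import pandas as pd

df = pd.DataFrame({
    'city': ['LA', 'Berlin', 'Hamburg', 'LA', 'Berlin', 'Hamburg', 'Tokyo',
             'Hamburg', 'Berlin', 'Hamburg', 'Hamburg', 'Hamburg', 'Berlin',
             'Hamburg', 'Berlin', 'Tokyo', 'Tokyo', 'Tokyo'],
    'weight': [930, 933, 1577, 1018, 547, 981, 1672, 598, 995, 1164, 601,
               1429, 1349, 1000, 618, 539, 880, 1472],
})

grouped = df.groupby('city')['weight']
# transform returns a Series aligned with df, holding each row's city-level bound
lower = grouped.transform(lambda s: s.quantile(0.05))
upper = grouped.transform(lambda s: s.quantile(0.95))

# One vectorized mask over the whole frame; rows exactly on a bound are dropped
filtered = df[(df['weight'] > lower) & (df['weight'] < upper)]
```

Unlike the apply version, this keeps the original row order and index, so the same ten rows come back in the order they appear in df rather than sorted by city.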