Grouping data by value ranges

Question

17

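I have a CSV file that lists the parts on an order. The columns include days late, quantity, and commodity.

I need to group the data by days late and commodity, and compute the sum of the quantity. However, days late needs to be grouped into ranges.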

>56
>35 and <= 56
>14 and <= 35
>0 and <=14

I was hoping I could use a dict somehow. Something like this:

{'Red':'>56','Amber':'>35 and <= 56','Yellow':'>14 and <= 35','White':'>0 and <=14'}

The result I am looking for is this:

        Red  Amber  Yellow  White
STRSUB  56   60     74      40
BOTDWG  20   67     87      34

I am new to pandas, so I don't know whether this is possible at all. Could anyone offer some advice?

Thanks!

- PrestonDocks

3 Answers

6

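You can create a new column in your DataFrame based on your Days Late column using the map or apply functions, as follows. Let's first create some sample data.

import numpy
import pandas

df = pandas.DataFrame({ 'ID': 'foo,bar,foo,bar,foo,bar,foo,foo'.split(','),
                        'Days Late': numpy.random.randn(8)*20+30})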

   Days Late   ID
0  30.746244  foo
1  16.234267  bar
2  14.771567  foo
3  33.211626  bar
4   3.497118  foo
5  52.482879  bar
6  11.695231  foo
7  47.350269  foo

Create a helper function to transform the values of the 'Days Late' column, and add a column called 'Code'.

def days_late_xform(dl):
    if dl > 56: return 'Red'
    elif 35 < dl <= 56: return 'Amber'
    elif 14 < dl <= 35: return 'Yellow'
    elif 0 < dl <= 14: return 'White'
    else: return 'None'

df["Code"] = df['Days Late'].map(days_late_xform)

   Days Late   ID    Code
0  30.746244  foo  Yellow
1  16.234267  bar  Yellow
2  14.771567  foo  Yellow
3  33.211626  bar  Yellow
4   3.497118  foo   White
5  52.482879  bar   Amber
6  11.695231  foo   White
7  47.350269  foo   Amber

Finally, you can use groupby to aggregate on the ID and Code columns and get the count of each group:

g = df.groupby(["ID","Code"]).size()
print(g)

ID   Code
bar  Amber     1
     Yellow    2
foo  Amber     1
     White     2     
     Yellow    2

df2 = g.unstack()
print(df2)

Code  Amber  White  Yellow
ID
bar       1    NaN       2
foo       1      2       2
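The counts above come from size(), while the question asks for the sum of the quantity in each range. A minimal sketch of that variation follows, assuming a hypothetical 'Quantity' column and made-up sample values (neither appears in this answer's data):

```python
import pandas as pd

# Made-up sample data; the 'Quantity' column name is an assumption.
df = pd.DataFrame({
    'ID': ['foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
    'Days Late': [60, 40, 20, 20, 10, 70],
    'Quantity': [5, 3, 2, 4, 1, 7],
})

def days_late_xform(dl):
    # Same range logic as the helper in the answer above.
    if dl > 56: return 'Red'
    elif 35 < dl <= 56: return 'Amber'
    elif 14 < dl <= 35: return 'Yellow'
    elif 0 < dl <= 14: return 'White'
    else: return 'None'

df['Code'] = df['Days Late'].map(days_late_xform)

# Sum the quantity per (ID, Code) pair, then pivot Code into columns.
# fill_value=0 replaces the NaN that unstack would give for missing pairs.
totals = df.groupby(['ID', 'Code'])['Quantity'].sum().unstack(fill_value=0)
print(totals)
```

Whether a missing combination should show 0 or NaN depends on how the report is read; drop fill_value to keep NaN.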

- mtadd

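Thanks. I will look at this when I am at work today and let you know how it goes. - PrestonDocks

Can you tell me how to pivot these results? I don't think the Series produced by groupby can be pivoted. - PrestonDocks

The groupby method produces a Series with a MultiIndex. You can use unstack to pivot the innermost index level into columns, as shown in the edited answer above. - mtadd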

4

I know this is a bit late, but I ran into the same problem and wanted to share the np.digitize function. It sounds like exactly what you are looking for.

import numpy as np

a = np.random.randint(0, 100, 50)
grps = np.arange(0, 100, 10)
grps2 = [1, 20, 25, 40]
print(a)
[35 76 83 62 57 50 24  0 14 40 21  3 45 30 79 32 29 80 90 38  2 77 50 73 51
 71 29 53 76 16 93 46 14 32 44 77 24 95 48 23 26 49 32 15  2 33 17 88 26 17]

print(np.digitize(a, grps))
[ 4  8  9  7  6  6  3  1  2  5  3  1  5  4  8  4  3  9 10  4  1  8  6  8  6
  8  3  6  8  2 10  5  2  4  5  8  3 10  5  3  3  5  4  2  1  4  2  9  3  2]

print(np.digitize(a, grps2))
[3 4 4 4 4 4 2 0 1 4 2 1 4 3 4 3 3 4 4 3 1 4 4 4 4 4 3 4 4 1 4 4 1 3 4 4 2
 4 4 2 3 4 3 1 1 3 1 4 3 1]
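Applied to the ranges in the question, the same idea might look like the sketch below. The sample values and the label order are assumptions for illustration. Passing right=True makes each bin closed on the right, which matches bounds such as ">14 and <= 35".

```python
import numpy as np

# Made-up sample values for the days-late column.
days_late = np.array([60, 50, 20, 10, 14, 0])

# Bin edges taken from the question's ranges.
edges = [0, 14, 35, 56]

# Index 0 means "at or below the first edge", so it gets a placeholder label.
labels = np.array(['None', 'White', 'Yellow', 'Amber', 'Red'])

# With right=True, index i satisfies edges[i-1] < x <= edges[i].
idx = np.digitize(days_late, edges, right=True)
codes = labels[idx]
print(codes)
```

The resulting array of labels can be assigned to a new DataFrame column and then used as a groupby key.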

- choldgraf

- unutbu · Accepted Answer

Suppose you start with this data:

df = pd.DataFrame({'ID': ('STRSUB BOTDWG'.split())*4,
                   'Days Late': [60, 60, 50, 50, 20, 20, 10, 10],
                   'quantity': [56, 20, 60, 67, 74, 87, 40, 34]})
#    Days Late      ID  quantity
# 0         60  STRSUB        56
# 1         60  BOTDWG        20
# 2         50  STRSUB        60
# 3         50  BOTDWG        67
# 4         20  STRSUB        74
# 5         20  BOTDWG        87
# 6         10  STRSUB        40
# 7         10  BOTDWG        34

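Next, you can use pd.cut to find the status category. Note that by default pd.cut splits the Series df['Days Late'] into half-open intervals that are closed on the right: (-1, 14], (14, 35], (35, 56], (56, 365]. With labels=False it returns the integer index of each bin, which is then used to look up the label name: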

df['status'] = pd.cut(df['Days Late'], bins=[-1, 14, 35, 56, 365], labels=False)
labels = np.array('White Yellow Amber Red'.split())
df['status'] = labels[df['status']]
del df['Days Late']
print(df)
#        ID  quantity  status
# 0  STRSUB        56     Red
# 1  BOTDWG        20     Red
# 2  STRSUB        60   Amber
# 3  BOTDWG        67   Amber
# 4  STRSUB        74  Yellow
# 5  BOTDWG        87  Yellow
# 6  STRSUB        40   White
# 7  BOTDWG        34   White
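Since the question asked about driving this from a dict, here is one possible sketch. It assumes a hypothetical mapping (upper_bounds, not part of the original answer) from each label to the inclusive upper bound of its range, and uses np.inf as the last edge instead of 365 so that no large value falls outside the bins. The lower edge is 0 rather than -1, so a value of exactly 0 gets no label, matching the question's ">0" bound.

```python
import numpy as np
import pandas as pd

# Hypothetical mapping: label -> inclusive upper bound of its range.
# Relies on dicts preserving insertion order (Python 3.7+).
upper_bounds = {'White': 14, 'Yellow': 35, 'Amber': 56, 'Red': np.inf}

# Made-up sample values.
df = pd.DataFrame({'Days Late': [60, 50, 20, 10, 0]})

# Edges become [0, 14, 35, 56, inf]; intervals are (0, 14], (14, 35], ...
bins = [0] + list(upper_bounds.values())
df['status'] = pd.cut(df['Days Late'], bins=bins,
                      labels=list(upper_bounds.keys()))
print(df)
```

Passing the names through the labels argument yields an ordered categorical column, so the separate NumPy lookup step is not needed in this variant.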

Now use pivot to get the DataFrame in the desired format.

df = df.pivot(index='ID', columns='status', values='quantity')
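pivot only reshapes the data, and it raises an error when an (ID, status) pair occurs more than once. If the real CSV has several rows per pair and the quantities should be added up, as the question describes, pivot_table with aggfunc='sum' is one option. The sample rows below are made up for illustration:

```python
import pandas as pd

# Made-up rows in which STRSUB appears twice with status 'Red'.
df = pd.DataFrame({
    'ID': ['STRSUB', 'STRSUB', 'BOTDWG', 'BOTDWG'],
    'status': ['Red', 'Red', 'Red', 'Amber'],
    'quantity': [56, 4, 20, 67],
})

# Sum the quantity for each (ID, status) pair; missing pairs become 0.
table = df.pivot_table(index='ID', columns='status', values='quantity',
                       aggfunc='sum', fill_value=0)
print(table)
```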

Use reindex to get the desired order for the rows and columns:

df = df.reindex(columns=labels[::-1], index=df.index[::-1])

Putting it all together,

import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': ('STRSUB BOTDWG'.split())*4,
                   'Days Late': [60, 60, 50, 50, 20, 20, 10, 10],
                   'quantity': [56, 20, 60, 67, 74, 87, 40, 34]})
df['status'] = pd.cut(df['Days Late'], bins=[-1, 14, 35, 56, 365], labels=False)
labels = np.array('White Yellow Amber Red'.split())
df['status'] = labels[df['status']]
del df['Days Late']
df = df.pivot(index='ID', columns='status', values='quantity')
df = df.reindex(columns=labels[::-1], index=df.index[::-1])
print(df)

yields

        Red  Amber  Yellow  White
ID                               
STRSUB   56     60      74     40
BOTDWG   20     67      87     34