Group by two columns and compare the rows of one column

Question


I am using groupby, but I don't want to lose the other columns that are not part of the groupby. For example, I have a DataFrame:

id     date     name    item    price    unit    store
1    1/1/2020   abc    apples    200    Fruits   BigB
1    1/2/2020   abc    apples    100    Fruits   BigB
1    1/3/2020   abc    apples    250    Fruits   BigB
1    1/1/2020   abc    mangoes   350    Fruits   BigB
1    1/2/2020   abc    mangoes   150    Fruits   BigB
1    1/3/2020   abc    mangoes   50     Fruits   BigB
2    1/1/2020   xyz    apples    50     Fruits   BigB
2    1/2/2020   xyz    apples    50     Fruits   BigB

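I want to create two columns, Flag and start, based on id and name. If the price is greater than the price in the previous row, Flag is 1, otherwise 0. The comparison must be made within each combination of id, name and item. The start column depends on Flag: its initial value is the price in the first row of the group. While Flag is 0, start keeps its previous value; when Flag changes to 1, start changes to the price of that row.

The output should be: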

id     date     name    item    price    unit    store  Flag      start
1    1/1/2020   abc    apples    200    Fruits   BigB   0          200
1    1/2/2020   abc    apples    100    Fruits   BigB   0          200
1    1/3/2020   abc    apples    250    Fruits   BigB   1          250
1    1/1/2020   abc    mangoes   350    Fruits   BigB   0          350 
1    1/2/2020   abc    mangoes   150    Fruits   BigB   0          350
1    1/3/2020   abc    mangoes   50     Fruits   BigB   0          350
2    1/1/2020   xyz    apples    50     Fruits   BigB   0          50
2    1/2/2020   xyz    apples    50     Fruits   BigB   0          50

The grouping is by the id, name and item columns. Thanks in advance. df is already sorted by id, name, item and date.

- naina

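The second table is not grouped. For example, you have three rows with (id, name, item) = (1, abc, apples). - Amin Ba

Actually, I don't want a real grouping; I want to add a flag column based on these three columns. - naina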

1 Answer


- Shubham Sharma · Accepted Answer

Approach

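# m is False on the first row of each (id, name, item) group, True on later rows
m = df[['id','name','item']].duplicated()

# flag is 1 only when price rose compared with the previous row of the same group
df['flag'] = df.eval('price > price.shift() and @m').astype(int)
# keep price on group starts and flagged rows, then carry it forward
df['start'] = df['price'].where(~m | df['flag']).ffill()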
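
For anyone who wants to run this end to end, the following is a minimal self-contained sketch. It assumes the sample data from the question, rebuilt by hand, and it writes the comparison with plain pandas operators instead of df.eval; that substitution is an assumption made for the sketch, not part of the answer.

```python
import pandas as pd

# Sample data from the question, already sorted by id, name, item and date.
df = pd.DataFrame({
    "id":    [1, 1, 1, 1, 1, 1, 2, 2],
    "date":  ["1/1/2020", "1/2/2020", "1/3/2020"] * 2 + ["1/1/2020", "1/2/2020"],
    "name":  ["abc"] * 6 + ["xyz"] * 2,
    "item":  ["apples"] * 3 + ["mangoes"] * 3 + ["apples"] * 2,
    "price": [200, 100, 250, 350, 150, 50, 50, 50],
    "unit":  ["Fruits"] * 8,
    "store": ["BigB"] * 8,
})

# False on the first row of each (id, name, item) group, True afterwards.
m = df[["id", "name", "item"]].duplicated()

# Plain-operator form of the eval expression (assumed equivalent here).
df["flag"] = ((df["price"] > df["price"].shift()) & m).astype(int)

# Keep price on group starts and flagged rows, then carry it forward.
df["start"] = df["price"].where(~m | df["flag"].astype(bool)).ffill()

print(df)
```

All original columns are preserved because nothing is aggregated; the two new columns are simply appended to the frame.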

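Explanation

Use the id, name and item columns to identify duplicated rows in the DataFrame and build the boolean mask m. The mask is False on the first row of each (id, name, item) combination and True on every later row.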

>>> m

0    False
1     True
2     True
3    False
4     True
5     True
6    False
7     True
dtype: bool
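
As a small illustration of how duplicated() behaves, the sketch below uses a made-up four-row frame (the data is hypothetical, not the question's). With the default keep="first", the first occurrence of each key combination is not counted as a duplicate, so the negated mask marks where each group starts.

```python
import pandas as pd

# Hypothetical key columns, for illustration only.
keys = pd.DataFrame({
    "id":   [1, 1, 1, 2],
    "item": ["apples", "apples", "mangoes", "apples"],
})

m = keys.duplicated()      # keep="first" is the default
is_group_start = ~m        # True on the first row of each key combination

print(m.tolist())               # [False, True, False, False]
print(is_group_start.tolist())  # [True, False, True, True]
```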

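Evaluate a boolean expression on the frame that compares the price of each row with the price of the previous row and combines the result with the mask m using a logical and. The result is used to create the flag column.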

>>> df['flag']

0    0
1    0
2    1
3    0
4    0
5    0
6    0
7    0
Name: flag, dtype: int64
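
To see why the mask m is part of the expression, the sketch below compares the raw row-to-row comparison with the masked one. The price values and the mask are copied from the outputs shown in this answer; the variable name raw is an assumption for illustration. Without m, the first mangoes row (350) would be compared with the last apples row (250) and wrongly flagged.

```python
import pandas as pd

price = pd.Series([200, 100, 250, 350, 150, 50, 50, 50])
# Mask m as printed in the answer: False on the first row of each group.
m = pd.Series([False, True, True, False, True, True, False, True])

raw = price > price.shift()    # also compares across group boundaries
flag = (raw & m).astype(int)   # group starts are forced to 0

print(raw.astype(int).tolist())  # [0, 0, 1, 1, 0, 0, 0, 0]
print(flag.tolist())             # [0, 0, 1, 0, 0, 0, 0, 0]
```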

Now mask the values in the price column where the condition ~m | df['flag'] is not satisfied, then forward-fill to propagate the last kept value.

>>> df['start']

0    200.0
1    200.0
2    250.0
3    350.0
4    350.0
5    350.0
6     50.0
7     50.0
Name: start, dtype: float64
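
The float64 dtype above comes from the masking step: where() writes NaN into the masked positions, which forces an integer column to float. A minimal sketch of that step on a three-row series follows; the keep mask stands in for ~m | df['flag'], and the final cast back to integers is an optional addition, not something the answer does.

```python
import pandas as pd

price = pd.Series([200, 100, 250])
keep = pd.Series([True, False, True])   # stands in for ~m | df['flag']

masked = price.where(keep)   # 200.0, NaN, 250.0 (NaN forces float64)
start = masked.ffill()       # 200.0, 200.0, 250.0

# Optional: safe here because the first row is always kept, so no NaN remains.
start_int = start.astype("int64")

print(start_int.tolist())    # [200, 200, 250]
```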

>>> df

   id      date name     item  price    unit store  start  flag
0   1  1/1/2020  abc   apples    200  Fruits  BigB  200.0     0
1   1  1/2/2020  abc   apples    100  Fruits  BigB  200.0     0
2   1  1/3/2020  abc   apples    250  Fruits  BigB  250.0     1
3   1  1/1/2020  abc  mangoes    350  Fruits  BigB  350.0     0
4   1  1/2/2020  abc  mangoes    150  Fruits  BigB  350.0     0
5   1  1/3/2020  abc  mangoes     50  Fruits  BigB  350.0     0
6   2  1/1/2020  xyz   apples     50  Fruits  BigB   50.0     0
7   2  1/2/2020  xyz   apples     50  Fruits  BigB   50.0     0
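
The approach above relies on the rows of each (id, name, item) combination sitting next to each other, which holds here because df is sorted. If that cannot be guaranteed, one possible variant is to do the comparison and the forward fill per group with groupby. The helper below is a hypothetical sketch, not part of the answer: the name add_flag_and_start and its parameters are made up, and it still assumes that rows within each group appear in date order.

```python
import pandas as pd

def add_flag_and_start(df, keys, value="price"):
    """Hypothetical helper: per-group flag/start that keeps all columns."""
    out = df.copy()
    grouped = out.groupby(keys, sort=False)

    # diff() is NaN on the first row of each group, so gt(0) is False there.
    out["flag"] = grouped[value].diff().gt(0).astype(int)

    # Keep the value on the first row of each group and on flagged rows.
    first = grouped.cumcount().eq(0)
    kept = out[value].where(first | out["flag"].astype(bool))

    # Forward-fill inside each group only, then restore the original dtype.
    filled = kept.groupby([out[k] for k in keys], sort=False).ffill()
    out["start"] = filled.astype(out[value].dtype)
    return out

# Made-up frame with interleaved groups, for illustration only.
df = pd.DataFrame({
    "id":    [1, 1, 1, 1],
    "name":  ["abc"] * 4,
    "item":  ["apples", "mangoes", "apples", "mangoes"],
    "price": [200, 350, 250, 150],
})
result = add_flag_and_start(df, ["id", "name", "item"])
print(result)
```

Unlike the output above, this sketch returns start with the same dtype as price, since every group's first row is kept and no NaN survives the fill.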