Python DataFrame: split rows based on a custom condition?

Question

3

I have a DataFrame df with three columns: name, content, and day.

        content          day           name
    0     first_day      01-01-2017      marcus
    1     present        10-01-2017      marcus
    2     first_day      01-02-2017      marcus
    3     first_day      01-03-2017      marcus
    4     absent         05-03-2017      marcus
    5     present        20-03-2017      marcus
    6     first_day      01-04-2017      bruno
    7     present        11-04-2017      bruno
    8     first_day      01-05-2017      bruno
    9     absent         02-05-2017      bruno
    10    first_day      01-06-2017      bruno
    11    absent         02-06-2017      bruno
    12    payment        09-06-2017      bruno

For each month, I am trying to find the users who have consecutive rows of "first_day", "absent", and "present".

Expected output:

        content          day           name         absent_after_present
    0     first_day      01-01-2017      marcus         False
    1     first_day      01-02-2017      marcus         False
    2     first_day      01-03-2017      marcus         True
    3     first_day      01-04-2017      bruno          False
    4     first_day      01-05-2017      bruno          False
    5     first_day      01-06-2017      bruno          True

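Example: within the same month (March 2017), marcus has first_day, absent, and present in consecutive rows, on 01-03-2017, 05-03-2017, and 20-03-2017 respectively. So the status for marcus in that month should be True.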

- user15590480

The content field of the last row should be present rather than payment, right? Otherwise there is no matching pattern for bruno in June 2017. - SeaBean

3 Answers

1

Perhaps you can extract the month from each date and then group by name and month, as shown below.

import pandas as pd

data = pd.DataFrame({'content' : ['first_day','present', 'first_day', 'first_day', 'absent', 
'present', 'first_day', 'present', 'first_day', 'absent', 'first_day', 'absent', 'present'],
'day' : ['2017-01-01', '2017-01-10', '2017-02-01', '2017-03-01', '2017-03-05', '2017-03-20',
'2017-04-01', '2017-04-11', '2017-05-01', '2017-05-02', '2017-06-01', '2017-06-02', '2017-06-09'],
'name' : ['marcus', 'marcus', 'marcus', 'marcus', 'marcus', 'marcus', 'bruno', 'bruno', 'bruno',
'bruno', 'bruno', 'bruno', 'bruno']})

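# Parse the dates and extract the month number used as a grouping key.
data['day'] = pd.to_datetime(data['day'])
data['month'] = data['day'].dt.month

# Collect the distinct content values and the distinct days of each (name, month) group.
data_new = pd.DataFrame(data.groupby(['name', 'month'])['content'].unique()).join(pd.DataFrame(data.groupby(['name', 'month'])['day'].unique()), on=['name', 'month'])

# Flag groups with exactly three distinct content values (their order is not checked).
data_new['absent_after_present'] = data_new['content'].apply(lambda x: len(set(x)) == 3)
# Keep only the first day and the first content value of each group.
data_new['day'] = data_new['day'].apply(lambda x: x[0])
data_new['content'] = data_new['content'].apply(lambda x: x[0])

# Drop the month level so that the index holds only the name.
data_new = data_new.droplevel(1)

data_new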


name    content        day  absent_after_present

bruno   first_day   2017-04-01  False
bruno   first_day   2017-05-01  False
bruno   first_day   2017-06-01  True
marcus  first_day   2017-01-01  False
marcus  first_day   2017-02-01  False
marcus  first_day   2017-03-01  True

- Arjun Nair

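How does your code ensure that first_day, absent, and present are consecutive? Or does it only ensure that all of them are there, in any order? - SeaBean

Yes, in any order; the values are returned in the order in which they appear in the data. For example, for bruno in month 6 the order is [first_day, absent, present]. - Arjun Nair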
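
The distinction raised in these comments can be illustrated with a small sketch. The helper names `has_all_values` and `has_consecutive_pattern` are hypothetical and do not come from any answer in this thread; they only contrast "all three values occur" with "the three values occur as consecutive rows, in order".

```python
# Hypothetical helpers, for illustration only.
PATTERN = ['first_day', 'absent', 'present']

def has_all_values(values):
    """True if every pattern value occurs somewhere, in any order."""
    return set(PATTERN) <= set(values)

def has_consecutive_pattern(values):
    """True if the pattern occurs as a contiguous run, in the given order."""
    values = list(values)
    n = len(PATTERN)
    return any(values[i:i + n] == PATTERN for i in range(len(values) - n + 1))

in_order = ['first_day', 'absent', 'present']
shuffled = ['first_day', 'present', 'absent']

print(has_all_values(in_order), has_consecutive_pattern(in_order))   # True True
print(has_all_values(shuffled), has_consecutive_pattern(shuffled))   # True False
```

On the sample data both checks give the same flags, because every month that contains all three values happens to list them in order; they would differ on data such as `shuffled`.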

1

Your expected output contains True when a group has 'absent' before (not after) 'present'.

So I defined the source DataFrame as:

      content         day    name
0   first_day  01-01-2017  marcus
1     present  10-01-2017  marcus
2   first_day  01-02-2017  marcus
3   first_day  01-03-2017  marcus
4      absent  05-03-2017  marcus
5     present  20-03-2017  marcus
6   first_day  01-04-2017   bruno
7     present  11-04-2017   bruno
8   first_day  01-05-2017   bruno
9      absent  02-05-2017   bruno
10  first_day  01-06-2017   bruno
11     absent  02-06-2017   bruno
12    present  09-06-2017   bruno

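(Note the change in the last row.)

Start with:

import itertools
import pandas as pd

Then define a function that returns the first row of the source group (grp), with the value for the last (new) column appended: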

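def getRow(grp):
    # Collapse each run of consecutive equal content values into a single entry.
    lst = [k for k, g in itertools.groupby(grp.content)]
    # True when the last two runs are 'absent' followed by 'present'.
    isAbs = len(lst) > 1 and lst[-2] == 'absent' and lst[-1] == 'present'
    # Series.append was removed in pandas 2.0, so pd.concat is used here.
    return pd.concat([grp.iloc[0], pd.Series([isAbs], index=['absent_before_present'])])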

To get the expected result, run:

# Group by year-month (parsed from the day-first date strings) and by name.
result = df.groupby([pd.to_datetime(df.day, dayfirst=True)
    .dt.strftime('%Y-%m'), 'name']).apply(getRow)\
    .reset_index(drop=True)

The result is:

     content         day    name  absent_before_present
0  first_day  01-01-2017  marcus                  False
1  first_day  01-02-2017  marcus                  False
2  first_day  01-03-2017  marcus                   True
3  first_day  01-04-2017   bruno                  False
4  first_day  01-05-2017   bruno                  False
5  first_day  01-06-2017   bruno                   True

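Note that the code above actually uses two different groupby methods:

- the one from Pandas, which groups df by year, month, and name,
- the one from itertools, where each value that differs from the current one starts a new output group.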
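
As a small illustration of the second point, the sketch below shows how `itertools.groupby` collapses runs of adjacent equal values, and how it differs from a Pandas groupby when equal values are not adjacent. The sample lists are made up for this example.

```python
import itertools

# Adjacent duplicates are collapsed into one key per run.
contents = ['first_day', 'absent', 'absent', 'present', 'present']
runs = [key for key, _group in itertools.groupby(contents)]
print(runs)  # ['first_day', 'absent', 'present']

# Unlike a pandas groupby, equal values that are not adjacent
# end up in separate groups.
scattered = ['absent', 'present', 'absent']
scattered_runs = [key for key, _group in itertools.groupby(scattered)]
print(scattered_runs)  # ['absent', 'present', 'absent']
```

This is why getRow inspects only the last two entries of the collapsed list: repeated 'absent' or 'present' rows inside a month do not change the outcome.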

- Valdi_Bo

- SeaBean · Accepted Answer

Try the following:

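import pandas as pd

def pattern_in_group(s):
    # True if the group contains the three values as consecutive rows, in this order.
    s_list = s.to_list()
    for i in range(len(s_list) - 2):
        if s_list[i:i+3] == ['first_day', 'absent', 'present']:
            return True
    return False

# Parse the day-first date strings and derive a year-month key for grouping.
df['day1'] = pd.to_datetime(df['day'], dayfirst=True)
df['month'] = df['day1'].dt.to_period('M')

# Broadcast the per-group flag to every row of the group.
df['absent_after_present'] = df.groupby(['name', 'month'])['content'].transform(pattern_in_group)

# Keep the first row of each (name, month) group and drop the helper columns.
df2 = df.groupby(['name', 'month'], as_index=False).first().drop(columns=['day1', 'month'])

print(df2)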



     name    content         day  absent_after_present
0   bruno  first_day  01-04-2017                 False
1   bruno  first_day  01-05-2017                 False
2   bruno  first_day  01-06-2017                  True
3  marcus  first_day  01-01-2017                 False
4  marcus  first_day  01-02-2017                 False
5  marcus  first_day  01-03-2017                  True

Since there was a typo in the last row of your sample data, I have corrected it:

Test data construction

data = {'content': ['first_day', 'present', 'first_day', 'first_day', 'absent', 'present', 'first_day', 'present', 'first_day', 'absent', 'first_day', 'absent', 'present'], 
 'day': ['01-01-2017', '10-01-2017', '01-02-2017', '01-03-2017', '05-03-2017', '20-03-2017', '01-04-2017', '11-04-2017', '01-05-2017', '02-05-2017', '01-06-2017', '02-06-2017', '09-06-2017'],
 'name': ['marcus', 'marcus', 'marcus', 'marcus', 'marcus', 'marcus', 'bruno', 'bruno', 'bruno', 'bruno', 'bruno', 'bruno', 'bruno']}   

df = pd.DataFrame(data)

print(df)

      content         day    name
0   first_day  01-01-2017  marcus
1     present  10-01-2017  marcus
2   first_day  01-02-2017  marcus
3   first_day  01-03-2017  marcus
4      absent  05-03-2017  marcus
5     present  20-03-2017  marcus
6   first_day  01-04-2017   bruno
7     present  11-04-2017   bruno
8   first_day  01-05-2017   bruno
9      absent  02-05-2017   bruno
10  first_day  01-06-2017   bruno
11     absent  02-06-2017   bruno
12    present  09-06-2017   bruno
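
The sliding-window check in pattern_in_group could also be sketched without a Python-level loop, by comparing each row with the next two rows of the same group through `shift`. This is an alternative sketch and not the accepted answer's method; the names `month`, `starts_pattern`, and `flags` are made up for this example, and the data is the corrected sample shown above.

```python
import pandas as pd

df = pd.DataFrame({
    'content': ['first_day', 'present', 'first_day', 'first_day', 'absent',
                'present', 'first_day', 'present', 'first_day', 'absent',
                'first_day', 'absent', 'present'],
    'day': ['01-01-2017', '10-01-2017', '01-02-2017', '01-03-2017',
            '05-03-2017', '20-03-2017', '01-04-2017', '11-04-2017',
            '01-05-2017', '02-05-2017', '01-06-2017', '02-06-2017',
            '09-06-2017'],
    'name': ['marcus'] * 6 + ['bruno'] * 7,
})

# Year-month key parsed from the day-first date strings.
df['month'] = pd.to_datetime(df['day'], format='%d-%m-%Y').dt.to_period('M')

# Within each (name, month) group, look one and two rows ahead.
grouped = df.groupby(['name', 'month'])['content']
next1 = grouped.shift(-1)
next2 = grouped.shift(-2)

# A row starts the pattern if it and the next two rows match in order.
starts_pattern = (
    df['content'].eq('first_day')
    & next1.eq('absent')
    & next2.eq('present')
)

# One flag per (name, month): does any row in the group start the pattern?
flags = starts_pattern.groupby([df['name'], df['month']]).any()
print(flags)
```

Because the shift is done per group, a pattern that would span two months or two users is never matched; rows near the end of a group compare against missing values and come out False.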