Creating a column with an incrementing counter in pandas to identify sets of duplicates


I have a large DataFrame in which some columns hold identical values across rows, dup_columns = ['id', 'subject', 'topic', 'lesson', 'time'], while others are unique, ['timestamps'].

   id  subj  topic  lesson  timestamp   time   dup_ind  dup_group  time_diff
1   1  math  add    a       timestamp1  45sec  True     1          timestamp1-timestamp2
2   1  math  add    a       timestamp2  45sec  True     1          timestamp1-timestamp2
3   1  math  add    a       timestamp2  30sec  False    NaN
4   1  math  add    a       timestamp3  15sec  False    NaN
5   1  math  add    b       timestamp1  0sec   True     2          timestamp1-timestamp4
6   1  math  add    b       timestamp4  0sec   True     2          timestamp1-timestamp4
7   1  math  add    b       timestamp1  45sec  True     3          timestamp1-timestamp2
8   1  math  add    b       timestamp2  45sec  True     3          timestamp1-timestamp2

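I have a column, ['is_duplicate'], that flags duplicates based on dup_columns. I need to create another column, ['dup_group'], that uniquely identifies each set of duplicate rows by assigning a distinct group value (1, 2, 3, ...). Ultimately, I need to use dup_group to compare the timestamp values within each duplicate group (I use the .diff() method for this).
Here is the code I wrote: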
df2= df1.loc[df1['is_duplicated']==True]
def dup_counter():
    for name, group in df11.groupby(dup_columns):
        df[name, df['dupsetnew']]+=1
    return df['dupsetnew']

df11.groupby(dup_columns).apply(dup_counter)

Question 1: The function raises an error (I am new to Python and to programming).

To compute the difference between timestamps, I have the following code:

df['time_diff'] = df.loc[df.dup_indicator == 1,'event_time'].diff()

Question 2: Is .diff the right method for what I need?

1 Answer


Here is one way. Note that I have changed df['timestamp'] to an integer series to demonstrate the principle, but it can be adapted for datetime objects.

The idea is to use pd.factorize on a list of tuples to identify the groups, then apply a forward and a backward groupby.diff to get the desired result.
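
As a small illustration of these two building blocks, here is a sketch on toy data (the names keys and s are made up for this example and are not part of the question's data). pd.factorize returns a pair: integer codes numbered from 0 in order of first appearance, and the unique values, which is why the codes are taken with [0] and shifted by 1. A backward diff (periods=-1) computes "current minus next", so negating it gives the same sign as the forward diff.

```python
import pandas as pd

# Toy keys standing in for the tuples built from the duplicate columns.
keys = pd.Series(['a', 'b', 'a', 'c'])

# factorize returns (codes, uniques); codes start at 0.
codes, uniques = pd.factorize(keys)
group_ids = codes + 1  # shift so that group numbers start at 1

# Toy timestamps for a single two-row group.
s = pd.Series([1, 4])
forward = s.diff()             # current minus previous: NaN, 3.0
backward = s.diff(periods=-1)  # current minus next:     -3.0, NaN

# Fill the gap left by the forward diff with the negated backward diff.
filled = forward.fillna(-backward)

print(group_ids.tolist())  # [1, 2, 1, 3]
print(filled.tolist())     # [3.0, 3.0]
```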

df['timestamp'] = [1, 2, 2, 3, 1, 4, 1, 2]

df['dup_group'] = pd.factorize(list(zip(df['id'], df['subj'], df['topic'], 
                                        df['lesson'], df['time'])))[0] + 1

df['time_diff'] = df.groupby('dup_group')['timestamp'].transform(pd.Series.diff)

df['time_diff'] = df['time_diff'].fillna(-df.groupby('dup_group')['timestamp']\
                                            .transform(pd.Series.diff, periods=-1))

#    id  subj topic lesson  timestamp   time  dup_ind  dup_group  time_diff
# 1   1  math   add      a          1  45sec     True          1        1.0
# 2   1  math   add      a          2  45sec     True          1        1.0
# 3   1  math   add      a          2  30sec    False          2        NaN
# 4   1  math   add      a          3  15sec    False          3        NaN
# 5   1  math   add      b          1   0sec     True          4        3.0
# 6   1  math   add      b          4   0sec     True          4        3.0
# 7   1  math   add      b          1  45sec     True          5        1.0
# 8   1  math   add      b          2  45sec     True          5        1.0
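
If only rows that actually have a duplicate should receive a group number, as in the question's expected table, something along these lines may work. This is a sketch under assumptions rather than the approach above: it swaps pd.factorize for groupby(...).ngroup(), finds duplicated rows with DataFrame.duplicated(keep=False), and uses made-up datetimes in place of timestamp1..timestamp4.

```python
import pandas as pd

dup_columns = ['id', 'subj', 'topic', 'lesson', 'time']

# Hypothetical datetimes standing in for timestamp1..timestamp4.
ts = {1: '2018-01-01 10:00:00', 2: '2018-01-01 10:00:20',
      3: '2018-01-01 10:00:45', 4: '2018-01-01 10:01:30'}

df = pd.DataFrame({
    'id': [1] * 8,
    'subj': ['math'] * 8,
    'topic': ['add'] * 8,
    'lesson': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
    'time': ['45sec', '45sec', '30sec', '15sec',
             '0sec', '0sec', '45sec', '45sec'],
    'timestamp': pd.to_datetime([ts[i] for i in [1, 2, 2, 3, 1, 4, 1, 2]]),
}, index=range(1, 9))

# Mark every row that shares its dup_columns values with another row.
df['dup_ind'] = df.duplicated(subset=dup_columns, keep=False)
dups = df[df['dup_ind']]

# Number the duplicate sets 1, 2, 3, ... in order of first appearance.
# Rows outside `dups` are absent from the result and become NaN on assignment.
df['dup_group'] = dups.groupby(dup_columns, sort=False).ngroup() + 1

# Time span within each duplicate set (latest minus earliest timestamp).
g = dups.groupby(dup_columns, sort=False)['timestamp']
df['time_diff'] = g.transform('max') - g.transform('min')

print(df[['lesson', 'time', 'dup_ind', 'dup_group', 'time_diff']])
```

For groups of exactly two rows, the span equals the absolute value of what .diff() returns. For larger groups, .diff() gives differences between consecutive rows instead, so which one fits depends on what you want to compare.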

Source data

import pandas as pd
from numpy import nan

df = pd.DataFrame({'dup_group': {1: 1.0, 2: 1.0, 3: nan, 4: nan, 5: 2.0, 6: 2.0, 7: 3.0, 8: 3.0},
 'dup_ind': {1: True, 2: True, 3: False, 4: False, 5: True, 6: True, 7: True, 8: True},
 'id': {1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1},
 'lesson': {1: 'a', 2: 'a', 3: 'a', 4: 'a', 5: 'b', 6: 'b', 7: 'b', 8: 'b'},
 'subj': {1: 'math', 2: 'math', 3: 'math', 4: 'math', 5: 'math', 6: 'math', 7: 'math', 8: 'math'},
 'time': {1: '45sec', 2: '45sec', 3: '30sec', 4: '15sec', 5: '0sec', 6: '0sec', 7: '45sec', 8: '45sec'},
 'time_diff': {1: 'timestamp1-timestamp2', 2: 'timestamp1-timestamp2', 3: nan, 4: nan, 5: 'timestamp1-timestamp4', 6: 'timestamp1-timestamp4', 7: 'timestamp1-timestamp2', 8: 'timestamp1-timestamp2'},
 'timestamp': {1: 'timestamp1', 2: 'timestamp2', 3: 'timestamp2', 4: 'timestamp3', 5: 'timestamp1', 6: 'timestamp4', 7: 'timestamp1', 8: 'timestamp2'},
 'topic': {1: 'add', 2: 'add', 3: 'add', 4: 'add', 5: 'add', 6: 'add', 7: 'add', 8: 'add'}})

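I got an error on the second line of the code (pd.factorize...): ValueError: Buffer has wrong number of dimensions (expected 1, got 2). - at_ca
@at_ca, I have added my data... it is copied from what you posted. It does work. - jpp
Thank you very much! Could you explain the basic process behind factorize and why there is a [0]+1? Also, why is there a minus sign in front of df.groupby on the fourth line of the code? - at_ca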
