Replace values in a column based on n consecutive entries


I have a column of data whose values range over 0, 1 and 2. If a value does not repeat for n consecutive rows, I want to replace every instance of 1 and 2 with 0; if it does repeat for n rows, I want to keep it. For example, with n=4:

import pandas as pd

df = pd.DataFrame({'data': [1,0,1,2,0,0,0,1,0,2,2,2,2,2,2,0,1,0,1,0,2],
                   'desired': [0,0,0,0,0,0,0,0,0,2,2,2,2,2,2,0,0,0,0,0,0]})

    data  desired
0      1        0
1      0        0
2      1        0
3      2        0
4      0        0
5      0        0
6      0        0
7      1        0
8      0        0
9      2        2
10     2        2
11     2        2
12     2        2
13     2        2
14     2        2
15     0        0
16     1        0
17     0        0
18     1        0
19     0        0
20     2        0

The intended behaviour is essentially to "filter out" short-lived changes in the underlying data. I want to be able to declare how many values must occur in a row for the underlying data to be kept, including every value above and below n. (If n = 4 and there are 6 consecutive values, I want all 6, not just the 2 beyond 4.) Is there a vectorized way to do this in pandas?

3 Answers


If I understand you correctly, you can do:

import numpy as np

n = 6

# find where the value repeats n times in a row
s = df['data'].diff().eq(0).rolling(n-1).sum()

# keep the data where the mask is set, else fill with 0
df['desired'] = np.where(s.where(s>=n-1).bfill(limit=n-1).notnull(), df['data'], 0)

Output:

    data  desired
0      1        0
1      0        0
2      1        0
3      2        0
4      0        0
5      0        0
6      0        0
7      1        0
8      0        0
9      2        2
10     2        2
11     2        2
12     2        2
13     2        2
14     2        2
15     0        0
16     1        0
17     0        0
18     1        0
19     0        0
20     2        0
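To see why the rolling trick works, it can help to print the intermediate mask. This is a small sketch reusing the sample frame from the question; the variable names (`repeats`, `mask`) are illustrative, not from the answer:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'data': [1, 0, 1, 2, 0, 0, 0, 1, 0, 2, 2, 2, 2, 2, 2,
                            0, 1, 0, 1, 0, 2]})
n = 6

# diff().eq(0) flags each position whose value equals the previous one
repeats = df['data'].diff().eq(0)

# a run of length n produces n-1 consecutive True flags, so a rolling
# window of size n-1 sums to n-1 exactly at the last element of the run
s = repeats.rolling(n - 1).sum()

# s >= n-1 only at index 14, the end of the run of six 2s;
# bfill(limit=n-1) propagates that marker back over the rest of the run
mask = s.where(s >= n - 1).bfill(limit=n - 1).notnull()
print(df['data'].where(mask, 0).tolist())
```

The `limit=n-1` on the backfill is what stops the marker from leaking into the unrelated rows before the run.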


Use cumsum on the diff of the array to label the groups, then get the size of each group:

n = 4 
groups = df['data'].diff().ne(0).cumsum()
df['desired'] = df['data'].where(df.groupby(groups)['data'].transform('size').gt(n), other=0)

    data  desired
0      1        0
1      0        0
2      1        0
3      2        0
4      0        0
5      0        0
6      0        0
7      1        0
8      0        0
9      2        2
10     2        2
11     2        2
12     2        2
13     2        2
14     2        2
15     0        0
16     1        0
17     0        0
18     1        0
19     0        0
20     2        0
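The grouping step is easier to follow if you look at the run labels and sizes it produces. A minimal sketch on the sample data (the `sizes` name is illustrative):

```python
import pandas as pd

df = pd.DataFrame({'data': [1, 0, 1, 2, 0, 0, 0, 1, 0, 2, 2, 2, 2, 2, 2,
                            0, 1, 0, 1, 0, 2]})

# diff().ne(0) marks the start of every new run of equal values;
# cumsum turns those markers into a distinct label per run
groups = df['data'].diff().ne(0).cumsum()

# transform('size') broadcasts each run's length back onto its rows
sizes = df.groupby(groups)['data'].transform('size')
print(sizes.tolist())
```

Comparing `sizes` against the threshold with `gt(n)` then gives a per-row boolean mask that `where` can consume directly.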

Timing setup:


# create sample dataframe of 1 million rows
df = pd.DataFrame({'data': [1,0,1,2,0,0,0,1,0,2,2,2,2,2,2,0,1,0,1,0,2]})
dfbig = pd.concat([df]*50000, ignore_index=True)
dfbig.shape

(1050000, 1)

Erfan

%%timeit
n = 4

groups = dfbig['data'].diff().ne(0).cumsum()
dfbig['data'].where(dfbig.groupby(groups)['data'].transform('size').gt(4), other=0)

268 ms ± 15.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Quang Hoang

%%timeit
n=4
s = dfbig['data'].diff().eq(0).rolling(n-1).sum()

# fill
np.where(s.where(s>=n-1).bfill(limit=n-1).notnull(), dfbig['data'], 0)

164 ms ± 3.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

piRSquared

%%timeit
a = dfbig.data.to_numpy()
b = np.append(0, a[:-1] != a[1:]).cumsum()

dfbig.data.where(np.bincount(b)[b] >= 4, 0)

62 ms ± 735 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Conclusion:

  1. piRSquared
  2. Quang Hoang
  3. Erfan


Using bincount:

import numpy as np

a = df.data.to_numpy()
b = np.append(0, a[:-1] != a[1:]).cumsum()

df.assign(desired=df.data.where(np.bincount(b)[b] >= 4, 0))

    data  desired
0      1        0
1      0        0
2      1        0
3      2        0
4      0        0
5      0        0
6      0        0
7      1        0
8      0        0
9      2        2
10     2        2
11     2        2
12     2        2
13     2        2
14     2        2
15     0        0
16     1        0
17     0        0
18     1        0
19     0        0
20     2        0
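The NumPy version does the same run-length labelling as the groupby answer, just without pandas overhead. A sketch showing the per-row run lengths it computes (the `run_len` name is illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'data': [1, 0, 1, 2, 0, 0, 0, 1, 0, 2, 2, 2, 2, 2, 2,
                            0, 1, 0, 1, 0, 2]})
a = df.data.to_numpy()

# each place where a value differs from its predecessor starts a new run;
# prepending 0 and taking cumsum yields one integer label per run
b = np.append(0, a[:-1] != a[1:]).cumsum()

# bincount counts the rows in each run label, and fancy-indexing with b
# broadcasts that count back onto every row of its run
run_len = np.bincount(b)[b]
print(run_len.tolist())
```

`run_len >= 4` is then the keep-mask, which explains why this approach avoids both `groupby` and `rolling` and ends up fastest in the timings above.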
