Python: How to split a pandas DataFrame into subsets based on the value in the first column?

5

I have a large experiment log file (.txt) with up to 100,000 entries, structured as follows:

ROUTINE    TEMPERATURE    VOLTAGE    WAVELENGTH
_______________________________________________
CHANGE T   75             0          560
CHANGE T   80             0          560
CHANGE T   85             0          560
CHANGE T   90             0          560
OSL        75             20         570
OSL        75             20         580
OSL        75             20         590
OSL        75             20         600
CHANGE T   75             0          560
CHANGE T   80             0          560
CHANGE T   85             0          560
CHANGE T   90             0          560

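I load the log file into Python using read_table from pandas. I want to split the resulting dataframe into smaller dataframes based on the values in the first column. The result should look like this: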

**DATAFRAME 1:**    
CHANGE T   75             0          560
CHANGE T   80             0          560
CHANGE T   85             0          560
CHANGE T   90             0          560

**DATAFRAME 2:** 
OSL        75             20         570
OSL        75             20         580
OSL        75             20         590
OSL        75             20         600

**DATAFRAME 3:** 
CHANGE T   75             0          560
CHANGE T   80             0          560
CHANGE T   85             0          560
CHANGE T   90             0          560

First, I tried splitting them using the indices at which the value of the first column changes:

indexSplit = []  # list containing the boundary indices

prevRoutine = log['ROUTINE'][0]  # log is the complete dataframe
i = 1
while i < len(log):
    if prevRoutine != log['ROUTINE'][i]:
        indexSplit.append(i)  # row i starts a new block
    prevRoutine = log['ROUTINE'][i]
    i += 1  # advance to the next row; without this the loop never terminates

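However, given the size of the log file, this takes a lot of time (obviously). Is there an elegant way to do this with pandas? The problem I keep running into is that the same value in the first column is used in more than one series, so I always end up with dataframe 1 and dataframe 3 merged into a single one.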

1 Answer

6
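You can use a list comprehension that loops over a groupby object. The groups are created by comparing the column with its shifted copy using ne (same as != but faster) and then taking the cumsum of the result: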
s = df['ROUTINE'].ne(df['ROUTINE'].shift()).cumsum()
print (s)
0     1
1     1
2     1
3     1
4     2
5     2
6     2
7     2
8     3
9     3
10    3
11    3
Name: ROUTINE, dtype: int32
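
To make the one-liner above easier to follow, here is a minimal sketch that splits it into its three steps. The sample frame is rebuilt by hand from the log excerpt in the question, and the names `previous`, `changed` and `run_id` are only illustrative, not part of the original answer.

```python
import pandas as pd

# Sample data copied from the log excerpt in the question.
df = pd.DataFrame({
    "ROUTINE": ["CHANGE T"] * 4 + ["OSL"] * 4 + ["CHANGE T"] * 4,
    "TEMPERATURE": [75, 80, 85, 90, 75, 75, 75, 75, 75, 80, 85, 90],
    "VOLTAGE": [0] * 4 + [20] * 4 + [0] * 4,
    "WAVELENGTH": [560] * 4 + [570, 580, 590, 600] + [560] * 4,
})

# Step 1: the value from the row above (NaN for the very first row).
previous = df["ROUTINE"].shift()

# Step 2: True wherever a row differs from the row above, i.e. a new run starts.
# The first row compares against NaN, so it is also marked True.
changed = df["ROUTINE"].ne(previous)

# Step 3: a running count of run starts gives every consecutive run its own label.
run_id = changed.cumsum()

print(pd.DataFrame({"ROUTINE": df["ROUTINE"], "changed": changed, "run_id": run_id}))
```

Because the label increases every time the value changes, the second `CHANGE T` block gets label 3 rather than sharing label 1 with the first block.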

dfs = [g for i,g in df.groupby(df['ROUTINE'].ne(df['ROUTINE'].shift()).cumsum())]
print (dfs)
[    ROUTINE  TEMPERATURE  VOLTAGE  WAVELENGTH
0  CHANGE T           75        0         560
1  CHANGE T           80        0         560
2  CHANGE T           85        0         560
3  CHANGE T           90        0         560,   ROUTINE  TEMPERATURE  VOLTAGE  WAVELENGTH
4     OSL           75       20         570
5     OSL           75       20         580
6     OSL           75       20         590
7     OSL           75       20         600,      ROUTINE  TEMPERATURE  VOLTAGE  WAVELENGTH
8   CHANGE T           75        0         560
9   CHANGE T           80        0         560
10  CHANGE T           85        0         560
11  CHANGE T           90        0         560]

print (dfs[0])
    ROUTINE  TEMPERATURE  VOLTAGE  WAVELENGTH
0  CHANGE T           75        0         560
1  CHANGE T           80        0         560
2  CHANGE T           85        0         560
3  CHANGE T           90        0         560

print (dfs[1])
  ROUTINE  TEMPERATURE  VOLTAGE  WAVELENGTH
4     OSL           75       20         570
5     OSL           75       20         580
6     OSL           75       20         590
7     OSL           75       20         600

print (dfs[2])
     ROUTINE  TEMPERATURE  VOLTAGE  WAVELENGTH
8   CHANGE T           75        0         560
9   CHANGE T           80        0         560
10  CHANGE T           85        0         560
11  CHANGE T           90        0         560

The solution is more involved because grouping by the first column alone yields only 2 groups:
dfs = [g for i,g in df.groupby('ROUTINE')]
print (dfs)
[     ROUTINE  TEMPERATURE  VOLTAGE  WAVELENGTH
0   CHANGE T           75        0         560
1   CHANGE T           80        0         560
2   CHANGE T           85        0         560
3   CHANGE T           90        0         560
8   CHANGE T           75        0         560
9   CHANGE T           80        0         560
10  CHANGE T           85        0         560
11  CHANGE T           90        0         560,   ROUTINE  TEMPERATURE  VOLTAGE  WAVELENGTH
4     OSL           75       20         570
5     OSL           75       20         580
6     OSL           75       20         590
7     OSL           75       20         600]
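
For reuse, the two approaches can be put side by side in a small helper. This is only a sketch: the function name `split_consecutive` is hypothetical, and the sample frame is reconstructed from the question's excerpt rather than read from the real log file.

```python
import pandas as pd


def split_consecutive(frame, column):
    """Split `frame` into a list of sub-frames, one per consecutive run of equal values in `column`."""
    # Label each run: the counter increases whenever the value differs from the row above.
    run_id = frame[column].ne(frame[column].shift()).cumsum()
    # groupby sorts by the run label, so the sub-frames come back in file order.
    return [group for _, group in frame.groupby(run_id)]


# Sample data copied from the log excerpt in the question.
df = pd.DataFrame({
    "ROUTINE": ["CHANGE T"] * 4 + ["OSL"] * 4 + ["CHANGE T"] * 4,
    "TEMPERATURE": [75, 80, 85, 90, 75, 75, 75, 75, 75, 80, 85, 90],
    "VOLTAGE": [0] * 4 + [20] * 4 + [0] * 4,
    "WAVELENGTH": [560] * 4 + [570, 580, 590, 600] + [560] * 4,
})

by_run = split_consecutive(df, "ROUTINE")            # one frame per consecutive block
by_value = [g for _, g in df.groupby("ROUTINE")]     # one frame per distinct value

print(len(by_run), len(by_value))
```

Each sub-frame keeps its original row index (for example 4 to 7 for the `OSL` block); call `reset_index(drop=True)` on a piece if a fresh index is preferred. Note that missing values in the column never compare equal to each other, so every NaN row would end up in its own group.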

What does pandas.DataFrame.ne do? - MMF
Sorry, give me a moment. - jezrael
`In [470]: %timeit (df['a'].ne(df['a'].shift()).cumsum())` `1 loop, best of 3: 481 ms per loop`
`In [471]: %timeit ((df['a'] != df['a'].shift()).cumsum())` `1 loop, best of 3: 1.56 s per loop`
`In [472]: %timeit (df['b'].ne(df['b'].shift()).cumsum())` `1 loop, best of 3: 209 ms per loop`
`In [473]: %timeit ((df['b'] != df['b'].shift()).cumsum())` `1 loop, best of 3: 210 ms per loop`
for 10M - jezrael
Thank you very much, this is exactly what I needed, and it runs really fast! - David VdH
