Reshaping a pandas DataFrame and manipulating columns

Question

I have a dataset:

dat = {'Block': ['blk_-105450231192318816', 'blk_-1076549517733373559', 'blk_-1187723472581877455', 'blk_-1385756122847916710',  'blk_-1470784088028862059'], 'Seq': ['13 13 13 15',' 15 13 13', '13 13 15', '13 13 15 13', '13'], 'Time' : ['1257712532.0 1257712532.0 1257712532.0 1257712532.0','1257712533.0 1257712534.0 1257712534.0','1257712533.0 1257712533.0 1257712533.0','1257712532.0 1257712532.0 1257712532.0 1257712534.0','1257712535.0']}

import pandas as pd

df = pd.DataFrame(data=dat)

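Block is an id. Seq is an id. Time is a Unix timestamp.

I want to change the columns or create new ones.

1) I need to combine the Seq and Time columns element by element, matching the elements of the two columns by position.

2) After that, I want the differences within the Time column (next element minus previous element), with the first element set to zero.

Finally, I want to write to a file the rows that belong to different blocks but share the same Seq id. I would like to solve this with pandas methods.

I tried to solve it with dictionaries, but that approach is rather complicated.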

# assumes `import numpy as np`
dict_block = dict((key, []) for key in np.unique(df.Block))
for idx, row in enumerate(df.Seq):
    block = df.Block[idx]
    # split() without arguments also copes with the leading space in ' 15 13 13'
    seq_ids = row.split()
    times = df.Time[idx].split()
    dict_seq = dict((key, []) for key in np.unique(seq_ids))
    for idy, key in enumerate(seq_ids):
        dict_seq[key].append(times[idy])
    dict_block[block].append(dict_seq)

1) For example:

blk_-105450231192318816 : 
    13: 1257712532.0, 1257712532.0, 1257712532.0
    15: 1257712532.0

2) For example:

blk_-105450231192318816 : 
    13: 0, (1257712532.0 - 1257712532.0) = 0, (1257712532.0 - 1257712532.0) = 0
    15: 0

Output of the dictionary attempt:

{'blk_-105450231192318816': 
[{'13': ['1257712532.0', '1257712532.0','1257712532.0'],
'15': ['1257712532.0']}],
'blk_-1076549517733373559': 
[{'13': ['1257712534.0', '1257712534.0'],
'15': ['1257712533.0']}],
'blk_-1187723472581877455': 
[{'13': ['1257712533.0', '1257712533.0'],
'15': ['1257712533.0']}],
'blk_-1385756122847916710': 
[{'13': ['1257712532.0',
'1257712532.0',
'1257712534.0'],
'15': ['1257712532.0']}],
'blk_-1470784088028862059': 
[{'13': ['1257712535.0']}]}

Summary:

I would like to solve the following with pandas/numpy methods:

1) Group the columns

2) Get the time differences (t1 - t0)

Looking forward to your reply :)

- savchart

1 Answer

- Valentino · Accepted Answer

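Solution 1: using dictionaries

If you like working with dictionaries, you can use apply with custom functions that do the dictionary work inside them.

df is the sample dataframe you provided. Here I wrote two functions. I hope the code is clear enough.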

def grouping(x):
    """Make a dictionary combining 'Seq' and 'Time' columns.

    'Seq' elements are the keys, 'Time' are the values. 'Time' elements
    corresponding to the same key are stored in a list.
    """
    #splitting the string and make it numeric
    keys = list(map(int, x['Seq'].split()))
    times = list(map(float, x['Time'].split()))

    #building the result dictionary.
    res = {}
    for i, k in enumerate(keys):
        try:
            res[k].append(times[i])
        except KeyError:
            res[k] = [times[i]]

    return res    


def timediffs(x):
    """Make a dictionary starting from 'GroupedSeq' column, which can
    be created with the grouping function.

    It contains the difference between the times of each key.
    """
    ddt = x['GroupedSeq']
    res = {}
    #iterating over the dictionary to calculate the differences.
    for k, v in ddt.items():
        res[k] = [0.0] + [t1 - t0 for t0, t1 in zip(v[:-1], v[1:])]
    return res  

df['GroupedSeq'] = df.apply(grouping, axis=1)
df['difftimes'] = df.apply(timediffs, axis=1)

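With axis=1, apply calls the function on each row. The result is stored in a new column of the dataframe. df now contains two new columns; if you want, you can drop the original 'Seq' and 'Time' columns with: df.drop(['Seq', 'Time'], axis=1, inplace=True). In the end, df looks like this: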

                      Block                                         GroupedSeq                         difftimes
0   blk_-105450231192318816  {13: [1257712532.0, 1257712532.0, 1257712532.0...  {13: [0.0, 0.0, 0.0], 15: [0.0]}
1  blk_-1076549517733373559  {15: [1257712533.0], 13: [1257712534.0, 125771...       {15: [0.0], 13: [0.0, 0.0]}
2  blk_-1187723472581877455  {13: [1257712533.0, 1257712533.0], 15: [125771...       {13: [0.0, 0.0], 15: [0.0]}
3  blk_-1385756122847916710  {13: [1257712532.0, 1257712532.0, 1257712534.0...  {13: [0.0, 0.0, 2.0], 15: [0.0]}
4  blk_-1470784088028862059                               {13: [1257712535.0]}                       {13: [0.0]}

As you can see, pandas itself is used here only to apply the custom functions; inside those functions, plain Python code does the work.
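For comparison, here is a more compact sketch of the same dictionary idea. It is not the code from the answer: the names group_times, diff_times and demo are made up for illustration, and it uses collections.defaultdict in place of the try/except KeyError pattern. It runs on a single sample row so the result is easy to check by hand.

```python
from collections import defaultdict

import pandas as pd


def group_times(seq: str, time: str) -> dict:
    """Map each Seq id to the list of its timestamps, in order of appearance."""
    grouped = defaultdict(list)
    # zip pairs the i-th Seq id with the i-th timestamp
    for key, t in zip(seq.split(), time.split()):
        grouped[int(key)].append(float(t))
    return dict(grouped)


def diff_times(grouped: dict) -> dict:
    """Per key: 0.0 for the first timestamp, then successive differences."""
    return {k: [0.0] + [b - a for a, b in zip(v, v[1:])] for k, v in grouped.items()}


# One row taken from the sample data in the question (hypothetical variable name).
demo = pd.DataFrame({
    "Block": ["blk_-1385756122847916710"],
    "Seq": ["13 13 15 13"],
    "Time": ["1257712532.0 1257712532.0 1257712532.0 1257712534.0"],
})
demo["GroupedSeq"] = [group_times(s, t) for s, t in zip(demo["Seq"], demo["Time"])]
demo["difftimes"] = demo["GroupedSeq"].map(diff_times)

print(demo.loc[0, "difftimes"])  # {13: [0.0, 0.0, 2.0], 15: [0.0]}
```

The list comprehension over zip(demo["Seq"], demo["Time"]) plays the role of apply(..., axis=1); either way the per-row work is ordinary Python.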

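Solution 2: no dictionaries, more pandas

Pandas itself is not very useful when you store lists or dictionaries inside a DataFrame. So I propose an alternative that avoids dictionaries. I use groupby combined with apply to operate on selected rows based on their values.
groupby selects subsamples of the dataframe based on the values of one or more columns: all rows with the same values in those columns are grouped together, and a method or action is then executed on each subsample.

Again, df is the sample dataframe you provided.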
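To make the grouping behaviour concrete, here is a small sketch on made-up data (the toy frame and its values are invented for illustration and are not from the question). It shows that an operation such as diff is computed separately inside every (Block, Seq) group.

```python
import pandas as pd

# Hypothetical toy data: two blocks, already in "one row per event" form.
toy = pd.DataFrame({
    "Block": ["a", "a", "a", "b", "b"],
    "Seq": [13, 13, 15, 13, 13],
    "Time": [10.0, 12.0, 12.0, 20.0, 25.0],
})

# Rows sharing the same (Block, Seq) pair form one group.
sizes = toy.groupby(["Block", "Seq"]).size()

# diff() runs inside each group, so differences never cross a group boundary.
# The first row of every group has no predecessor and yields NaN, hence fillna.
toy["tdiff"] = toy.groupby(["Block", "Seq"])["Time"].diff().fillna(0.0)

print(sizes)
print(toy)
```

Here the second row gets tdiff 2.0 (12.0 - 10.0 within group a/13) and the last row gets 5.0 (25.0 - 20.0 within group b/13); every other row is the first of its group and gets 0.0.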

df1 = df.copy() #working on a copy, not really needed but I wanted to preserve the original

##splitting the string and make it a numeric list using apply
df1['Seq'] = df1['Seq'].apply(lambda x : list(map(int, x.split())))
df1['Time'] = df1['Time'].apply(lambda x : list(map(float, x.split())))

#for each index in 'Block', unnest the list in 'Seq' making it a secondary index.
#each group holds exactly one row, so .iloc[0] extracts that row's list
df2 = df1.groupby('Block').apply(lambda x : pd.DataFrame([[e] for e in x['Time'].iloc[0]], index=x['Seq'].iloc[0]))
#resetting index and renaming column names created by pandas
df2 = df2.reset_index().rename(columns={'level_1':'Seq', 0:'Time'})

#custom method to store the differences between times.
def timediffs(x):
    x['tdiff'] = x['Time'].diff().fillna(0.0)
    return x

df3 = df2.groupby(['Block', 'Seq']).apply(timediffs)

The final df3 is:

                       Block      Seq          Time  tdiff
0    blk_-105450231192318816       13  1.257713e+09    0.0
1    blk_-105450231192318816       13  1.257713e+09    0.0
2    blk_-105450231192318816       13  1.257713e+09    0.0
3    blk_-105450231192318816       15  1.257713e+09    0.0
4   blk_-1076549517733373559       15  1.257713e+09    0.0
5   blk_-1076549517733373559       13  1.257713e+09    0.0
6   blk_-1076549517733373559       13  1.257713e+09    0.0
7   blk_-1187723472581877455       13  1.257713e+09    0.0
8   blk_-1187723472581877455       13  1.257713e+09    0.0
9   blk_-1187723472581877455       15  1.257713e+09    0.0
10  blk_-1385756122847916710       13  1.257713e+09    0.0
11  blk_-1385756122847916710       13  1.257713e+09    0.0
12  blk_-1385756122847916710       15  1.257713e+09    0.0
13  blk_-1385756122847916710       13  1.257713e+09    2.0
14  blk_-1470784088028862059       13  1.257713e+09    0.0

As you can see, there are no dictionaries in the dataframe. Values in the 'Block' and 'Seq' columns are repeated, but that is unavoidable.
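As a possible variation on Solution 2, the same long-format table can be sketched with str.split, DataFrame.explode and a grouped diff, which avoids the two groupby/apply calls. This is an alternative technique, not the code from the answer. It assumes pandas 1.3 or newer (needed to explode two columns at once), works on two rows of the sample data, and uses made-up names (long_df, out_dir, the seq_<id>.csv file names). The last step is one guess at the "write rows with the same Seq id to a file" requirement from the question: one CSV per Seq id, written to a temporary directory.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Two rows of the sample data from the question.
dat = {
    "Block": ["blk_-105450231192318816", "blk_-1385756122847916710"],
    "Seq": ["13 13 13 15", "13 13 15 13"],
    "Time": [
        "1257712532.0 1257712532.0 1257712532.0 1257712532.0",
        "1257712532.0 1257712532.0 1257712532.0 1257712534.0",
    ],
}
df = pd.DataFrame(dat)

# Split the strings into lists, then give every list element its own row.
# Exploding both columns together keeps Seq and Time paired by position.
long_df = (
    df.assign(Seq=df["Seq"].str.split(), Time=df["Time"].str.split())
      .explode(["Seq", "Time"], ignore_index=True)
      .astype({"Seq": int, "Time": float})
)

# Time difference to the previous row of the same (Block, Seq) group;
# the first row of each group has no predecessor and is set to 0.0.
long_df["tdiff"] = long_df.groupby(["Block", "Seq"])["Time"].diff().fillna(0.0)

# Hypothetical output layout: one CSV file per Seq id, mixing all blocks.
out_dir = Path(tempfile.mkdtemp())
for seq_id, rows in long_df.groupby("Seq"):
    rows.to_csv(out_dir / f"seq_{seq_id}.csv", index=False)

print(long_df)
```

With this layout, each file holds rows from different blocks that share one Seq id, and the tdiff column is already computed per block.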