Reordering a subset of columns in a grouped DataFrame with pandas

Question


python pandas multiple-columns swap multi-index

4

I have some forecast data grouped by month. The original DataFrame looks something like this:

>>clean_table_grouped[0:5]
       STYLE    COLOR    SIZE   FOR
MONTH                           01/17    10/16   11/16    12/16
    0 #######   ######   ####   0.0      15.0    15.0     15.0
    1 #######   ######   ####   0.0      15.0    15.0     15.0
    2 #######   ######   ####   0.0      15.0    15.0     15.0
    3 #######   ######   ####   0.0      15.0    15.0     15.0
    4 #######   ######   ####   0.0      15.0    15.0     15.0

>>clean_table_grouped.ix[0:,"FOR"][0:5] 
 MONTH  01/17  10/16  11/16  12/16
0        0.0   15.0   15.0   15.0
1        0.0   15.0   15.0   15.0
2        0.0   15.0   15.0   15.0
3        0.0   15.0   15.0   15.0
4        0.0   15.0   15.0   15.0

I want to reorder only these 4 columns, as follows:

(keeping the rest of the DataFrame unchanged)

MONTH    10/16  11/16  12/16  01/17
0        15.0   15.0   15.0   0.0
1        15.0   15.0   15.0   0.0
2        15.0   15.0   15.0   0.0
3        15.0   15.0   15.0   0.0
4        15.0   15.0   15.0   0.0

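My attempted solution was to reorder the columns of the subset, following this post: How to change the order of DataFrame columns?

I did this by getting the list of columns and sorting it first.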

 >>for_cols = clean_table_grouped.ix[:,"FOR"].columns.tolist()
 >>for_cols.sort(key = lambda x: x[0:2])   #sort by month ascending
 >>for_cols.sort(key = lambda x: x[-2:])   #then sort by year ascending
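
The two-pass sort above relies on Python's sort being stable: sorting by month first and then by year leaves the months in order within each year. Below is a minimal sketch of that idea, checked against a single-pass sort that parses each label as a date. The sample labels come from the table above; the variable names `two_pass` and `one_pass` are illustrative only.

```python
from datetime import datetime

months = ['01/17', '10/16', '11/16', '12/16']

# Two stable passes: month first, then year.
two_pass = list(months)
two_pass.sort(key=lambda x: x[0:2])   # sort by month ascending
two_pass.sort(key=lambda x: x[-2:])   # then sort by year ascending

# One pass: parse each 'MM/YY' label as a date and sort on that.
one_pass = sorted(months, key=lambda x: datetime.strptime(x, '%m/%y'))

print(two_pass)
print(one_pass)
```

Both approaches should give the same order for well-formed 'MM/YY' labels; the date-parsing key has the advantage of failing loudly on a malformed label.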

Querying the DataFrame works fine:

>>clean_table_grouped.ix[0:,"FOR"][for_cols]
MONTH   10/16   11/16  12/16  01/17
0        15.0    15.0    15.0    0.0
1        15.0    15.0    15.0    0.0
2        15.0    15.0    15.0    0.0
3        15.0    15.0    15.0    0.0
4        15.0    15.0    15.0    0.0

However, when I try to set the values in the original table, I get a table of NaN:

>>clean_table_grouped.ix[0:,"FOR"] = clean_table_grouped.ix[0:,"FOR"][for_cols]
>>clean_table_grouped.ix[0:,"FOR"]
MONTH  01/17  10/16  11/16  12/16
0        NaN    NaN    NaN    NaN
1        NaN    NaN    NaN    NaN
2        NaN    NaN    NaN    NaN
3        NaN    NaN    NaN    NaN
4        NaN    NaN    NaN    NaN
5        NaN    NaN    NaN    NaN

I also tried zipping to avoid the chained syntax (.ix[][]). This avoids the NaN, but it does not change the DataFrame -__-

>>for_cols = zip(["FOR", "FOR", "FOR", "FOR"], for_cols)
>>clean_table_grouped.ix[0:,"FOR"] = clean_table_grouped.ix[0:,for_cols]
>>clean_table_grouped.ix[0:,"FOR"]
 MONTH  01/17  10/16  11/16  12/16
 0        0.0   15.0   15.0   15.0
 1        0.0   15.0   15.0   15.0
 2        0.0   15.0   15.0   15.0
 3        0.0   15.0   15.0   15.0
 4        0.0   15.0   15.0   15.0

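I realize I am using ix to reassign values. However, I have used this technique on non-grouped DataFrames before and it worked perfectly.

If this question has already been answered (clearly) in another post, please provide a link. I searched but could not find anything similar.

Edit: I have found a solution: reindex manually by creating a new MultiIndex DataFrame with the columns in the order you want. I have posted the solution below.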

- xdzzz

What is the structure of your original DataFrame? - juanpa.arrivillaga

2 Answers

0

My solution is based on the second answer in this post: How to reorder MultiIndex DataFrame columns at a specific level

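Basically, just create a new DataFrame with the desired MultiIndex. Assigning values with .ix, .loc, or .iloc does not reorder the columns of a MultiIndex DataFrame, most likely because pandas aligns the assigned values by column label. If you want to change the values of a subset of columns entirely (not just swap them), Nickil's solution is the way to go. However, if you only want to swap columns, the approach below works fine. I selected this answer as the best because it worked better for me: I had other data grouped by month besides "FOR", and it gave me more flexibility in reordering columns.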
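
To illustrate why assignment does not move columns, here is a small sketch on a flat (non-MultiIndex) DataFrame. It assumes the label-alignment behavior described in the pandas indexing documentation; the frame `df` and its columns `A` and `B` are made up for the example and are not the data from the question.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [10, 20]})
before = df.copy()

# Assignment aligns the right-hand side on column labels first,
# so 'A' still receives A's values and 'B' still receives B's.
df.loc[:, ['B', 'A']] = df[['A', 'B']]
unchanged = bool((df.values == before.values).all()) and list(df.columns) == ['A', 'B']

# Selecting with a list returns a new frame in the requested order.
reordered = df[['B', 'A']]

print(unchanged)
print(list(reordered.columns))
```

The takeaway is that column order is a property of the column index, so it is changed by selecting or reindexing into a new frame rather than by writing values back into the old one.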

First, store the lists in the order you want:

>>reindex_list = ['STYLE','COLOR','SIZE','FOR'] #desired order
>>month_list = clean_table_grouped["FOR"].columns.tolist() #.ix was removed in pandas 1.0; plain selection is equivalent here
>>month_list.sort(key = lambda x: x[0:2]) #sort by month ascending
>>month_list.sort(key = lambda x: x[-2:]) #sort by year ascending

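Then create a list of tuples in which STYLE, COLOR, and SIZE are each paired with '' and 'FOR' is paired with each month, like this:

[('STYLE',''),('COLOR',''),..., ('FOR','10/16'), ('FOR','11/16'), ...]

Here is a loop that builds it automatically: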

>>zip_list = []
>>for i in reindex_list:
      if i == 'FOR':
          for j in month_list:
              if j != '':
                  zip_list.append((i, j))   #pair 'FOR' with each month
      else:
          zip_list.append((i, ''))          #other columns get an empty second level

Then create a MultiIndex from the list of tuples you just built:

>>multi_cols = pd.MultiIndex.from_tuples(zip_list, names=['','MONTH'])
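
The steps so far can be run on their own as a sketch. It assumes the month labels are already in the desired order; `month_list` is hard-coded here rather than read from the real DataFrame, which is not available.

```python
import pandas as pd

reindex_list = ['STYLE', 'COLOR', 'SIZE', 'FOR']      # desired top-level order
month_list = ['10/16', '11/16', '12/16', '01/17']     # assumed already sorted

zip_list = []
for name in reindex_list:
    if name == 'FOR':
        # One (top, month) tuple per forecast month.
        zip_list.extend((name, month) for month in month_list)
    else:
        # Non-forecast columns carry an empty second level.
        zip_list.append((name, ''))

multi_cols = pd.MultiIndex.from_tuples(zip_list, names=['', 'MONTH'])
print(multi_cols.tolist())
```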

Finally, create a new DataFrame from the old one using the new MultiIndex:

>>clean_table_grouped_ordered = pd.DataFrame(clean_table_grouped, columns=multi_cols)
>>clean_table_grouped_ordered[0:5]
       STYLE COLOR SIZE FOR
 MONTH                  10/16   11/16   12/16  01/17
       ####  ####  ###  15.0    15.0    15.0    0.0
       ####  ####  ###  15.0    15.0    15.0    0.0
       ####  ####  ###  15.0    15.0    15.0    0.0
       ####  ####  ###  15.0    15.0    15.0    0.0
       ####  ####  ###  15.0    15.0    15.0    0.0
       ####  ####  ###  15.0    15.0    15.0    0.0
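
For reference, a self-contained sketch of the same reordering on made-up data. It uses `DataFrame.reindex(columns=...)` instead of the `pd.DataFrame(old, columns=...)` call above, which should be equivalent for this purpose; the frame `toy` and its values are invented for illustration.

```python
import pandas as pd

old_cols = pd.MultiIndex.from_tuples(
    [('STYLE', ''), ('COLOR', ''), ('SIZE', ''),
     ('FOR', '01/17'), ('FOR', '10/16'), ('FOR', '11/16'), ('FOR', '12/16')],
    names=['', 'MONTH'])
toy = pd.DataFrame(
    [['s1', 'red', 'M', 0.0, 15.0, 15.0, 15.0],
     ['s2', 'blue', 'L', 0.0, 15.0, 15.0, 15.0]],
    columns=old_cols)

new_cols = pd.MultiIndex.from_tuples(
    [('STYLE', ''), ('COLOR', ''), ('SIZE', ''),
     ('FOR', '10/16'), ('FOR', '11/16'), ('FOR', '12/16'), ('FOR', '01/17')],
    names=['', 'MONTH'])

# reindex returns a new frame whose columns follow new_cols;
# values travel with their labels, so nothing is overwritten.
ordered = toy.reindex(columns=new_cols)
print(ordered)
```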

- xdzzz


- Nickil Maveli · Accepted Answer

Sort the column names that contain date strings and use the sorted list as a subset to return the columns in that order:

from datetime import datetime
df[sorted(df.columns, key=lambda x: datetime.strptime(x, '%m/%y'))]

Toy data:

from datetime import datetime
import numpy as np
import pandas as pd
np.random.seed(42)

cols = [['STYLE', 'COLOR', 'SIZE', 'FOR', 'FOR', 'FOR', 'FOR'],
        ['', '', '', '01/17', '10/16', '11/16', '12/16']]
tups = list(zip(*cols))
index = pd.MultiIndex.from_tuples(tups, names=[None, 'MONTH'])
clean_table_grouped = pd.DataFrame(np.random.randint(0, 100, (100, 7)), 
                                   index=np.arange(100), columns=index)
clean_table_grouped = clean_table_grouped.head()
clean_table_grouped

Split the multi-indexed DF into two parts: one containing the forecast values and the other containing the rest of the DF.

for_df = clean_table_grouped[['FOR']]
clean_table_grouped = clean_table_grouped.drop(['FOR'], axis=1, level=0)

Forecast DF:

for_df

Remaining DF:

clean_table_grouped

Following the same steps as in the pre-edit post, sort the columns of the forecast DF.

order = sorted(for_df['FOR'].columns.tolist(), key=lambda x: datetime.strptime(x, '%m/%y'))

Subset by the sorted list of columns to produce the DF in that order.

for_df = for_df['FOR'][order]

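Wrap the forecast DF in pd.concat with keys=['FOR'] to put the 'FOR' level back on top of the columns, recreating the multi-level column index.

for_df = pd.concat([for_df], axis=1, keys=['FOR'])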

Finally, join the two parts on their common index.

clean_table_grouped.join(for_df)
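
Put together, the steps of this answer might look like the following sketch. The helper name `reorder_for_block` and its parameters `top` and `fmt` are hypothetical, introduced only for illustration, and the toy values come from `np.arange` rather than real forecasts.

```python
from datetime import datetime

import numpy as np
import pandas as pd


def reorder_for_block(df, top='FOR', fmt='%m/%y'):
    """Return df with the columns under `top` sorted chronologically."""
    block = df[top]                                  # inner (month) level only
    rest = df.drop([top], axis=1, level=0)           # everything except `top`
    order = sorted(block.columns, key=lambda x: datetime.strptime(x, fmt))
    # Re-attach the top level so the columns are a MultiIndex again.
    block = pd.concat([block[order]], axis=1, keys=[top])
    return rest.join(block)


cols = pd.MultiIndex.from_tuples(
    [('STYLE', ''), ('COLOR', ''), ('SIZE', ''),
     ('FOR', '01/17'), ('FOR', '10/16'), ('FOR', '11/16'), ('FOR', '12/16')],
    names=[None, 'MONTH'])
toy = pd.DataFrame(np.arange(14).reshape(2, 7), columns=cols)

result = reorder_for_block(toy)
print(result)
```

Note that this places the reordered block after the remaining columns, so it fits the layout in the question, where FOR is already the last group.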