Converting data from wide to long format with pandas


There are many questions with similar titles, but none of them helped me solve the problem with my dataset.

Dataset:

ID   Country Type Region Gender IA01_Raw  IA01_Class1  IA01_Class2 IA02_Raw IA02_Class1 IA02_Class2 QA_Include QA_Comments

SC1  France  A    Europe Male   4         8            1            J         4            1           yes       N/A
SC2  France  A    Europe Female 2         7            2            Q         6            4           yes       N/A
SC3  France  B    Europe Male   3         7            2            K         8            2           yes       N/A
SC4  France  A    Europe Male   4         8            2            A         2            1           yes       N/A
SC5  France  B    Europe Male   1         7            1            F         1            3           yes       N/A
ID6  France  A    Europe Male   2         8            1            R         3            7           yes       N/A
ID7  France  B    Europe Male   2         8            1            Q         4            6           yes       N/A
UC8  France  B    Europe Male   4         8            2            P         4            2           yes       N/A

Desired output:

ID   Country Type Region Gender IA Raw Class1 Class2 QA_Include QA_Comments

SC1  France  A    Europe Male   01 K   8      1      yes        N/A
SC1  France  A    Europe Male   01 L   8      1      yes       N/A
SC1  France  A    Europe Male   01 P   8      1      yes       N/A
SC1  France  A    Europe Male   02 Q   8      1      yes       N/A
SC1  France  A    Europe Male   02 R   8      1      yes       N/A
SC1  France  A    Europe Male   02 T   8      1      yes       N/A
SC1  France  A    Europe Male   03 G   8      1      yes       N/A
SC1  France  A    Europe Male   03 R   8      1      yes       N/A
SC1  France  A    Europe Male   03 G   8      1      yes       N/A
SC1  France  A    Europe Male   04 K   8      1      yes       N/A
SC1  France  A    Europe Male   04 A   8      1      yes       N/A
SC1  France  A    Europe Male   04 P   8      1      yes       N/A
SC1  France  A    Europe Male   05 R   8      1      yes       N/A
....

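The dataset has columns named IA[X]_NAME, where X = 1..9 and NAME = Raw, Class1, or Class2.

My goal is to convert these columns into rows so that the result looks like the desired output table: IA holds the X value, and Raw, Class1, and Class2 hold their respective values.

To do this, I sliced the columns as follows: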

idVars = list(excel_df_final.columns[0:40]) + list(excel_df_final.columns[472:527]) #These contain columns like ID, Country, Type etc
valueVars = excel_df_final.columns[41:472].tolist() #All the IA_ columns

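I don't know whether this step is necessary, but it gives me exactly the column slices I need. However, when I pass them to melt, it does not work as expected. I have tried almost every approach from the other questions.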

pd.melt(excel_df_final, id_vars=idVars,value_vars=valueVars)
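For context, melt on its own produces one (column name, value) pair per row, so the Raw and Class values end up stacked in a single column rather than side by side. A minimal sketch of the extra steps, assuming the IA columns follow the IA[X]_NAME pattern described above (the small frame and the names `id_cols`, `melted`, and `melt_out` are made up for illustration):

```python
import pandas as pd

# Hypothetical miniature of the real data: two id columns, two IA groups.
df = pd.DataFrame({
    "ID": ["SC1", "SC2"],
    "Country": ["France", "France"],
    "IA01_Raw": [4, 2],
    "IA01_Class1": [8, 7],
    "IA02_Raw": ["J", "Q"],
    "IA02_Class1": [4, 6],
})
id_cols = ["ID", "Country"]

# Step 1: melt every non-id column into (column, value) pairs.
melted = df.melt(id_vars=id_cols, var_name="column", value_name="value")

# Step 2: split "IA01_Raw" into the IA number and the field name.
melted[["IA", "field"]] = melted["column"].str.split("_", n=1, expand=True)
melted["IA"] = melted["IA"].str[2:]  # "IA01" -> "01"

# Step 3: move the field names back into columns.
melt_out = (
    melted.set_index(id_cols + ["IA", "field"])["value"]
    .unstack("field")
    .reset_index()
)
melt_out.columns.name = None
print(melt_out)
```

The result has one row per ID and IA number, with Raw and Class1 as separate columns.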

I also tried this:

excel_df_final.set_index(idVars)[41:472].unstack()

But that did not work either. Here is my wide-to-long attempt, which also did not work:

pd.wide_to_long(excel_df_final, stubnames = ['IA', 'Raw', 'Class1', 'Class2'], i=idVars, j=valueVars)

I get this error when using the wide-to-long approach:

ValueError: operands could not be broadcast together with shapes (95,) (431,)

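My dataset actually has 526 columns, so I split them into two lists: one with 95 column names to use as i, and the remaining 431 columns, which need to be shown as rows as in the sample dataset.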
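A note on the wide_to_long call: its `j` argument is documented as a single string naming the new suffix column, not a list of columns, and each stub name is expected to be the prefix of the wide columns. Here the number sits in the prefix (IA01_Raw), so one option is to rename the columns to Raw_01 first. A minimal sketch under those assumptions (the small frame, the regex, and the names `renamed` and `w2l_out` are illustrative, not part of the original data):

```python
import re
import pandas as pd

# Hypothetical miniature of the real data.
df = pd.DataFrame({
    "ID": ["SC1", "SC2"],
    "Country": ["France", "France"],
    "IA01_Raw": [4, 2],
    "IA01_Class1": [8, 7],
    "IA02_Raw": ["J", "Q"],
    "IA02_Class1": [4, 6],
    "QA_Include": ["yes", "yes"],
})

# Turn "IA01_Raw" into "Raw_01" so that the stub comes first and the number is the suffix.
renamed = df.rename(columns=lambda c: re.sub(r"^IA(\d+)_(\w+)$", r"\2_\1", c))

w2l_out = pd.wide_to_long(
    renamed,
    stubnames=["Raw", "Class1"],  # prefixes of the renamed columns
    i="ID",                       # must uniquely identify each row
    j="IA",                       # name of the new column holding the suffix
    sep="_",
    suffix=r"\d+",
).reset_index()
print(w2l_out)
```

Columns that match no stub (Country, QA_Include) are carried along as identifiers. Note that pandas may convert a numeric suffix such as "01" to the integer 1.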


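You are not transposing the dataset, so don't call it a "transpose". A transpose is df.T. - cs95
Yes, but I basically want to convert some columns into rows, and I couldn't think of another word. - Chaudhry Talha
Wide to long... - cs95
@cs95 Thanks for the correct term. I have updated the question accordingly and also tried implementing it, but it did not work. Please take a look and see whether you can help me solve this. - Chaudhry Talha
This is not easy to answer, because you expect a solution for your actual data with 500 columns. You cannot share a 500-column dataset on this site and reasonably expect anyone to work out a solution for it. From a learning perspective, it would benefit you more, and be easier for us (we are unpaid volunteers), if you provided a simplified version with no more than 5 columns and explained your problem more clearly. See how to provide a [mcve]. - cs95
Totally agree. The dataset I shared is only a sample of the 500+ columns, which I created in an Excel sheet to experiment with. If I can get a solution for this sample, I can adapt it to the larger dataset. Take a look; I have shortened it now. - Chaudhry Talha
2 Answers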

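Here is a good starting point. The key is to use set_index, convert the columns to a MultiIndex, and then use stack. There may be better solutions, but I would do it this way because it reaches the output in a few simple steps.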
# Set the index with columns that we don't want to "transpose"
df2 = df.set_index([
   'ID', 'Country', 'Type', 'Region', 'Gender', 'QA_Include', 'QA_Comments'])
# Convert headers to MultiIndex -- this is so we can melt IA values
df2.columns = pd.MultiIndex.from_tuples(map(tuple, df2.columns.str.split('_')))
# Call stack to replicate data, then reset the index
out =  df2.stack(level=0).reset_index().rename({'level_7': 'IA'}, axis=1)

out

     ID Country Type  Region  Gender QA_Include  QA_Comments    IA  Class1  Class2 Raw
0   SC1  France    A  Europe    Male        yes          NaN  IA01       8       1   4
1   SC1  France    A  Europe    Male        yes          NaN  IA02       4       1   J
2   SC2  France    A  Europe  Female        yes          NaN  IA01       7       2   2
3   SC2  France    A  Europe  Female        yes          NaN  IA02       6       4   Q
4   SC3  France    B  Europe    Male        yes          NaN  IA01       7       2   3
5   SC3  France    B  Europe    Male        yes          NaN  IA02       8       2   K
6   SC4  France    A  Europe    Male        yes          NaN  IA01       8       2   4
7   SC4  France    A  Europe    Male        yes          NaN  IA02       2       1   A
8   SC5  France    B  Europe    Male        yes          NaN  IA01       7       1   1
9   SC5  France    B  Europe    Male        yes          NaN  IA02       1       3   F
10  ID6  France    A  Europe    Male        yes          NaN  IA01       8       1   2
11  ID6  France    A  Europe    Male        yes          NaN  IA02       3       7   R
12  ID7  France    B  Europe    Male        yes          NaN  IA01       8       1   2
13  ID7  France    B  Europe    Male        yes          NaN  IA02       4       6   Q
14  UC8  France    B  Europe    Male        yes          NaN  IA01       8       2   4
15  UC8  France    B  Europe    Male        yes          NaN  IA02       4       2   P

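@anky_91 It is hard to see how this approach scales to 500 columns, but I decided to give it a try because the OP made an effort to ask a decent question and also responded to comments... so here goes :) - cs95
@cs95 Thank you so much. I tried it and it works well, with one exception, which is fine since I did not mention it in the question. Each "IA" column has another "_" part attached to it, e.g. "IA02_Raw_baseline", "IA02_Raw_midline", and so on. I have updated the sample sheet at https://docs.google.com/spreadsheets/d/11HUR8y30CL7sZxsadK2LibwNbN0G6RcwtTl8SAcszHs/edit?usp=sharing and would really appreciate it if you could edit your answer accordingly :) - Chaudhry Talha
@ChaudhryTalha You should use str.split('_', n=1) and then proceed as before. - cs95
@cs95 You are awesome :D Thank you very much. - Chaudhry Talha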
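To illustrate the comment above: with n=1 only the first underscore is used as a split point, so "IA02_Raw_baseline" becomes ("IA02", "Raw_baseline") and the MultiIndex still has exactly two levels. A minimal sketch, assuming three-part column names like the ones mentioned in the comments (the small frame and the names `split_df` and `split_out` are made up for illustration):

```python
import pandas as pd

# Hypothetical miniature with three-part column names.
df = pd.DataFrame({
    "ID": ["SC1", "SC2"],
    "Country": ["France", "France"],
    "IA01_Raw_baseline": [4, 2],
    "IA01_Class1_baseline": [8, 7],
    "IA02_Raw_baseline": ["J", "Q"],
    "IA02_Class1_baseline": [4, 6],
})

split_df = df.set_index(["ID", "Country"])

# Split on the first "_" only: "IA01_Raw_baseline" -> ("IA01", "Raw_baseline").
pairs = [tuple(parts) for parts in split_df.columns.str.split("_", n=1)]
split_df.columns = pd.MultiIndex.from_tuples(pairs, names=["IA", None])

# Stack the IA level into the rows, as in the answer above.
split_out = split_df.stack(level="IA").reset_index()
print(split_out)
```

Naming the level "IA" up front avoids the later rename of 'level_7'. Recent pandas versions may emit a FutureWarning about the stack implementation.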

You can use pd.lreshape:
pd.lreshape(df.assign(IA01=['01']*len(df), IA02=['02']*len(df),IA09=['09']*len(df)), 
            {'IA': ['IA01', 'IA02','IA09'],
             'Raw': ['IA01_Raw','IA02_Raw','IA09_Raw'], 
             'Class1': ['IA01_Class1','IA02_Class1','IA09_Class1'], 
             'Class2': ['IA01_Class2', 'IA02_Class2','IA09_Class2']
             })


Edit:

pd.lreshape(df.assign(IA01=['01']*len(df), IA02=['02']*len(df),IA09=['09']*len(df)), 
            {'IA': ['IA01', 'IA02','IA09'],
             'Raw': ['IA01_Raw_baseline','IA02_Raw_midline','IA09_Raw_whatever'], 
             'Class1': ['IA01_Class1_baseline','IA02_Class1_midline','IA09_Class1_whatever'], 
             'Class2': ['IA01_Class2_baseline', 'IA02_Class2_midline','IA09_Class2_whatever']
             })

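Edit: just add the names of any columns you want carried from the input to the output to the lists inside the dictionary.

This function is undocumented; use help(pd.lreshape) to see how it works.

Output: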

    Country Gender  ID  QA_Comments QA_Include  Region  Type    IA  Raw Class1  Class2
0   France  Male    SC1 NaN         yes         Europe  A       01  4   8       1
1   France  Female  SC2 NaN         yes         Europe  A       01  2   7       2
2   France  Male    SC3 NaN         yes         Europe  B       01  3   7       2
3   France  Male    SC4 NaN         yes         Europe  A       01  4   8       2
4   France  Male    SC5 NaN         yes         Europe  B       01  1   7       1
5   France  Male    ID6 NaN         yes         Europe  A       01  2   8       1
6   France  Male    ID7 NaN         yes         Europe  B       01  2   8       1
7   France  Male    UC8 NaN         yes         Europe  B       01  4   8       2
8   France  Male    SC1 NaN         yes         Europe  A       02  J   4       1
9   France  Female  SC2 NaN         yes         Europe  A       02  Q   6       4
10  France  Male    SC3 NaN         yes         Europe  B       02  K   8       2
11  France  Male    SC4 NaN         yes         Europe  A       02  A   2       1
12  France  Male    SC5 NaN         yes         Europe  B       02  F   1       3
13  France  Male    ID6 NaN         yes         Europe  A       02  R   3       7
14  France  Male    ID7 NaN         yes         Europe  B       02  Q   4       6
15  France  Male    UC8 NaN         yes         Europe  B       02  P   4       2
16  France  Male    SC1 NaN         yes         Europe  A       09  W   6       3
17  France  Female  SC2 NaN         yes         Europe  A       09  X   5       2
18  France  Male    SC3 NaN         yes         Europe  B       09  Y   5       5
19  France  Male    SC4 NaN         yes         Europe  A       09  P   5       2
20  France  Male    SC5 NaN         yes         Europe  B       09  T   5       2
21  France  Male    ID6 NaN         yes         Europe  A       09  I   5       2
22  France  Male    ID7 NaN         yes         Europe  B       09  A   8       2
23  France  Male    UC8 NaN         yes         Europe  B       09  K   7       5
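With hundreds of columns, typing the dictionary by hand is impractical. One possible way to build it programmatically is sketched below, assuming every IA group has the same set of fields (the small frame, the `fields` list, and the names `ia_ids`, `tagged`, `groups`, and `lr_out` are illustrative):

```python
import pandas as pd

# Hypothetical miniature of the real data.
df = pd.DataFrame({
    "ID": ["SC1", "SC2"],
    "Country": ["France", "France"],
    "IA01_Raw": [4, 2],
    "IA01_Class1": [8, 7],
    "IA02_Raw": ["J", "Q"],
    "IA02_Class1": [4, 6],
})
fields = ["Raw", "Class1"]  # assumed to exist for every IA group

# Collect the distinct IA prefixes, e.g. ["IA01", "IA02"].
ia_ids = sorted({c.split("_")[0] for c in df.columns if c.startswith("IA")})

# Add one constant column per group holding its number ("01", "02", ...).
tagged = df.assign(**{ia: ia[2:] for ia in ia_ids})

# Build the groups dictionary: target column -> list of source columns.
groups = {"IA": ia_ids}
for field in fields:
    groups[field] = [f"{ia}_{field}" for ia in ia_ids]

lr_out = pd.lreshape(tagged, groups)
print(lr_out)
```

All lists in the dictionary must have the same length, and rows with missing values in the reshaped columns are dropped unless dropna=False is passed.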

Please see the sample dataset I made for the data and the expected result: https://docs.google.com/spreadsheets/d/11HUR8y30CL7sZxsadK2LibwNbN0G6RcwtTl8SAcszHs/edit?usp=sharing and update your answer. Thanks a lot! - Chaudhry Talha
Updated, assuming your "IA09_Raw" column is named "IA09_Raw_whatever". - Shijith
