DataFrame transformation in Python Pandas

4

I am trying to transform a Pandas DataFrame into a new DataFrame in which each item in a particular column gets its own row. For example:

Before:

   ID             Name        Date   Location
0   0       John, Dave  01/01/1992     Mexico
1   1              Tim  07/07/1997  Australia
2   2       Mike, John  12/24/2012     Zambia
3   3  Bob, Rick, Tony  05/17/2007       Cuba
4   4            Roger  04/05/2000    Iceland
5   5           Carlos  05/24/1995       Guam
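
To make the snippets in this thread reproducible, the frame above could be built as follows. This is a minimal sketch: the variable name `df` matches the code in the thread, and storing the dates as plain strings is an assumption, since the question does not say what dtype the Date column has.

```python
import pandas as pd

# Sample data copied from the "Before" table; dates are kept as plain strings (an assumption).
df = pd.DataFrame({
    'ID': [0, 1, 2, 3, 4, 5],
    'Name': ['John, Dave', 'Tim', 'Mike, John', 'Bob, Rick, Tony', 'Roger', 'Carlos'],
    'Date': ['01/01/1992', '07/07/1997', '12/24/2012',
             '05/17/2007', '04/05/2000', '05/24/1995'],
    'Location': ['Mexico', 'Australia', 'Zambia', 'Cuba', 'Iceland', 'Guam'],
})
print(df)
```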

Current solution:

import pandas as pd

frames = []  # DataFrame.append was removed in pandas 2.0, so collect rows and concatenate once
for index, row in df.iterrows():
    new_row = pd.DataFrame(df.loc[index]).transpose()
    target_info = df.loc[index, 'Name']
    if len(target_info.split(',')) > 1:
        for item in target_info.split(','):
            # strip() removes the space left after each comma
            new_row.loc[index, 'Name'] = item.strip()
            frames.append(new_row.copy())
    else:
        frames.append(new_row)
new_df = pd.concat(frames)

After:

  ID    Name        Date   Location
0  0    John  01/01/1992     Mexico
0  0    Dave  01/01/1992     Mexico
1  1     Tim  07/07/1997  Australia
2  2    Mike  12/24/2012     Zambia
2  2    John  12/24/2012     Zambia
3  3     Bob  05/17/2007       Cuba
3  3    Rick  05/17/2007       Cuba
3  3    Tony  05/17/2007       Cuba
4  4   Roger  04/05/2000    Iceland
5  5  Carlos  05/24/1995       Guam

Surely there is a more elegant solution?
2 Answers

2
You can split the names into a Series, drop the existing Name column, and then join the split names back in.
# Split the 'Name' column as a Series, setting the appropriate name and index.
split_names = df['Name'].str.split(', ', expand=True).stack()
split_names.name = 'Name'
split_names.index = split_names.index.get_level_values(0)

# Drop the existing 'Name' column, and join the split names.
df.drop('Name', axis=1, inplace=True)
df = df.join(split_names)

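The resulting output is the same as in your example, except that the Name column comes last. If you want the original order, you can reorder the columns.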

   ID        Date   Location    Name
0   0  01/01/1992     Mexico    John
0   0  01/01/1992     Mexico    Dave
1   1  07/07/1997  Australia     Tim
2   2  12/24/2012     Zambia    Mike
2   2  12/24/2012     Zambia    John
3   3  05/17/2007       Cuba     Bob
3   3  05/17/2007       Cuba    Rick
3   3  05/17/2007       Cuba    Tony
4   4  04/05/2000    Iceland   Roger
5   5  05/24/1995       Guam  Carlos
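
Restoring the original column order could look like the sketch below. It repeats the split/stack/join steps on a shortened, three-row copy of the sample data so that it runs on its own. The names `original_columns` and `result` are illustrative, and the `dropna()` call is a defensive addition not present in the answer, in case `stack()` keeps missing values in the pandas version in use.

```python
import pandas as pd

# Shortened copy of the question's sample data.
df = pd.DataFrame({
    'ID': [0, 1, 2],
    'Name': ['John, Dave', 'Tim', 'Bob, Rick, Tony'],
    'Date': ['01/01/1992', '07/07/1997', '05/17/2007'],
    'Location': ['Mexico', 'Australia', 'Cuba'],
})
original_columns = list(df.columns)  # remember the order before 'Name' is dropped

# Same approach as above: split, stack, and realign to the original row index.
split_names = df['Name'].str.split(', ', expand=True).stack().dropna()
split_names.name = 'Name'
split_names.index = split_names.index.get_level_values(0)

# Non-in-place drop, then join; selecting by the saved list restores the order.
result = df.drop('Name', axis=1).join(split_names)
result = result[original_columns]
print(result)
```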

1
You can do it this way:
nm = df.Name.str.split(r',\s*', expand=True)
cols=list(set(df.columns) - set(['Name']))

pd.melt(df[cols].join(nm),
        id_vars=cols,
        value_vars=nm.columns.tolist(),
        value_name='Name') \
  .dropna() \
  .drop(['variable'], axis=1) \
  .sort_values('ID')

Step by step:
In [128]: nm = df.Name.str.split(r',\s*', expand=True)

In [129]: nm
Out[129]:
        0     1     2
0    John  Dave  None
1     Tim  None  None
2    Mike  John  None
3     Bob  Rick  Tony
4   Roger  None  None
5  Carlos  None  None

In [130]: cols=list(set(df.columns) - set(['Name']))

In [131]: cols
Out[131]: ['Date', 'ID', 'Location']

In [133]: pd.melt(df[cols].join(nm),
   .....:         id_vars=cols,
   .....:         value_vars=nm.columns.tolist(),
   .....:         value_name='Name') \
   .....:   .dropna() \
   .....:   .drop(['variable'], axis=1) \
   .....:   .sort_values('ID')
Out[133]:
          Date  ID   Location    Name
0   01/01/1992   0     Mexico    John
6   01/01/1992   0     Mexico    Dave
1   07/07/1997   1  Australia     Tim
2   12/24/2012   2     Zambia    Mike
8   12/24/2012   2     Zambia    John
3   05/17/2007   3       Cuba     Bob
9   05/17/2007   3       Cuba    Rick
15  05/17/2007   3       Cuba    Tony
4   04/05/2000   4    Iceland   Roger
5   05/24/1995   5       Guam  Carlos
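
For comparison, pandas 0.25 and later provide `DataFrame.explode`, which neither answer in this thread uses. The sketch below is an alternative under that version assumption, not the method of either answer. It runs on a shortened copy of the sample data, and the name `result` is illustrative.

```python
import pandas as pd

# Shortened copy of the question's sample data.
df = pd.DataFrame({
    'ID': [0, 1, 2],
    'Name': ['John, Dave', 'Tim', 'Bob, Rick, Tony'],
    'Date': ['01/01/1992', '07/07/1997', '05/17/2007'],
    'Location': ['Mexico', 'Australia', 'Cuba'],
})

# Turn each comma-separated string into a list, then give every list element its own row.
result = df.assign(Name=df['Name'].str.split(r',\s*')).explode('Name')
print(result)
```

This keeps the original column order and repeats the original index for each split row, as in the "After" table of the question.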
