Merge columns from different rows into a single row, grouped by a specific column

3
I have the following DataFrame:
df1 = pd.DataFrame(
    {   
        "day":     ["monday", "monday","Tuesday" ],
        "column0": ["xx",      "xx",     ""],
        "column1": ["yy",      "aa",    "bb"],
        "column2": ["cc",      "cc",    "cc"],
        "column3": ["cc",      "",      "aa"]})


    day    column0  column1 column2 column3
0   monday  xx       yy       cc      cc
1   monday  xx       aa       cc    
2   Tuesday          bb       cc      aa

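I want to group by day and concatenate the columns of each group into a single row, keeping the original row numbers in an index column.
Expected result 1: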
df1 = pd.DataFrame(
    {   
        "day":     ["monday", "Tuesday" ],
        "index":   ["0,1",          "2" ],
        "column0": ["xx",             ""],
        "column1": ["yy",           "bb"],
        "column2": ["cc",           "cc"],
        "column3": ["cc",           "aa"],
        "column4": ["xx",             ""],
        "column5": ["aa",             ""],
        "column6": ["cc",             ""]})

    day   index column0 column1 column2 column3 column4 column5 column6
0   monday  0,1   xx       yy     cc      cc      xx      aa    cc
1   Tuesday 2              bb     cc      aa            

Finally, I want to remove values that are repeated within each row and put NAN in the blank cells.
Final result:
df1 = pd.DataFrame(
    {   
        "day":     ["monday", "Tuesday" ],
        "index":   ["0,1",          "2" ],
        "column0": ["xx",          "NAN"],
        "column1": ["yy",           "bb"],
        "column2": ["cc",           "cc"],
        "column3": ["NAN",          "aa"],
        "column4": ["aa",          "NAN"]})

    day   index column0 column1 column2  column3    column4
0   monday  0,1   xx      yy          cc    NAN       aa
1   Tuesday 2    NAN      bb          cc    aa        NAN

Any ideas?
1 Answer

1
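You can use numpy to flatten each group of the grouped DataFrame, store the flattened arrays in a list, and build a DataFrame from that list.
Finally, replace "" and None with NaN, drop the columns that are entirely NaN, and rename the columns: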
import pandas as pd
import numpy as np

df1 = pd.DataFrame(
    {   
        "day":     ["monday", "monday","Tuesday" ],
        "column0": ["xx",      "xx",     ""],
        "column1": ["yy",      "aa",    "bb"],
        "column2": ["cc",      "cc",    "cc"],
        "column3": ["cc",      "",      "aa"]})

arr_list = []
for d, sub_df in df1.groupby("day"):
  arr = list(np.array(sub_df.iloc[:,1:]).flatten())
  arr = [d, list(sub_df.index)] + arr
  arr_list.append(arr)

df = pd.DataFrame(arr_list)
df = df.replace('',np.nan).fillna(value=np.nan).dropna(axis=1, how='all')
df.columns = ["day", "index"] + [f"column{i}" for i in range(len(df.columns)-2)]
print(df)

Output:

       day   index column0 column1 column2 column3 column4 column5 column6
0  Tuesday     [2]     NaN      bb      cc      aa     NaN     NaN     NaN
1   monday  [0, 1]      xx      yy      cc      cc      xx      aa      cc
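
To make the flattening step more concrete, here is a minimal sketch that looks at a single group in isolation. It reuses df1 from the question; the names monday and flat are only illustrative and do not appear in the answer. Because the groups have different numbers of rows, the flattened lists have different lengths, and building a DataFrame from them leaves missing values in the shorter rows, which is why the answer cleans them up afterwards.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(
    {
        "day":     ["monday", "monday", "Tuesday"],
        "column0": ["xx", "xx", ""],
        "column1": ["yy", "aa", "bb"],
        "column2": ["cc", "cc", "cc"],
        "column3": ["cc", "", "aa"],
    }
)

# Select the rows of one group and drop the first column ("day").
monday = df1[df1["day"] == "monday"].iloc[:, 1:]

# flatten() walks the 2-D array row by row, so all values of row 0
# come first, followed by all values of row 1.
flat = list(np.array(monday).flatten())
print(flat)
```

The "monday" group has two rows of four values each, so the flattened list holds eight values, while "Tuesday" would only yield four.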
Edit: if you want to remove duplicates within each row, do that right after flattening the array.

You can also pass sort=False to groupby to keep the original order:

import pandas as pd
import numpy as np

df1 = pd.DataFrame(
    {   
        "day":     ["monday", "monday","Tuesday" ],
        "column0": ["xx",      "xx",     ""],
        "column1": ["yy",      "aa",    "bb"],
        "column2": ["cc",      "cc",    "cc"],
        "column3": ["cc",      "",      "aa"]})

arr_list = []
for d, sub_df in df1.groupby("day", sort=False):
  # flattening the grouped dataframe ([:,1:] => all rows, all columns except the first one: day)
  arr = list(np.array(sub_df.iloc[:,1:]).flatten())
  # removing duplicates for this row:
  arr_unique = []
  for x in arr:
    if x not in arr_unique:
      arr_unique.append(x)
    else: # appending NaN to keep dataframe form
      arr_unique.append(np.nan)
  # re-appending day and adding the indexes of the grouped rows:
  arr = [d, list(sub_df.index)] + arr_unique
  arr_list.append(arr)

df = pd.DataFrame(arr_list)
# replacing '' with NaN and dropping NaN columns:
df = df.replace('',np.nan).fillna(value=np.nan).dropna(axis=1, how='all')
# renaming columns, the first two are 'day' and 'index' the rest is generated: columnX where X goes from 0 to the nb of column minus 2 (since we already named two columns)
df.columns = ["day", "index"] + [f"column{i}" for i in range(len(df.columns)-2)]
print(df)

Output:

       day   index column0 column1 column2 column3 column4
0   monday  [0, 1]      xx      yy      cc     NaN      aa
1  Tuesday     [2]     NaN      bb      cc      aa     NaN
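
The question asks for the index as a string such as "0,1", whereas the code above stores a list such as [0, 1]. As a rough sketch, the same steps could be wrapped in a helper that joins the row labels with commas. The function name merge_rows_by_group and its parameters are hypothetical, not part of pandas or of the answer above. Note one deliberate difference: this variant treats empty strings as missing up front instead of replacing them at the end.

```python
import numpy as np
import pandas as pd


def merge_rows_by_group(df, key):
    """Hypothetical helper: one output row per group, repeated values blanked out."""
    rows = []
    for name, sub_df in df.groupby(key, sort=False):
        # Flatten every column except the grouping key, row by row.
        values = sub_df.drop(columns=key).to_numpy().flatten()
        seen = set()
        unique = []
        for value in values:
            if value == "" or value in seen:
                # Keep the position but mark it as missing.
                unique.append(np.nan)
            else:
                seen.add(value)
                unique.append(value)
        # Join the original row labels into a string such as "0,1".
        index_label = ",".join(str(i) for i in sub_df.index)
        rows.append([name, index_label] + unique)

    out = pd.DataFrame(rows)
    # Drop positions that are missing in every group, then rename.
    out = out.dropna(axis=1, how="all")
    out.columns = [key, "index"] + [f"column{i}" for i in range(out.shape[1] - 2)]
    return out


df1 = pd.DataFrame(
    {
        "day":     ["monday", "monday", "Tuesday"],
        "column0": ["xx", "xx", ""],
        "column1": ["yy", "aa", "bb"],
        "column2": ["cc", "cc", "cc"],
        "column3": ["cc", "", "aa"],
    }
)

result = merge_rows_by_group(df1, "day")
print(result)
```

Using a set for the values already seen keeps the duplicate check cheap for wide groups; the list-based check in the answer behaves the same way on this small example.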

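Is it possible to keep the "order" in the final result as in my example? So Monday first, then Tuesday. - xavi
If you pass sort=False to groupby, it works: for d, sub_df in df1.groupby("day", sort=False):. I am editing my answer with this change right now. - Tranbi
In your edit, how did you convert it into a DataFrame? - xavi
In the code of the edit, yes. It is really just the loop; the conversion to a DataFrame happens afterwards (see the first code snippet). - Tranbi
I don't quite understand what you need. In this example, since we group by day, Tuesday only appears once. Please try creating a new question with a detailed description of the problem (including a minimal reproducible example, the input and the expected output). - Tranbi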
