Merge columns from different rows into a single row, grouped by a specific column

Question

python python-3.x pandas dataframe pandas-groupby

I have the following dataframe:

df1 = pd.DataFrame(
    {   
        "day":     ["monday", "monday","Tuesday" ],
        "column0": ["xx",      "xx",     ""],
        "column1": ["yy",      "aa",    "bb"],
        "column2": ["cc",      "cc",    "cc"],
        "column3": ["cc",      "",      "aa"]})


    day    column0  column1 column2 column3
0   monday  xx       yy       cc      cc
1   monday  xx       aa       cc    
2   Tuesday          bb       cc      aa

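I want to group by day and concatenate the columns of the grouped rows into a single row, while keeping the original row indexes in an index column.

Expected result 1: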

df1 = pd.DataFrame(
    {   
        "day":     ["monday", "Tuesday" ],
        "index":   ["0,1",          "2" ],
        "column0": ["xx",             ""],
        "column1": ["yy",           "bb"],
        "column2": ["cc",           "cc"],
        "column3": ["cc",           "aa"],
        "column4": ["xx",             ""],
        "column5": ["aa",             ""],
        "column6": ["cc",             ""]})

    day   index column0 column1 column2 column3 column4 column5 column6
0   monday  0,1   xx       yy     cc      cc      xx      aa    cc
1   Tuesday 2              bb     cc      aa

Ultimately, I would like to drop the values that repeat within each row and put NaN in the empty cells.

Final result:

df1 = pd.DataFrame(
    {   
        "day":     ["monday", "Tuesday" ],
        "index":   ["0,1",          "2" ],
        "column0": ["xx",          "NAN"],
        "column1": ["yy",           "bb"],
        "column2": ["cc",           "cc"],
        "column3": ["NAN",          "aa"],
        "column4": ["aa",          "NAN"]})

    day   index column0 column1 column2  column3    column4
0   monday  0,1   xx      yy          cc    NAN       aa
1   Tuesday 2    NAN      bb          cc    aa        NAN

Any ideas?

- xavi

1 Answer

- Tranbi · Accepted Answer

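You can use numpy to flatten each grouped sub-dataframe, store the flattened rows in a list, and build a new dataframe from that list.

Finally, replace "" and None with NaN, drop the columns that contain only NaN, and rename the columns: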

import pandas as pd
import numpy as np

df1 = pd.DataFrame(
    {   
        "day":     ["monday", "monday","Tuesday" ],
        "column0": ["xx",      "xx",     ""],
        "column1": ["yy",      "aa",    "bb"],
        "column2": ["cc",      "cc",    "cc"],
        "column3": ["cc",      "",      "aa"]})

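arr_list = []
for d, sub_df in df1.groupby("day"):
  # flatten all columns except the first one (day) into a single list, row by row
  arr = list(np.array(sub_df.iloc[:,1:]).flatten())
  # prepend the day and the indexes of the grouped rows
  arr = [d, list(sub_df.index)] + arr
  arr_list.append(arr)

# rows of different lengths are padded with missing values
df = pd.DataFrame(arr_list)
# replace '' and None with NaN, then drop the columns that are entirely NaN
df = df.replace('',np.nan).fillna(value=np.nan).dropna(axis=1, how='all')
# name the first two columns, then number the rest column0, column1, ...
df.columns = ["day", "index"] + [f"column{i}" for i in range(len(df.columns)-2)]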
print(df)

Output:

       day   index column0 column1 column2 column3 column4 column5 column6
0  Tuesday     [2]     NaN      bb      cc      aa     NaN     NaN     NaN
1   monday  [0, 1]      xx      yy      cc      cc      xx      aa      cc
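
The output above lists Tuesday before monday because groupby sorts the group keys by default, and string sorting places uppercase letters before lowercase ones. The following minimal sketch illustrates the difference; the `days` frame and its `value` column are made up for this illustration and are not part of the original data:

```python
import pandas as pd

# Hypothetical toy frame, used only to show the order of the group keys.
days = pd.DataFrame({"day": ["monday", "monday", "Tuesday"], "value": [1, 2, 3]})

# Default behaviour: group keys are sorted, so "Tuesday" comes first.
sorted_keys = [key for key, _ in days.groupby("day")]

# sort=False keeps the keys in order of first appearance.
original_keys = [key for key, _ in days.groupby("day", sort=False)]

print(sorted_keys)    # ['Tuesday', 'monday']
print(original_keys)  # ['monday', 'Tuesday']
```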

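Edit: if you want to remove duplicates within each row, do so right after flattening the array.

You can also pass sort=False to groupby to preserve the original order of the days: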

import pandas as pd
import numpy as np

df1 = pd.DataFrame(
    {   
        "day":     ["monday", "monday","Tuesday" ],
        "column0": ["xx",      "xx",     ""],
        "column1": ["yy",      "aa",    "bb"],
        "column2": ["cc",      "cc",    "cc"],
        "column3": ["cc",      "",      "aa"]})

arr_list = []
for d, sub_df in df1.groupby("day", sort=False):
  # flattening the grouped dataframe ([:,1:] => all rows, all columns except the first one: day)
  arr = list(np.array(sub_df.iloc[:,1:]).flatten())
  # removing duplicates for this row:
  arr_unique = []
  for x in arr:
    if x not in arr_unique:
      arr_unique.append(x)
    else: # appending NaN so the remaining values keep their positions
      arr_unique.append(np.nan)
  # re-appending day and adding the indexes of the grouped rows:
  arr = [d, list(sub_df.index)] + arr_unique
  arr_list.append(arr)

df = pd.DataFrame(arr_list)
# replacing '' with NaN and dropping NaN columns:
df = df.replace('',np.nan).fillna(value=np.nan).dropna(axis=1, how='all')
# renaming columns: the first two are 'day' and 'index', the rest are generated as columnX, where X runs over range(number of columns - 2), since two columns are already named
df.columns = ["day", "index"] + [f"column{i}" for i in range(len(df.columns)-2)]
print(df)

Output:

       day   index column0 column1 column2 column3 column4
0   monday  [0, 1]      xx      yy      cc     NaN      aa
1  Tuesday     [2]     NaN      bb      cc      aa     NaN
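
The question asked for the index as a comma-separated string ("0,1") rather than a list. As a rough sketch of how the steps above might be wrapped into one reusable function that also produces that format, consider the following. The name `widen_by_group` is hypothetical, and the sketch assumes that all cell values are hashable (plain strings here) and that empty strings should be treated as missing:

```python
import numpy as np
import pandas as pd


def widen_by_group(df, key):
    """Flatten each group into one row, masking repeats and blanks with NaN."""
    rows = []
    for name, sub in df.groupby(key, sort=False):
        # Flatten every column except the grouping key, row by row.
        values = sub.drop(columns=key).to_numpy().ravel().tolist()
        seen = set()
        deduped = []
        for value in values:
            if value == "" or value in seen:
                # Keep the position but mark the cell as missing.
                deduped.append(np.nan)
            else:
                seen.add(value)
                deduped.append(value)
        # Join the original row labels into a string such as "0,1".
        index_label = ",".join(map(str, sub.index))
        rows.append([name, index_label] + deduped)

    # Shorter rows are padded with missing values; drop all-missing columns.
    out = pd.DataFrame(rows).dropna(axis=1, how="all")
    out.columns = [key, "index"] + [f"column{i}" for i in range(out.shape[1] - 2)]
    return out


df1 = pd.DataFrame(
    {
        "day":     ["monday", "monday", "Tuesday"],
        "column0": ["xx",     "xx",     ""],
        "column1": ["yy",     "aa",     "bb"],
        "column2": ["cc",     "cc",     "cc"],
        "column3": ["cc",     "",       "aa"]})

result = widen_by_group(df1, "day")
print(result)
```

Tracking seen values in a set avoids rescanning a list for every cell. On the sample data this should give the same values as the output above, with "0,1" and "2" in the index column.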