Reshaping a pandas DataFrame

Question

17

Suppose I have a DataFrame like this:

df = pd.DataFrame([[1,2,3,4],[5,6,7,8],[9,10,11,12]], columns = ['A', 'B', 'A1', 'B1'])

I would like a DataFrame that looks like this:

What does not work:

new_rows = int(df.shape[1]/2) * df.shape[0]
new_cols = 2
df.values.reshape(new_rows, new_cols, order='F')
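
To see why the reshape above falls short, here is a minimal sketch. It assumes the goal is to stack the (A1, B1) block underneath the (A, B) block. Column-major reshaping pairs A with A1 instead of A with B, while splitting the columns into blocks of two and moving the block axis to the front gives the stacked layout. The names attempt, blocks and stacked are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]],
                  columns=['A', 'B', 'A1', 'B1'])
values = df.to_numpy()

# The attempt from the question: column-major order puts A next to A1.
attempt = values.reshape(6, 2, order='F')
# rows: [1, 3], [5, 7], [9, 11], [2, 4], [6, 8], [10, 12]

# Assumed goal: keep each (A, B) pair together and stack the blocks.
n_rows, n_cols = values.shape
blocks = values.reshape(n_rows, n_cols // 2, 2)      # (row, block, column in block)
stacked = blocks.transpose(1, 0, 2).reshape(-1, 2)   # (block, row) flattened to rows
result = pd.DataFrame(stacked, columns=['A', 'B'])
# rows: [1, 2], [5, 6], [9, 10], [3, 4], [7, 8], [11, 12]
```

This only works when every block has the same width and the columns are already ordered block by block.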

Of course, I could loop over the data and build a new list, but there must be a better way. Any ideas?

- Moritz

I added a more robust answer that generalizes to nearly any situation like the one you are facing. - Ted Petrou

4 Answers

10

You can use lreshape, and build the id column with numpy.repeat:

a = [col for col in df.columns if 'A' in col]
b = [col for col in df.columns if 'B' in col]
df1 = pd.lreshape(df, {'A' : a, 'B' : b})

df1['id'] = np.repeat(np.arange(len(df.columns) // 2), len(df.index)) + 1
print(df1)
    A   B  id
0   1   2   1
1   5   6   1
2   9  10   1
3   3   4   2
4   7   8   2
5  11  12   2

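Edit:

lreshape is currently undocumented, and it may be removed (together with pd.wide_to_long).

A possible solution is to merge all three functions into one, perhaps melt, but that has not been implemented yet. It may land in some future version of pandas, and I will update my answer then.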

- jezrael

4

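I solved this in the following 3 steps:

1. Create a new DataFrame df2 that holds only the data to be added below the initial DataFrame df.
2. Delete from df the data that will be added below (the same data used to build df2).
3. Append df2 to df.

Code example: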

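# step 1: create new dataframe (copy so that renaming does not touch df)
df2 = df[['A1', 'B1']].copy()
df2.columns = ['A', 'B']

# step 2: delete that data from original
df = df.drop(columns=['A1', 'B1'])

# step 3: append (DataFrame.append was removed in pandas 2.0; pd.concat replaces it)
df = pd.concat([df, df2], ignore_index=True)

Note that when appending you need to specify ignore_index=True, so that the appended rows get new index labels instead of keeping their old ones.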
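
As a small illustration of what ignore_index changes, the sketch below stacks two frames with and without the flag. It uses pd.concat, which accepts the same ignore_index parameter; the frames top and bottom are made up for this example and stand in for df and df2.

```python
import pandas as pd

# Stand-ins for df (after the drop) and df2; illustrative data only.
top = pd.DataFrame({'A': [1, 5, 9], 'B': [2, 6, 10]})
bottom = pd.DataFrame({'A': [3, 7, 11], 'B': [4, 8, 12]})

kept = pd.concat([top, bottom])                       # old labels survive: 0, 1, 2, 0, 1, 2
fresh = pd.concat([top, bottom], ignore_index=True)   # new labels: 0 through 5

print(kept.index.tolist())
print(fresh.index.tolist())
```

With duplicated labels, a lookup such as kept.loc[0] returns two rows, which is usually not what is wanted after stacking.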

The end result should be your original DataFrame with the data rearranged the way you wanted:

In [16]: df
Out[16]:
    A   B
0   1   2
1   5   6
2   9  10
3   3   4
4   7   8
5  11  12

- mprat

1

Use pd.concat() as follows:

#Split into separate tables
df_1 = df[['A', 'B']]
df_2 = df[['A1', 'B1']]
df_2.columns = ['A', 'B'] # Make column names line up

# Add the ID column
df_1 = df_1.assign(id=1)
df_2 = df_2.assign(id=2)

# Concatenate
pd.concat([df_1, df_2])
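
The same idea can be sketched for any number of column blocks by looping over them. The helper stack_groups and its parameters are hypothetical names introduced for this example, not part of pandas; it assumes every block lists its columns in the same order as the target names.

```python
import pandas as pd

def stack_groups(df, groups, names):
    """Stack column blocks vertically (hypothetical helper).

    groups: list of column-name lists, one per block.
    names:  column names every block is renamed to.
    """
    parts = []
    for block_id, cols in enumerate(groups, start=1):
        part = df[cols].copy()      # copy so renaming does not touch df
        part.columns = names        # make column names line up
        parts.append(part.assign(id=block_id))
    # ignore_index=True gives the stacked rows a fresh 0..n-1 index
    return pd.concat(parts, ignore_index=True)

df = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]],
                  columns=['A', 'B', 'A1', 'B1'])
result = stack_groups(df, [['A', 'B'], ['A1', 'B1']], ['A', 'B'])
print(result)
```

Passing ignore_index=True is a choice made here; without it, as in the snippet above, the row labels 0, 1, 2 repeat for each block.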

- Matthew

@Moritz - I see. Personally, I would solve this with a for loop. But maybe @jezrael's "lreshape" solution is better for this case. - Matthew

- Ted Petrou · Accepted Answer

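The pd.wide_to_long function is built almost exactly for this situation, where you have many variables that share a prefix and end in different numeric suffixes. The only difference here is that your first set of variables has no suffix, so you need to rename the columns first.

The only issue with pd.wide_to_long is that, unlike melt, it must have an identifying variable i. Use reset_index to create a unique identifier column, which is dropped later. I think this may be corrected in the future.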

df1 = df.rename(columns={'A':'A1', 'B':'B1', 'A1':'A2', 'B1':'B2'}).reset_index()
pd.wide_to_long(df1, stubnames=['A', 'B'], i='index', j='id')\
  .reset_index()[['A', 'B', 'id']]

    A   B id
0   1   2  1
1   5   6  1
2   9  10  1
3   3   4  2
4   7   8  2
5  11  12  2
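
For frames with more than a handful of columns, the rename step can be derived from the column names instead of typed out. The sketch below is one way to do that; bump_suffix is a hypothetical helper, and it assumes every column is a run of letters optionally followed by digits. The explicit sort is added so the row order does not depend on how wide_to_long orders its output.

```python
import re
import pandas as pd

def bump_suffix(col):
    """Map 'A' -> 'A1' and 'A1' -> 'A2' (hypothetical helper)."""
    stub, num = re.fullmatch(r"([A-Za-z]+)(\d*)", col).groups()
    return f"{stub}{int(num) + 1 if num else 1}"

df = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]],
                  columns=['A', 'B', 'A1', 'B1'])

# Rename first, then reset_index so the new 'index' column is not renamed.
wide = df.rename(columns=bump_suffix).reset_index()

long_df = (pd.wide_to_long(wide, stubnames=['A', 'B'], i='index', j='id')
             .reset_index()
             .sort_values(['id', 'index'])    # block 1 first, then block 2
             [['A', 'B', 'id']]
             .reset_index(drop=True))
print(long_df)
```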