Repeat rows in a DataFrame with respect to a column

Question

9

I have a Pandas DataFrame that looks like this:

df = pd.DataFrame({'col1': [1, 2, 3],
                   'col2': [4, 5, 6],
                   'col3': [7, 8, 9]})

df
    col1    col2    col3
0      1       4       7
1      2       5       8
2      3       6       9

I would like to create a Pandas DataFrame like this:

df_new
    col1    col2    col3
0      1       4       7
1      1       5       8
2      1       6       9
3      2       4       7
4      2       5       8
5      2       6       9
6      3       4       7
7      3       5       8
8      3       6       9

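Is there a built-in Pandas method, or a combination of built-in methods, that can achieve this?

Even if df contains duplicates, I would like the output to follow the same pattern. In other words: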

df
    col1    col2    col3
0      1       4       7
1      2       5       8
2      2       6       8

df_new
    col1    col2    col3
0      1       4       7
1      1       5       8
2      1       6       8
3      2       4       7
4      2       5       8
5      2       6       8
6      2       4       7
7      2       5       8
8      2       6       8

- Erniu

3

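Could you give some more detail on your expansion logic? It looks like you are taking the Cartesian product of col1 and the remaining columns, which is just a standard cross join between col1 and [col2, col3] and is much faster than expanding manually. Something like df[['col1']].merge(df[['col2', 'col3']], how='cross').reset_index(drop=True). Is that right? Or does the DataFrame need to be expanded a fixed number of times, based on column values or something else? - Henry Ecker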

1

@HenryEcker. Nice use of how='cross'! It could also be to_tile=['col1']; df[to_tile].merge(df[df.columns.difference(to_tile)], how='cross').reset_index(drop=True). - Corralien

7 Answers

7

I would also go for a cross merge, as @Henry suggested in the comments.

out = df[['col1']].merge(df[['col2', 'col3']], how='cross').reset_index(drop=True)

Output:

   col1  col2  col3
0     1     4     7
1     1     5     8
2     1     6     9
3     2     4     7
4     2     5     8
5     2     6     9
6     3     4     7
7     3     5     8
8     3     6     9
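
The question also asks about input that contains duplicate values. As a quick check, here is a sketch of the same cross merge applied to the second example from the question. The names to_repeat and others are introduced only for this illustration, and how='cross' requires pandas 1.2 or later.

```python
import pandas as pd

# Second example from the question: col1 and col3 contain duplicates.
df = pd.DataFrame({'col1': [1, 2, 2],
                   'col2': [4, 5, 6],
                   'col3': [7, 8, 8]})

to_repeat = ['col1']  # illustrative name, not from the original answer
others = [c for c in df.columns if c not in to_repeat]

# Every row of df[to_repeat] is paired with every row of df[others],
# so duplicates are kept and the result has len(df) ** 2 rows.
out = df[to_repeat].merge(df[others], how='cross')[list(df.columns)]
print(out)
```

Because the merge pairs rows rather than unique values, the duplicated 2 in col1 yields two separate groups of three rows, which is the pattern the question asks for.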

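Comparison of the different approaches (the timing plot is not reproduced here):

Note that @sammywemmy's approach behaves differently when rows are duplicated, so its timings are not comparable with the others.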
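
A rough way to rerun such a comparison yourself is sketched below with timeit. The helper names, the data size and the number of runs are arbitrary choices for illustration, only two of the approaches are included, and absolute timings will depend on your machine and library versions.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 100, size=(100, 3)),
                  columns=['col1', 'col2', 'col3'])

def cross_merge(df):
    # Cross join of col1 against the remaining columns.
    return df[['col1']].merge(df[['col2', 'col3']], how='cross')

def repeat_tile(df):
    # Repeat col1 element-wise, tile the other columns as whole blocks.
    n = len(df)
    out = pd.DataFrame({'col1': np.repeat(df['col1'].to_numpy(), n)})
    for col in ['col2', 'col3']:
        out[col] = np.tile(df[col].to_numpy(), n)
    return out

# Both helpers should agree before their timings are compared.
assert cross_merge(df).equals(repeat_tile(df))

for func in (cross_merge, repeat_tile):
    seconds = timeit.timeit(lambda: func(df), number=20)
    print(f'{func.__name__}: {seconds:.3f} s for 20 runs')
```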

- mozway

Interesting timing data. As the length grows, the other solutions look like they might catch up with mine... - Nick

2

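@sammywemmy Regarding your approach: when there are duplicate rows, it does not give the same output as the other approaches (hence the different timings). - mozway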

4

@mozway Cool stuff. Interesting to see itertools as the slowest option. I like this kind of question, which draws a variety of answers and timings. - Nick

1

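I tried generating the col1, col2 and col3 data with random.choices(range(100), k=100), and the fastest answer was @rr_goyal's. Time for 1000 iterations: rr_goyal 0.44 s, mozway 1.3 s, PaulS 2.87 s, Nick 8.17 s :( I don't have janitor installed, so I couldn't test it. - Nick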

2

It feels great to learn all these new approaches! - rr_goyal

6

One option is complete from pyjanitor:

# pip install pyjanitor
import janitor 
import pandas as pd

df.complete('col1', ('col2','col3'))
   col1  col2  col3
0     1     4     7
1     1     5     8
2     1     6     9
3     2     4     7
4     2     5     8
5     2     6     9
6     3     4     7
7     3     5     8
8     3     6     9

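complete is primarily meant for exposing missing rows; the output above is just a nice side effect. A more appropriate, though rather verbose, option is expand_grid: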

# pip install pyjanitor
import janitor as jn
import pandas as pd

others = {'df1':df.col1, 'df2':df[['col2','col3']]}
jn.expand_grid(others=others).droplevel(axis=1,level=0)
   col1  col2  col3
0     1     4     7
1     1     5     8
2     1     6     8
3     2     4     7
4     2     5     8
5     2     6     8
6     2     4     7
7     2     5     8
8     2     6     8

- sammywemmy

3

This code is neat for the non-duplicate case, but unfortunately it does not handle the other case correctly. - mozway

6

You can concatenate copies of the DataFrame, each with col1 replaced by one of the values in col1:

out = df.drop('col1', axis=1)
out = pd.concat([out.assign(col1=c1) for c1 in df['col1']]).reset_index(drop=True)

Output:

   col2  col3  col1
0     4     7     1
1     5     8     1
2     6     9     1
3     4     7     2
4     5     8     2
5     6     9     2
6     4     7     3
7     5     8     3
8     6     9     3

If you like, you can reorder the columns back to their original order with:

out = out[['col1', 'col2', 'col3']]
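
If the column names are not known in advance, the original order can also be restored by indexing with df.columns. A minimal sketch, repeating the sample data so that it runs on its own and using ignore_index=True in place of the separate reset_index call:

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3],
                   'col2': [4, 5, 6],
                   'col3': [7, 8, 9]})

rest = df.drop('col1', axis=1)
# One copy of the remaining columns per value of col1; assign() appends col1 last.
out = pd.concat([rest.assign(col1=c1) for c1 in df['col1']], ignore_index=True)
# Reorder to match the source frame without spelling out the column names.
out = out[df.columns]
print(out)
```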

- Nick

6

You can use np.repeat and np.tile to get the expected output:

import numpy as np

N = len(df)  # number of repetitions (3 for the sample data)
cols_to_repeat = ['col1']  # 1, 1, 1, 2, 2, 2
cols_to_tile = ['col2', 'col3']  # 1, 2, 1, 2, 1, 2

data = np.concatenate([np.tile(df[cols_to_tile].values.T, N).T,
                       np.repeat(df[cols_to_repeat].values, N, axis=0)], axis=1)
out = pd.DataFrame(data, columns=cols_to_tile + cols_to_repeat)[df.columns]

Output:

>>> out
   col1  col2  col3
0     1     4     7
1     1     5     8
2     1     6     9
3     2     4     7
4     2     5     8
5     2     6     9
6     3     4     7
7     3     5     8
8     3     6     9

You can create a generic function:

def repeat(df: pd.DataFrame, to_repeat: list[str], to_tile: list[str] = None) -> pd.DataFrame:
    # Columns not listed in to_repeat are tiled by default.
    to_tile = to_tile if to_tile else df.columns.difference(to_repeat).tolist()

    assert df.columns.difference(to_repeat + to_tile).empty, "all columns should be repeated or tiled"

    # Every row is paired with every row, so repeat and tile len(df) times
    # rather than relying on a global N.
    N = len(df)
    data = np.concatenate([np.tile(df[to_tile].values.T, N).T,
                           np.repeat(df[to_repeat].values, N, axis=0)], axis=1)

    return pd.DataFrame(data, columns=to_tile + to_repeat)[df.columns]

Usage:

>>> repeat(df, ['col1'])
   col1  col2  col3
0     1     4     7
1     1     5     8
2     1     6     9
3     2     4     7
4     2     5     8
5     2     6     9
6     3     4     7
7     3     5     8
8     3     6     9
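
One caveat with building a single NumPy array is that np.concatenate casts every column to a common dtype, which matters for frames that mix, say, integers and strings. A possible variant, sketched here as an alternative rather than as part of the answer above, repeats and tiles row positions instead and lets pandas keep each column's dtype. The name repeat_by_position is hypothetical.

```python
import numpy as np
import pandas as pd

def repeat_by_position(df: pd.DataFrame, to_repeat: list) -> pd.DataFrame:
    n = len(df)
    to_tile = [c for c in df.columns if c not in to_repeat]
    positions = np.arange(n)
    # Row positions 0,0,0,1,1,1,... for the repeated columns.
    repeated = df[to_repeat].iloc[np.repeat(positions, n)].reset_index(drop=True)
    # Row positions 0,1,2,0,1,2,... for the tiled columns.
    tiled = df[to_tile].iloc[np.tile(positions, n)].reset_index(drop=True)
    return pd.concat([repeated, tiled], axis=1)[list(df.columns)]

# Mixed dtypes: integers, strings and floats.
df = pd.DataFrame({'col1': [1, 2, 3],
                   'col2': ['a', 'b', 'c'],
                   'col3': [7.5, 8.5, 9.5]})
out = repeat_by_position(df, ['col1'])
print(out.dtypes)
```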

- Corralien

6

Another possible solution is based on itertools.product:

from itertools import product

pd.DataFrame([[x, y[0], y[1]] for x, y in 
              product(df['col1'], zip(df['col2'], df['col3']))], 
             columns=df.columns)

Output:

   col1  col2  col3
0     1     4     7
1     1     5     8
2     1     6     9
3     2     4     7
4     2     5     8
5     2     6     9
6     3     4     7
7     3     5     8
8     3     6     9
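
The list comprehension above hard-codes three columns. A sketch of a more general form pairs row tuples from the two column groups, assuming the columns to repeat are passed as a list. The names product_repeat and to_repeat are illustrative only.

```python
from itertools import product

import pandas as pd

def product_repeat(df: pd.DataFrame, to_repeat: list) -> pd.DataFrame:
    to_tile = [c for c in df.columns if c not in to_repeat]
    # Plain tuples of row values for each column group.
    left = df[to_repeat].itertuples(index=False, name=None)
    right = df[to_tile].itertuples(index=False, name=None)
    # product() pairs every left row with every right row; join the tuples.
    rows = [l + r for l, r in product(left, right)]
    return pd.DataFrame(rows, columns=to_repeat + to_tile)[list(df.columns)]

df = pd.DataFrame({'col1': [1, 2, 3],
                   'col2': [4, 5, 6],
                   'col3': [7, 8, 9]})
out = product_repeat(df, ['col1'])
print(out)
```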

- PaulS

0

Here is an approach using concat with keys:

pd.concat([df]*len(df),keys = df.pop('col1')).reset_index(level=0)

Output:

   col1  col2  col3
0     1     4     7
1     1     5     8
2     1     6     9
0     2     4     7
1     2     5     8
2     2     6     9
0     3     4     7
1     3     5     8
2     3     6     9
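
Two things to be aware of in the one-liner: df.pop removes col1 from df itself, and the result keeps the repeated 0, 1, 2 index. A non-mutating sketch that names the key level explicitly and renumbers the rows might look like this (the variable rest is introduced for illustration):

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3],
                   'col2': [4, 5, 6],
                   'col3': [7, 8, 9]})

rest = df.drop(columns='col1')  # df itself is left untouched
out = (pd.concat([rest] * len(df), keys=df['col1'].tolist(), names=['col1'])
         .reset_index(level=0)      # move the key level into a col1 column
         .reset_index(drop=True))   # renumber the rows 0..8
print(out)
```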

- rhug123

- rr_goyal · Accepted Answer

I would love to see more Pythonic or "Pandas-only" answers, but this one works well too!

import pandas as pd
import numpy as np

n=3

df = pd.DataFrame({'col1': [1, 2, 3],
                   'col2': [4, 5, 6],
                   'col3': [7, 8, 9]})

# Edited and added this new method.
df2 = pd.DataFrame({df.columns[0]:np.repeat(df['col1'].values, n)})
df2[df.columns[1:]] = df.iloc[:,1:].apply(lambda x: np.tile(x, n))

""" Old method.
for col in df.columns[1:]:
   df2[col] = np.tile(df[col].values, n)

"""
print(df2)