Python Pandas: merging same-named columns in a DataFrame

Question

10

I have some CSV files to process, and some of them contain multiple columns with the same name.

For example, I might have a CSV like this:

ID   Name   a    a    a     b    b
1    test1  1    NaN  NaN   "a"  NaN
2    test2  NaN  2    NaN   "a"  NaN
3    test3  2    3    NaN   NaN  "b"
4    test4  NaN  NaN  4     NaN  "b"

Loading it into pandas gives me this:

ID   Name   a    a.1  a.2   b    b.1
1    test1  1    NaN  NaN   "a"  NaN
2    test2  NaN  2    NaN   "a"  NaN
3    test3  2    3    NaN   NaN  "b"
4    test4  NaN  NaN  4     NaN  "b"
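The renaming shown above can be reproduced with a small in-memory file. This is a minimal sketch, assuming a comma-separated version of the sample data (the variable `csv_text` is made up for illustration); it only shows that `read_csv` appends `.1`, `.2`, ... to repeated header names.

```python
import io
import pandas as pd

# Hypothetical comma-separated version of the sample data from the question.
csv_text = (
    "ID,Name,a,a,a,b,b\n"
    "1,test1,1,,,a,\n"
    "2,test2,,2,,a,\n"
    "3,test3,2,3,,,b\n"
    "4,test4,,,4,,b\n"
)

df = pd.read_csv(io.StringIO(csv_text))

# Repeated header names are made unique with a numeric suffix.
print(list(df.columns))
```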

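What I want is to merge the same-named columns into a single column (keeping the values separate when there is more than one). My ideal output would look like this: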

ID   Name   a      b  
1    test1  "1"    "a"   
2    test2  "2"    "a"
3    test3  "2;3"  "b"
4    test4  "4"    "b"

Is this possible?

- Wizuriel

1

You need to use df.columns=['ID', 'Name', 'a','a','a','b','b'] to get a DataFrame that looks like the first table. - CT Zhu
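Rather than typing the names by hand, the duplicate names could also be restored by stripping the numeric suffix. This is a sketch under the assumption that no genuine column name ends in a dot followed by digits; `csv_text` is illustrative sample data, not from the original post.

```python
import io
import re
import pandas as pd

# Hypothetical comma-separated version of the sample data.
csv_text = (
    "ID,Name,a,a,a,b,b\n"
    "1,test1,1,,,a,\n"
    "2,test2,,2,,a,\n"
    "3,test3,2,3,,,b\n"
    "4,test4,,,4,,b\n"
)

df = pd.read_csv(io.StringIO(csv_text))

# Drop the ".1", ".2", ... suffixes that read_csv added,
# which leaves duplicate column names again.
df.columns = [re.sub(r"\.\d+$", "", name) for name in df.columns]

print(list(df.columns))
# Selecting a duplicated name returns a DataFrame with all matching columns.
print(df["a"].shape)
```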

5 Answers

5

Using duplicate column names is probably not a good idea, but it does work:

In [72]:

df2=df[['ID', 'Name']].copy() #copy so the assignments below do not trigger SettingWithCopyWarning
df2['a']='"'+df.T[df.columns.values=='a'].apply(lambda x: ';'.join(["%i"%item for item in x[x.notnull()]]))+'"' #these columns are of float dtype
df2['b']=df.T[df.columns.values=='b'].apply(lambda x: ';'.join([item for item in x[x.notnull()]])) #these columns are of objects dtype
print(df2)
   ID   Name      a    b
0   1  test1    "1"  "a"
1   2  test2    "2"  "a"
2   3  test3  "2;3"  "b"
3   4  test4    "4"  "b"

[4 rows x 4 columns]

- CT Zhu

5

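Of course, DSM and CT Zhu have very concise answers that take advantage of many of Python's built-in features and the special capabilities of the dataframe. Here is something a bit more verbose.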

def myJoiner(row):
    newrow = []
    for r in row:
        if not pandas.isnull(r):
            newrow.append(str(r))
    return ';'.join(newrow)

def groupCols(df, key):
    # DataFrame.select was removed in pandas 1.0; a boolean mask does the same job
    columns = df.loc[:, [key in col for col in df.columns]]
    joined = columns.apply(myJoiner, axis=1)
    joined.name = key
    return pandas.DataFrame(joined)

import pandas 
from io import StringIO  # python 3.X
#from StringIO import StringIO #python 2.X

data = StringIO("""\
ID   Name   a    a    a     b    b
1    test1  1    NaN  NaN   "a"  NaN
2    test2  NaN  2    NaN   "a"  NaN
3    test3  2    3    NaN   NaN  "b"
4    test4  NaN  NaN  4     NaN  "b"
""")

df = pandas.read_table(data, sep=r'\s+')
df.set_index(['ID', 'Name'], inplace=True)


AB = groupCols(df, 'a').join(groupCols(df, 'b'))
print(AB)

This gives me:

                a  b
ID Name             
1  test1      1.0  a
2  test2      2.0  a
3  test3  2.0;3.0  b
4  test4      4.0  b

- Paul H

3

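To expand on the previous answers:

Columns read in by read_csv are given suffixes to keep them unique, so you will see a, a.1, a.2, and so on.

You may need to pass a function to groupby to handle this, for example: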

import pandas as pd

df = pd.read_csv('data.csv') #csv file with multiple columns of the same name

#function to join columns if column is not null
def sjoin(x): return ';'.join(x[x.notnull()].astype(str))

#function to ignore the suffix on the column e.g. a.1, a.2 will be grouped together
def groupby_field(col):
    parts = col.split('.')
    return parts[0]

#note: groupby(..., axis=1) is deprecated as of pandas 2.1
df = df.groupby(groupby_field, axis=1).apply(lambda x: x.apply(sjoin, axis=1))
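Since grouping along axis=1 is deprecated in recent pandas releases, the same idea can be sketched with a plain loop over the base names. This is not the answer's code, just one possible variant; `merge_suffixed` and `csv_text` are names invented for this example, and it assumes that no real column name contains a dot.

```python
import io
import pandas as pd

def sjoin(values):
    # Join the non-null values of one row with ';'.
    return ";".join(values.dropna().astype(str))

def merge_suffixed(df):
    # Hypothetical helper: group columns by the part of the name before the first '.'.
    base_names = [col.split(".")[0] for col in df.columns]
    merged = {}
    for base in dict.fromkeys(base_names):  # unique names, original order
        cols = [c for c, b in zip(df.columns, base_names) if b == base]
        merged[base] = df[cols].apply(sjoin, axis=1)
    return pd.DataFrame(merged)

# Hypothetical comma-separated version of the sample data.
csv_text = (
    "ID,Name,a,a,a,b,b\n"
    "1,test1,1,,,a,\n"
    "2,test2,,2,,a,\n"
    "3,test3,2,3,,,b\n"
    "4,test4,,,4,,b\n"
)

merged = merge_suffixed(pd.read_csv(io.StringIO(csv_text)))
print(merged)
```

Note that every column, including ID, comes back as a string here, because each value passes through `astype(str)`.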

- Andrew Alger

1

If you want to patch the dataframe, you can do the following:

# consolidated columns, replacing instead of joining by ;
s_fixed_a = df['a'].fillna(df['a.1']).fillna(df['a.2'])
s_fixed_b = df['b'].fillna(df['b.1'])
# create new df
df_resulting = df[['ID', 'Name']].merge(s_fixed_a, left_index=True, right_index=True).merge(s_fixed_b, left_index=True, right_index=True)
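The chained fillna calls can be generalized to any number of suffixed columns. The following is a rough sketch, assuming the suffixed names produced by read_csv; the helper `coalesce` and the sample `csv_text` are illustrative and not part of the original answer. As in the answer, only the first non-null value per row is kept, so row 3 loses its second value.

```python
import io
from functools import reduce
import pandas as pd

def coalesce(df, cols):
    # Hypothetical helper: take the first column, then fill its gaps
    # from each following column in turn.
    return reduce(lambda acc, col: acc.fillna(df[col]), cols[1:], df[cols[0]])

# Hypothetical comma-separated version of the sample data.
csv_text = (
    "ID,Name,a,a,a,b,b\n"
    "1,test1,1,,,a,\n"
    "2,test2,,2,,a,\n"
    "3,test3,2,3,,,b\n"
    "4,test4,,,4,,b\n"
)

df = pd.read_csv(io.StringIO(csv_text))

result = pd.concat(
    [
        df[["ID", "Name"]],
        coalesce(df, ["a", "a.1", "a.2"]).rename("a"),
        coalesce(df, ["b", "b.1"]).rename("b"),
    ],
    axis=1,
)
print(result)
```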

- mancvso

- DSM · Accepted Answer

You can use groupby on axis=1 and try something like this:

>>> def sjoin(x): return ';'.join(x[x.notnull()].astype(str))
>>> df.groupby(level=0, axis=1).apply(lambda x: x.apply(sjoin, axis=1))
  ID   Name        a  b
0  1  test1      1.0  a
1  2  test2      2.0  a
2  3  test3  2.0;3.0  b
3  4  test4      4.0  b

Instead of .astype(str), you can use whatever formatting operator you like.
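As one illustration of swapping out .astype(str), the sketch below formats floats with the "g" format code so that 2.0 is rendered as 2, which matches the output requested in the question. The name `sjoin_fmt` and the sample frame are assumptions made for this example; "g" keeps only six significant digits, so it suits small whole numbers like these rather than arbitrary floats.

```python
import numpy as np
import pandas as pd

def sjoin_fmt(x):
    # Join the non-null values of a row, rendering floats without a trailing ".0".
    return ";".join(
        format(v, "g") if isinstance(v, float) else str(v)
        for v in x[x.notnull()]
    )

# Hypothetical frame holding only the suffixed "a" columns from the question.
a_cols = pd.DataFrame(
    {
        "a": [1, np.nan, 2, np.nan],
        "a.1": [np.nan, 2, 3, np.nan],
        "a.2": [np.nan, np.nan, np.nan, 4],
    }
)

joined = a_cols.apply(sjoin_fmt, axis=1)
print(joined.tolist())
```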