Renaming duplicated index values in a Pandas DataFrame

12

I have a DataFrame that contains duplicated index values:

import numpy as np
import pandas as pd
df1 = pd.DataFrame(np.random.randn(6, 6), columns=pd.date_range('1/1/2010', periods=6), index=["A", "C", "B", "F", "D", "E"])  # a list, because a set has no guaranteed order
df1.rename(index={"C": "A", "B": "E"}, inplace=True)

ipdb> df1
      2010-01-01  2010-01-02  2010-01-03  2010-01-04  2010-01-05  2010-01-06
 A   -1.163883    0.593760    2.323342   -0.928527    0.058336   -0.209101
 A   -0.593566   -0.894161   -0.789849    1.452725    0.821477   -0.738937
 E   -0.670305   -1.788403    0.134790   -0.270894    0.672948    1.149089
 F    1.707686    0.323213    0.048503    1.168898    0.002662   -1.988825
 D    0.403028   -0.879873   -1.809991   -1.817214   -0.012758    0.283450
 E   -0.224405   -1.803301    0.582946    0.338941    0.798908    0.714560

I want to rename only the duplicated values and obtain a DataFrame like this:

ipdb> df1
      2010-01-01  2010-01-02  2010-01-03  2010-01-04  2010-01-05  2010-01-06
A      -1.163883    0.593760    2.323342   -0.928527    0.058336   -0.209101
A_dp   -0.593566   -0.894161   -0.789849    1.452725    0.821477   -0.738937
E      -0.670305   -1.788403    0.134790   -0.270894    0.672948    1.149089
F       1.707686    0.323213    0.048503    1.168898    0.002662   -1.988825
D       0.403028   -0.879873   -1.809991   -1.817214   -0.012758    0.283450
E_dp   -0.224405   -1.803301    0.582946    0.338941    0.798908    0.714560

My approach:

(i) Create a dictionary of new names

old_names = df1[df1.index.duplicated()].index.values
new_names = df1[df1.index.duplicated()].index.values + "_dp"
dictionary = dict(zip(old_names, new_names))

(ii) Rename only the duplicated values

df1.loc[df1.index.duplicated(),:].rename(index = dictionary, inplace = True)

But this does not seem to work.
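
A likely reason the attempt above has no effect: `df1.loc[mask, :]` returns a new DataFrame, so `rename(..., inplace=True)` modifies that temporary object and leaves `df1` untouched. A second issue is that `rename` maps labels rather than positions, so a dictionary such as `{"A": "A_dp"}` renames every row labelled "A", including the first occurrence. A minimal sketch on a made-up three-row frame (`demo` is a hypothetical name, not from the question):

```python
import pandas as pd

# Hypothetical toy frame with one duplicated label
demo = pd.DataFrame({"x": [1, 2, 3]}, index=["A", "A", "E"])
mask = demo.index.duplicated()  # [False, True, False]

# inplace=True acts on the temporary object returned by .loc, not on demo
demo.loc[mask, :].rename(index={"A": "A_dp"}, inplace=True)
print(list(demo.index))     # ['A', 'A', 'E']

# rename maps labels, so every "A" is renamed, not only the repeat
renamed = demo.rename(index={"A": "A_dp"})
print(list(renamed.index))  # ['A_dp', 'A_dp', 'E']
```

Both effects suggest assigning a freshly built index to `df1.index` instead of going through `rename`.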

4 Answers

24
You can use Index.where:
df1.index = df1.index.where(~df1.index.duplicated(), df1.index + '_dp')
print (df1)
      2010-01-01  2010-01-02  2010-01-03  2010-01-04  2010-01-05  2010-01-06
A      -1.163883    0.593760    2.323342   -0.928527    0.058336   -0.209101
A_dp   -0.593566   -0.894161   -0.789849    1.452725    0.821477   -0.738937
E      -0.670305   -1.788403    0.134790   -0.270894    0.672948    1.149089
F       1.707686    0.323213    0.048503    1.168898    0.002662   -1.988825
D       0.403028   -0.879873   -1.809991   -1.817214   -0.012758    0.283450
E_dp   -0.224405   -1.803301    0.582946    0.338941    0.798908    0.714560
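
For readers who want to try this without the random data, here is a small self-contained sketch of the same `Index.where` idea. The frame `df` and the column name `x` are made up for illustration; only the index labels mirror the question.

```python
import pandas as pd

# Hypothetical small frame; the labels mirror the question's index
df = pd.DataFrame({"x": range(6)}, index=["A", "A", "E", "F", "D", "E"])

# duplicated() is False for the first occurrence and True for later repeats
is_repeat = df.index.duplicated()

# where() keeps a label where the condition is True and
# takes the replacement label where it is False
df.index = df.index.where(~is_repeat, df.index + "_dp")
print(list(df.index))  # ['A', 'A_dp', 'E', 'F', 'D', 'E_dp']
```

Note that a label occurring three or more times would receive the same `_dp` suffix on every repeat, so the index would still contain duplicates in that case.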

If the index must end up unique (for example, when a value occurs more than twice), append a counter instead:

print (df1)
   2010-01-01  2010-01-02  2010-01-03  2010-01-04  2010-01-05  2010-01-06
A   -1.163883    0.593760    2.323342   -0.928527    0.058336   -0.209101
A   -0.593566   -0.894161   -0.789849    1.452725    0.821477   -0.738937
E   -0.670305   -1.788403    0.134790   -0.270894    0.672948    1.149089
E   -0.670305   -1.788403    0.134790   -0.270894    0.672948    1.149089
E   -0.670305   -1.788403    0.134790   -0.270894    0.672948    1.149089
F    1.707686    0.323213    0.048503    1.168898    0.002662   -1.988825
D    0.403028   -0.879873   -1.809991   -1.817214   -0.012758    0.283450
E   -0.224405   -1.803301    0.582946    0.338941    0.798908    0.714560

df1.index = df1.index + df1.groupby(level=0).cumcount().astype(str).replace('0','')
print (df1)
    2010-01-01  2010-01-02  2010-01-03  2010-01-04  2010-01-05  2010-01-06
A    -1.163883    0.593760    2.323342   -0.928527    0.058336   -0.209101
A1   -0.593566   -0.894161   -0.789849    1.452725    0.821477   -0.738937
E    -0.670305   -1.788403    0.134790   -0.270894    0.672948    1.149089
E1   -0.670305   -1.788403    0.134790   -0.270894    0.672948    1.149089
E2   -0.670305   -1.788403    0.134790   -0.270894    0.672948    1.149089
F     1.707686    0.323213    0.048503    1.168898    0.002662   -1.988825
D     0.403028   -0.879873   -1.809991   -1.817214   -0.012758    0.283450
E3   -0.224405   -1.803301    0.582946    0.338941    0.798908    0.714560
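
The counter approach can also be sketched step by step. The version below is an illustrative variant, not the answer's exact one-liner: it builds the new labels with a plain Python loop over `zip`, and the frame `df` is made up.

```python
import pandas as pd

# Hypothetical frame with one label ("E") that occurs four times
df = pd.DataFrame({"x": range(8)},
                  index=["A", "A", "E", "E", "E", "F", "D", "E"])

# cumcount() numbers the rows within each index label: 0, 1, 2, ...
counter = df.groupby(level=0).cumcount()

# Keep the first occurrence as is and append the counter to later ones
df.index = [f"{label}{n}" if n else label
            for label, n in zip(df.index, counter)]
print(list(df.index))  # ['A', 'A1', 'E', 'E1', 'E2', 'F', 'D', 'E3']
```

One caveat: a generated label such as `A1` could collide with a label that already exists in the index, so it may be worth checking `df.index.is_unique` afterwards.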

2
Very nice answer! - piRSquared
2
@piRSquared - Thank you. - jezrael
1
Thanks for the quick and elegant answer! - Miquel
1
This solution also works for any column whose values need renaming: df1['col'] = df1['col'] + df1.groupby(['col']).cumcount().astype(str).replace('0','') - DuCorey
1
If you want all duplicates numbered from 1 to n (i.e. E1, E2, E3 instead of E, E1, E2), here is a solution: df1.loc[df1.duplicated('col', keep=False),'col'] = df1.loc[df1.duplicated('col', keep=False),'col'] + (df1[df1.duplicated('col', keep=False)].groupby("col").cumcount() + 1).astype(str) - mah65

4
I used jezrael's excellent answer in this renaming function:
def rn(df, suffix = '-duplicate-'):
    appendents = (suffix + df.groupby(level=0).cumcount().astype(str).replace('0','')).replace(suffix, '')
    return df.set_index(df.index + appendents)

Then this:
df = pd.DataFrame({'a':[1,2,3,4,5,6,7,8, 9]}, index=['a'+str(i) for i in [1,2,3,3,4,3,5,5, 6]])
rn(df)

produces the following output:

                a
a1              1
a2              2
a3              3
a3-duplicate-1  4
a4              5
a3-duplicate-2  6
a5              7
a5-duplicate-1  8
a6              9

0
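Building on @kotrfa's answer, we can make the function more flexible by using both a prefix and a suffix. This also mirrors the way Windows appends an extra (n) to the name of a file that is downloaded more than once. The function also supports in-place operation, because *args and **kwargs are passed through to set_index.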
def rename_duplicates(df, prefix=' (', suffix=')', *args, **kwargs):
    appendents = (prefix
                  + (df.groupby(level=0).cumcount() + 1).astype(str)
                  + suffix
                 ).replace(f'{prefix}1{suffix}', '')
    return df.set_index(df.index + appendents, *args, **kwargs)
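
A short usage sketch may help here. The function is repeated so that the example runs on its own; the frame `df` and its labels are made up for illustration.

```python
import pandas as pd

def rename_duplicates(df, prefix=' (', suffix=')', *args, **kwargs):
    # Same function as above, repeated so the sketch runs on its own
    appendents = (prefix
                  + (df.groupby(level=0).cumcount() + 1).astype(str)
                  + suffix
                 ).replace(f'{prefix}1{suffix}', '')
    return df.set_index(df.index + appendents, *args, **kwargs)

# Hypothetical frame with a label that occurs three times
df = pd.DataFrame({'a': [1, 2, 3, 4]}, index=['x', 'y', 'x', 'x'])

# Default call: returns a new DataFrame and leaves df unchanged
renamed = rename_duplicates(df)
print(list(renamed.index))  # ['x', 'y', 'x (2)', 'x (3)']

# Extra keyword arguments are forwarded to set_index
rename_duplicates(df, prefix='_', suffix='', inplace=True)
print(list(df.index))       # ['x', 'y', 'x_2', 'x_3']
```

With `inplace=True` the call returns `None`, as `set_index` does, so the result should not be assigned back to a variable in that case.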

0
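Here is a function that renames any duplicated rows in a DataFrame according to their order of appearance. To apply it to the index, first call df = df.reset_index() and then pass column='index'.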
def rename_duplicates(df, column='column_name'):
    df = df.copy()
    # Flag every row whose value in `column` occurs more than once
    duplicate_rows = df[column].duplicated(keep=False)
    # Number the occurrences of each value in order of appearance: 0, 1, 2, ...
    counter = df.groupby(column).cumcount().astype(str)

    df.loc[duplicate_rows, column] = df.loc[duplicate_rows, column] + '_' + counter[duplicate_rows]
    return df

If you have two entries named 'apple', they will now become 'apple_0' and 'apple_1', and so on.
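
The round trip through `reset_index()` mentioned above can be sketched as follows. This is an inline illustration under assumptions rather than the answer's exact function: the frame `df`, the column `price`, and the intermediate names `flat`, `is_dup` and `counter` are all hypothetical.

```python
import pandas as pd

# Hypothetical frame whose index contains a repeated label
df = pd.DataFrame({'price': [1.0, 2.0, 3.0]},
                  index=['apple', 'pear', 'apple'])

# Move the index into a regular column; an unnamed index becomes 'index'
flat = df.reset_index()

# Rows whose label occurs more than once, and a per-label counter
is_dup = flat['index'].duplicated(keep=False)
counter = flat.groupby('index').cumcount().astype(str)

# Append '_<n>' only to the duplicated labels
flat.loc[is_dup, 'index'] = flat.loc[is_dup, 'index'] + '_' + counter[is_dup]

# Restore the renamed column as the index
result = flat.set_index('index')
print(list(result.index))  # ['apple_0', 'pear', 'apple_1']
```

The restored index is named 'index' after the round trip; `result.rename_axis(None)` removes that name if it is not wanted.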
