Modifying a pandas DataFrame in Python based on multiple rows

Question


3

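I'm working with a DataFrame in Pandas/Python. Each row has an ID (which is not unique), and I would like to add a column holding the second name for each row that has multiple matching IDs.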

Starting with:

   ID Name  Rate
0   1    A  65.5
1   2    B  67.3
2   2    C  78.8
3   3    D  65.0
4   4    E  45.3
5   5    F  52.0
6   5    G  66.0
7   6    H  34.0
8   7    I   2.0

Trying to get to:

   ID Name  Rate Secondname
0   1    A  65.5       None
1   2    B  67.3          C
2   2    C  78.8          B
3   3    D  65.0       None
4   4    E  45.3       None
5   5    F  52.0          G
6   5    G  66.0          F
7   6    H  34.0       None
8   7    I   2.0       None

My code:

import numpy as np
import pandas as pd


mydict = {'ID':[1,2,2,3,4,5,5,6,7],
             'Name':['A','B','C','D','E','F','G','H','I'],
             'Rate':[65.5,67.3,78.8,65,45.3,52,66,34,2]}

df=pd.DataFrame(mydict)

df['Newname']='None'

for i in range(0, df.shape[0]-1):
    if df.irow(i)['ID']==df.irow(i+1)['ID']:       
        df.irow(i)['Newname']=df.irow(i+1)['Name']

This results in the following error:

A value is trying to be set on a copy of a slice from a DataFrame

See the the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
df.irow(i)['Newname']=df.irow(i+1)['Secondname']
C:\Users\L\Anaconda3\lib\site-packages\pandas\core\series.py:664:     SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
self.loc[key] = value

Any help would be much appreciated.

- LJH11

2 Answers

0

If your DataFrame is sorted by ID, you can add a new column that compares the current row's ID with the next row's ID:

# cpIDs = True if the next row has the same ID
# (shift(-1) keeps the index aligned; the last row compares against NaN, so it is False)
df['cpIDs'] = df['ID'] == df['ID'].shift(-1)
# ShiftName == Name of the next row
df['ShiftName'] = df['Name'].shift(-1)
# fill in SecondName
df.loc[df['cpIDs'], 'SecondName'] = df.loc[df['cpIDs'], 'ShiftName']
# remove columns
del df['cpIDs']
del df['ShiftName']

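Of course, you can shorten the code above; I deliberately made it longer but easier to follow. Depending on the size of your DataFrame it can be very fast (perhaps the fastest), since it doesn't use any complex operations.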

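P.S. By the way, try to avoid loops when working with DataFrames and numpy arrays. You can almost always find a so-called vectorized solution, which operates on whole arrays or large ranges rather than on individual cells and rows.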
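As an illustration of the vectorized idea, here is a minimal sketch that is not part of the original answer. It assumes the frame is sorted by ID, has a default integer index, and that an ID appears at most twice; the names `looped`, `vectorized`, `next_same` and `prev_same` are made up for the example. The loop writes through `df.loc[row, column]`, which assigns into the frame itself rather than into a temporary copy, and the vectorized version builds the same column from shifted Series.

```python
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 2, 3, 4, 5, 5, 6, 7],
                   'Name': list('ABCDEFGHI')})

# Loop version: slow on large frames, but .loc[row, column] modifies the frame in place
looped = df.copy()
looped['Secondname'] = None
for i in range(len(looped) - 1):
    if looped.loc[i, 'ID'] == looped.loc[i + 1, 'ID']:
        looped.loc[i, 'Secondname'] = looped.loc[i + 1, 'Name']
        looped.loc[i + 1, 'Secondname'] = looped.loc[i, 'Name']

# Vectorized version: compare each ID with its neighbours in one pass
next_same = df['ID'] == df['ID'].shift(-1)   # next row has the same ID
prev_same = df['ID'] == df['ID'].shift(1)    # previous row has the same ID

vectorized = df.copy()
vectorized['Secondname'] = (
    df['Name'].shift(-1).where(next_same)                        # name of the next row
    .combine_first(df['Name'].shift(1).where(prev_same))         # else name of the previous row
)
print(vectorized)
```

Both versions fill the partner name in both directions; rows whose ID is unique keep a missing value.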

- Roman Kh

Please check your answer, because its output is wrong. - jezrael

Yes, you're right, I slightly misunderstood the task. Nevertheless, I'll keep my answer, because your second solution is very good. - Roman Kh


- jezrael · Accepted Answer

You can use groupby with a custom function f that uses shift and combine_first.

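def f(x):
    # within each ID group, take the previous row's Name and,
    # where there is no previous row, fall back to the next row's Name
    x['Secondname'] = x['Name'].shift(1).combine_first(x['Name'].shift(-1))
    return x

# group_keys=False keeps the original row index in the result
print(df.groupby('ID', group_keys=False).apply(f))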
   ID Name  Rate Secondname
0   1    A  65.5        NaN
1   2    B  67.3          C
2   2    C  78.8          B
3   3    D  65.0        NaN
4   4    E  45.3        NaN
5   5    F  52.0          G
6   5    G  66.0          F
7   6    H  34.0        NaN
8   7    I   2.0        NaN
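To see why this works, here is a small sketch of what happens inside a single group. The Series below are hand-built stand-ins for a group's `Name` column; the three-row group is a hypothetical case that does not occur in the question's data.

```python
import pandas as pd

# Name values of the ID == 2 group, with their original row labels
names = pd.Series(['B', 'C'], index=[1, 2])

prev_name = names.shift(1)    # previous row's name: [NaN, 'B']
next_name = names.shift(-1)   # next row's name:     ['C', NaN]

# combine_first keeps prev_name and fills its gaps from next_name
partner = prev_name.combine_first(next_name)
print(partner.tolist())       # ['C', 'B']

# Hypothetical group with three rows: the middle row gets the previous name
three = pd.Series(['X', 'Y', 'Z'])
partner3 = three.shift(1).combine_first(three.shift(-1))
print(partner3.tolist())      # ['Y', 'X', 'Y']
```

A group with a single row has nothing to shift in, so both shifted Series are NaN and the result stays NaN, matching the output above.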

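You can also avoid groupby: find the duplicated rows with duplicated, fill helper columns from the Name column using loc, then use shift and combine_first, and finally remove the helper columns with drop.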

print(df.duplicated('ID', keep='first'))
0    False
1    False
2     True
3    False
4    False
5    False
6     True
7    False
8    False
dtype: bool   
print(df.duplicated('ID', keep='last'))
0    False
1     True
2    False
3    False
4    False
5     True
6    False
7    False
8    False
dtype: bool  
df.loc[ df.duplicated('ID', keep='first'), 'first'] = df['Name']
df.loc[ df.duplicated('ID', keep='last'), 'last'] = df['Name']
print(df)
   ID Name  Rate first last
0   1    A  65.5   NaN  NaN
1   2    B  67.3   NaN    B
2   2    C  78.8     C  NaN
3   3    D  65.0   NaN  NaN
4   4    E  45.3   NaN  NaN
5   5    F  52.0   NaN    F
6   5    G  66.0     G  NaN
7   6    H  34.0   NaN  NaN
8   7    I   2.0   NaN  NaN

df['SecondName'] = df['first'].shift(-1).combine_first(df['last'].shift(1))
# drop both helper columns
df = df.drop(['first', 'last'], axis=1)

print(df)
   ID Name  Rate SecondName
0   1    A  65.5        NaN
1   2    B  67.3          C
2   2    C  78.8          B
3   3    D  65.0        NaN
4   4    E  45.3        NaN
5   5    F  52.0          G
6   5    G  66.0          F
7   6    H  34.0        NaN
8   7    I   2.0        NaN

Timings (note that Roman Kh's solution produced wrong output in these tests):

len(df) = 9:

In [154]: %timeit jez(df1)
100 loops, best of 3: 15 ms per loop

In [155]: %timeit jez2(df2)
100 loops, best of 3: 3.45 ms per loop

In [156]: %timeit rom(df)
100 loops, best of 3: 3.55 ms per loop

len(df) = 90k:

In [158]: %timeit jez(df1)
10 loops, best of 3: 57.1 ms per loop

In [159]: %timeit jez2(df2)
10 loops, best of 3: 36.4 ms per loop

In [160]: %timeit rom(df)
10 loops, best of 3: 40.4 ms per loop

import pandas as pd

mydict = {'ID':[1,2,2,3,4,5,5,6,7],
             'Name':['A','B','C','D','E','F','G','H','I'],
             'Rate':[65.5,67.3,78.8,65,45.3,52,66,34,2]}

df=pd.DataFrame(mydict)
print(df)


df =  pd.concat([df]*10000).reset_index(drop=True)

df1 = df.copy()
df2 = df.copy()

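def jez(df):
    def f(x):
        # previous Name within the group, else the next Name
        x['Secondname'] = x['Name'].shift(1).combine_first(x['Name'].shift(-1))
        return x

    # group_keys=False keeps the original row index in the result
    return df.groupby('ID', group_keys=False).apply(f)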


def jez2(df): 
    #print df.duplicated('ID', keep='first')
    #print df.duplicated('ID', keep='last')
    df.loc[ df.duplicated('ID', keep='first'), 'first'] = df['Name']
    df.loc[ df.duplicated('ID', keep='last'), 'last'] = df['Name']
    #print df

    df['SecondName'] = df['first'].shift(-1).combine_first(df['last'].shift(1))
    df = df.drop(['first', 'last'], axis=1)
    return df



def rom(df):

    # cpIDs = True if the next row has the same ID
    # (shift(-1) keeps the index aligned; the last row compares against NaN, so it is False)
    df['cpIDs'] = df['ID'] == df['ID'].shift(-1)
    # ShiftName == Name of the next row
    df['ShiftName'] = df['Name'].shift(-1)
    # fill in SecondName
    df.loc[df['cpIDs'], 'SecondName'] = df.loc[df['cpIDs'], 'ShiftName']
    # remove columns
    del df['cpIDs']
    del df['ShiftName']
    return df


print(jez(df1))
print(jez2(df2))
print(rom(df))

print(jez(df1))
   ID Name  Rate Secondname
0   1    A  65.5        NaN
1   2    B  67.3          C
2   2    C  78.8          B
3   3    D  65.0        NaN
4   4    E  45.3        NaN
5   5    F  52.0          G
6   5    G  66.0          F
7   6    H  34.0        NaN
8   7    I   2.0        NaN
print(jez2(df2))
   ID Name  Rate SecondName
0   1    A  65.5        NaN
1   2    B  67.3          C
2   2    C  78.8          B
3   3    D  65.0        NaN
4   4    E  45.3        NaN
5   5    F  52.0          G
6   5    G  66.0          F
7   6    H  34.0        NaN
8   7    I   2.0        NaN
print(rom(df))
   ID Name  Rate SecondName
0   1    A  65.5        NaN
1   2    B  67.3          C
2   2    C  78.8        NaN
3   3    D  65.0        NaN
4   4    E  45.3        NaN
5   5    F  52.0          G
6   5    G  66.0        NaN
7   6    H  34.0        NaN
8   7    I   2.0        NaN

Edit:

If there are more duplicated pairs of names, use shift to create the first and last columns:

df.loc[ df['ID'] == df['ID'].shift(), 'first'] = df['Name']
df.loc[ df['ID'] == df['ID'].shift(-1), 'last'] = df['Name']
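A runnable sketch of this shift-based variant, under a few assumptions: the function name `add_second_name` and the sample data are made up for illustration, and `Series.where` is used in place of the `first` and `last` helper columns so nothing has to be dropped afterwards. The sample repeats ID 2 in two separate pairs, which is the case where comparing only adjacent rows matters.

```python
import pandas as pd

def add_second_name(df):
    """Return a copy of df with a SecondName column taken from the adjacent row sharing the ID."""
    out = df.copy()
    same_as_prev = out['ID'] == out['ID'].shift()     # row continues the previous row's ID
    same_as_next = out['ID'] == out['ID'].shift(-1)   # next row continues this row's ID

    first = out['Name'].where(same_as_prev)   # Name on the second row of each pair, NaN elsewhere
    last = out['Name'].where(same_as_next)    # Name on the first row of each pair, NaN elsewhere

    out['SecondName'] = first.shift(-1).combine_first(last.shift(1))
    return out

# Hypothetical data: ID 2 occurs in two separate adjacent pairs
sample = pd.DataFrame({'ID': [1, 2, 2, 3, 2, 2, 4],
                       'Name': list('ABCDEFG')})
result = add_second_name(sample)
print(result)
```

With this data, the row with ID 3 stays empty because neither neighbour shares its ID, whereas masks built from duplicated('ID', ...) would also flag the second pair of ID 2 rows relative to the first pair.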