Shift specific rows of specific columns in a pandas dataframe

3

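I have this dataframe df.

I am trying to shift the rows whose first two columns contain NaNs to the left, so that the values on the right fill those columns. This is what I am currently trying:

(Note: the match dataframe can be downloaded from this link: https://www.kaggle.com/hugomathien/soccer)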

#original dataframe
<class 'pandas.core.frame.DataFrame'>
Int64Index: 21374 entries, 145 to 25978
Data columns (total 47 columns):
id                  21374 non-null int64
country_id          21374 non-null int64
league_id           21374 non-null int64
season              21374 non-null object
stage               21374 non-null int64
date                21374 non-null object
match_api_id        21374 non-null int64
home_team_api_id    21374 non-null int64
away_team_api_id    21374 non-null int64
home_team_goal      21374 non-null int64
away_team_goal      21374 non-null int64
goal                13325 non-null object
shoton              13325 non-null object
shotoff             13325 non-null object
foulcommit          13325 non-null object
card                13325 non-null object
cross               13325 non-null object
corner              13325 non-null object
possession          13325 non-null object
BSA                 11856 non-null float64
Home Team           21374 non-null object
Away Team           21374 non-null object
League              21374 non-null object
Country             21374 non-null object
home_player_1       21374 non-null object
home_player_2       21374 non-null object
home_player_3       21374 non-null object
home_player_4       21374 non-null object
home_player_5       21374 non-null object
home_player_6       21374 non-null object
home_player_7       21374 non-null object
home_player_8       21374 non-null object
home_player_9       21374 non-null object
home_player_10      21374 non-null object
home_player_11      21374 non-null object
away_player_1       21374 non-null object
away_player_2       21374 non-null object
away_player_3       21374 non-null object
away_player_4       21374 non-null object
away_player_5       21374 non-null object
away_player_6       21374 non-null object
away_player_7       21374 non-null object
away_player_8       21374 non-null object
away_player_9       21374 non-null object
away_player_10      21374 non-null object
away_player_11      21374 non-null object
winner              21374 non-null object
dtypes: float64(1), int64(9), object(37)
memory usage: 7.8+ MB

Creating the dataframe:
columns = match.columns[match.columns.get_loc('home_player_1'):match.columns.get_loc('away_player_1')+1].values
columns = list(columns)

player_appearences = match.groupby(columns[0]).size().reset_index()
player_appearences.rename(columns = {0:"Count_{}".format(player_appearences.columns[0][len(player_appearences.columns[0])-1])}, inplace = True, errors='raise')
player_appearences
for i in range(1,12):
    player_appearences2 = match.groupby(columns[i]).size().reset_index()
    player_appearences2
    player_appearences2.rename(columns = {0:"Count_{}".format(player_appearences2.columns[0][len(player_appearences2.columns[0])-1])}, inplace = True, errors='raise')
    player_appearences = player_appearences.merge(right = player_appearences2,how="outer",left_on ="{}".format(player_appearences.columns[0]),right_on = "{}".format(player_appearences2.columns[0]))
    player_appearences
    #overwrite nans in first column with names in current [i] player column

#select rows where first two columns give nan values
player_appearences.loc[(player_appearences.loc[:,"home_player_1"].isna()==True) & (player_appearences.loc[:,"Count_1"].isna()==True),["home_player_1","Count_1"]] = player_appearences.loc[(player_appearences.loc[:,"home_player_1"].isna()==True) & (player_appearences.loc[:,"Count_1"].isna()==True),["home_player_2","Count_2"]]

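When I print player_appearences, the dataframe has not changed. I am not sure whether the assignment does nothing at all or whether it creates a copy of the original dataframe. Can someone tell me why this does not work? If there is a better way, please suggest it.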


1
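Could you include a sample of the dataframe instead of an image? - Edeki Okoh
How did you get that dataframe in the first place? - Umar.H
Based on your description, the simplest approach is probably to split the dataframe into two parts, run dropna on both, and then merge them back together. Please include a [mcve] with sample input and output along with the code you have already tried. - G. Anderson
@EdekiOkoh Updated the question. Sorry. - Sean
You need to share the dataframe as code so that we can paste it into our IDE. - gold_cy
@aws_apprentice I downloaded a ready-made database from here: https://www.kaggle.com/hugomathien/soccer - Sean
2 Answers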

2
Use DataFrame.rename, and then all you need is DataFrame.stack (dropna=True by default) + DataFrame.unstack:
df = (df.rename(columns = {'home_player_2':'home_player_1',
                           'Count_2':'Count_1'}).stack().unstack()
        .reindex(columns = df.columns[:2]))
print(df)
  home_player_1 Count_1
0         Aaron       1
1          Adam       2
2         Ziggy       3
3        Zoltan       4

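Or use DataFrame.shift + DataFrame.where:
# Shift by 2 positions (one player/count pair). Recent pandas versions shift by
# column position; older versions shifted mixed-dtype frames within each dtype
# block, where -1 had the same effect.
df.where(df.notna(),df.shift(-2,axis = 1)).iloc[:,:2]


  home_player_1  Count_1
0         Aaron      1.0
1          Adam      2.0
2         Ziggy      3.0
3        Zoltan      4.0

Details

print(df.where(df.notna(),df.shift(-2,axis = 1)))
  home_player_1  Count_1 home_player_2  Count_2
0         Aaron      1.0           NaN      NaN
1          Adam      2.0           NaN      NaN
2         Ziggy      3.0         Ziggy      3.0
3        Zoltan      4.0        Zoltan      4.0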

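Did you use DataFrame.where? Do I need to check column by column? My answer is completely different.... - ansev
Yes, your original answer was indeed different and used a different approach (stack/unstack). After I posted an alternative, you added it to your answer as if it had been there all along. That is a bit dishonest. - wombatonfire
I was writing my answer, I did not copy anything from you, I did not even see that shift; when I saw all those isna calls I did not want to look any further. My answer is clearly different and simpler. But thoughts are free. - ansev
Did it occur to you to use where, or iloc, or DataFrame.notna directly? Why do you think you are the only one who could come up with shift? You just posted it inside an infinitely worse solution. - ansev
Lol, dude, have you hired a lawyer yet? You made my day :D - wombatonfire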

1
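You can use shift(-2, axis=1) to move the columns two positions to the left (one player/count pair), and df[df.home_player_1.isna() & df.Count_1.isna()] to select the rows to affect. The shifted rows have to be assigned back into the dataframe.
df = pd.DataFrame([['Aaron', 1, None, None],
                   ['Adam', 2, None, None],
                   [None, None, 'Ziggy', 3],
                   [None, None, 'Zoltan', 4]],
                  columns=['home_player_1', 'Count_1', 'home_player_2', 'Count_2'])

home_player_1   Count_1     home_player_2   Count_2
Aaron           1.0         None            NaN
Adam            2.0         None            NaN
None            NaN         Ziggy           3.0
None            NaN         Zoltan          4.0

# Recent pandas versions shift by column position, hence -2; older versions
# shifted mixed-dtype frames within each dtype block, where -1 had the same effect.
df[df.home_player_1.isna() & df.Count_1.isna()] = df[df.home_player_1.isna() & df.Count_1.isna()].shift(-2, axis=1)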

home_player_1   Count_1     home_player_2   Count_2
Aaron           1.0         None            NaN
Adam            2.0         None            NaN
Ziggy           3.0         NaN             NaN
Zoltan          4.0         NaN             NaN
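
As for why the .loc assignment in the question appeared to do nothing: when the right-hand side of a .loc assignment is a DataFrame, pandas aligns it on row and column labels before writing. The source columns (home_player_2, Count_2) do not exist among the target columns (home_player_1, Count_1), so the aligned values are all NaN and the cells look unchanged. The following is a minimal sketch on the toy frame above, not the asker's real data; the names left, right, aligned and by_position are introduced here purely for illustration. Converting the right-hand side to a NumPy array drops the labels, so the values are written by position.

```python
import pandas as pd

df = pd.DataFrame(
    [["Aaron", 1, None, None],
     ["Adam", 2, None, None],
     [None, None, "Ziggy", 3],
     [None, None, "Zoltan", 4]],
    columns=["home_player_1", "Count_1", "home_player_2", "Count_2"],
)

left = ["home_player_1", "Count_1"]    # target columns (illustrative names)
right = ["home_player_2", "Count_2"]   # source columns (illustrative names)

# Compute the row mask once: rows where both target columns are missing.
mask = df[left].isna().all(axis=1)

# Label-aligned assignment: the source labels are not among the target labels,
# so NaN is written and the target cells stay empty.
aligned = df.copy()
aligned.loc[mask, left] = aligned.loc[mask, right]

# Positional assignment: .to_numpy() discards the labels, so the values are
# written into the target cells in order.
by_position = df.copy()
by_position.loc[mask, left] = by_position.loc[mask, right].to_numpy()

print(aligned[left])
print(by_position[left])
```

In this sketch the source columns are left as they were; drop them afterwards if only the first pair is needed.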

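Why use df.home_player_1.isna() & df.Count_1.isna()? Perhaps df.iloc[:,:2].isna().all(axis=1) would be better. What happens if there are 50 columns instead of 2? - ansev
Why not use m = df.home_player_1.isna() & df.Count_1.isna()? That way the mask is not computed twice. - ansev
Given the scope of the question, we know there are 4 columns, and the question is not about performance. - wombatonfire
I think a solution that scales to more columns is better, easier to read and, of course, more efficient. If I were the OP, I would look for the most efficient solution. - ansev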
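
One way the scalability point raised in these comments could be handled is sketched below. It is a different technique from either answer: instead of shifting, it back-fills along each row and keeps the first column, which gives the first non-missing value per row however many player/count pairs there are. It assumes the columns follow the home_player_N / Count_N naming used in the toy frame; the names names, counts and result are illustrative only.

```python
import pandas as pd

df = pd.DataFrame(
    [["Aaron", 1, None, None],
     ["Adam", 2, None, None],
     [None, None, "Ziggy", 3],
     [None, None, "Zoltan", 4]],
    columns=["home_player_1", "Count_1", "home_player_2", "Count_2"],
)

# Select each family of columns by name, back-fill along the row so the first
# column holds the first non-missing value, then keep only that column.
names = df.filter(like="home_player_").bfill(axis=1).iloc[:, 0]
counts = df.filter(like="Count_").bfill(axis=1).iloc[:, 0]

result = pd.DataFrame({"home_player_1": names, "Count_1": counts})
print(result)
```

This relies on each row having a value in exactly one pair, as in the example; if a row had values in several pairs, only the leftmost one would be kept.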
