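I want to select multiple columns from an existing DataFrame (created after a join) and order the fields to match the structure of my target table. How can I do this? My current approach is shown below. I can select the columns I need, but I cannot get them into the required order.
Required (target table structure):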
hist_columns = ("acct_nbr","account_sk_id", "zip_code","primary_state", "eff_start_date" ,"eff_end_date","eff_flag")
account_sk_df = hist_process_df.join(broadcast(df_sk_lkp) ,'acct_nbr','inner' )
account_sk_df_ld = account_sk_df.select([c for c in account_sk_df.columns if c in hist_columns])
>>> account_sk_df
DataFrame[acct_nbr: string, primary_state: string, zip_code: string, eff_start_date: string, eff_end_date: string, eff_flag: string, hash_sk_id: string, account_sk_id: int]
>>> account_sk_df_ld
DataFrame[acct_nbr: string, primary_state: string, zip_code: string, eff_start_date: string, eff_end_date: string, eff_flag: string, account_sk_id: int]
account_sk_id needs to be in second position. What is the best way to achieve this?
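One possible approach, sketched under assumptions: the comprehension in the question iterates over account_sk_df.columns, so the result keeps the DataFrame's order. Iterating over hist_columns instead yields the target order. The helper name ordered_columns and the list joined_columns are hypothetical, introduced only for illustration. joined_columns copies the schema printed above so the ordering logic can be checked without a Spark session.

```python
# Target table column order, copied from the question.
hist_columns = ("acct_nbr", "account_sk_id", "zip_code", "primary_state",
                "eff_start_date", "eff_end_date", "eff_flag")

# Column order of the joined DataFrame, as printed in the question
# (stands in for account_sk_df.columns).
joined_columns = ["acct_nbr", "primary_state", "zip_code", "eff_start_date",
                  "eff_end_date", "eff_flag", "hash_sk_id", "account_sk_id"]


def ordered_columns(target, available):
    """Return the names in `target` order, keeping only those in `available`."""
    available = set(available)
    return [c for c in target if c in available]


select_cols = ordered_columns(hist_columns, joined_columns)
print(select_cols)

# With a live Spark session, the same list would drive the select
# (account_sk_df is the joined DataFrame from the question):
# account_sk_df_ld = account_sk_df.select(*select_cols)
```

If every target column is known to exist in the joined DataFrame, account_sk_df.select(*hist_columns) alone should be enough. The membership check only guards against names missing from the join result. select also accepts a single list of names, so unpacking with * is largely a matter of style.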
df.select("col1", "col2"), rather than df.select(["col1", "col2"]). Using the * operator unpacks the list into individual column names, which is what PySpark expects. - kevin_theinfinityfund