zero323's answer is efficient. Most of the other answers should be avoided.
Here's another efficient solution that leverages the quinn library and works well in production codebases:
import quinn

df = spark.createDataFrame([(1, 2), (3, 4)], ['x1', 'x2'])

def rename_col(s):
    mapping = {'x1': 'x3', 'x2': 'x4'}
    return mapping[s]

actual_df = df.transform(quinn.with_columns_renamed(rename_col))

actual_df.show()
Here's the outputted DataFrame:
+---+---+
| x3| x4|
+---+---+
|  1|  2|
|  3|  4|
+---+---+
Let's look at the logical plans outputted by actual_df.explain(True) and verify that they're efficient:
== Parsed Logical Plan ==
'Project ['x1 AS x3#52, 'x2 AS x4#53]
+- LogicalRDD [x1#48L, x2#49L], false
== Analyzed Logical Plan ==
x3: bigint, x4: bigint
Project [x1#48L AS x3#52L, x2#49L AS x4#53L]
+- LogicalRDD [x1#48L, x2#49L], false
== Optimized Logical Plan ==
Project [x1#48L AS x3#52L, x2#49L AS x4#53L]
+- LogicalRDD [x1#48L, x2#49L], false
== Physical Plan ==
*(1) Project [x1#48L AS x3#52L, x2#49L AS x4#53L]
The parsed logical plan and the physical plan are basically equal, so Catalyst doesn't need to do any heavy optimization work.
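If you'd rather avoid the extra dependency, the single-select rename that quinn performs can be sketched in plain PySpark. The helper below (a sketch of the technique, not quinn's actual implementation) builds one "old AS new" expression per column for a single selectExpr call, which yields one Project node just like the plan above; rename_col is the mapping function from the earlier snippet:

```python
def renamed_select_exprs(columns, fun):
    # Build one "`old` AS `new`" expression per column so every rename
    # happens inside a single selectExpr (i.e. a single Project node).
    return [f"`{old}` AS `{fun(old)}`" for old in columns]

def rename_col(s):
    mapping = {'x1': 'x3', 'x2': 'x4'}
    return mapping[s]

# df.selectExpr(*renamed_select_exprs(df.columns, rename_col))
# produces the same one-Project plan shown above.
print(renamed_select_exprs(['x1', 'x2'], rename_col))
# → ['`x1` AS `x3`', '`x2` AS `x4`']
```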
Avoid calling withColumnRenamed multiple times, because it creates an inefficient parsed plan that needs to be optimized. Let's take a look at an unnecessarily complex parsed plan:
def rename_columns(df, columns):
    for old_name, new_name in columns.items():
        df = df.withColumnRenamed(old_name, new_name)
    return df

actual_df = rename_columns(df, {'x1': 'x3', 'x2': 'x4'})

actual_df.explain(True)
== Parsed Logical Plan ==
Project [x3#52L, x2#49L AS x4#55L]
+- Project [x1#48L AS x3#52L, x2#49L]
+- LogicalRDD [x1#48L, x2#49L], false
== Analyzed Logical Plan ==
x3: bigint, x4: bigint
Project [x3#52L, x2#49L AS x4#55L]
+- Project [x1#48L AS x3#52L, x2#49L]
+- LogicalRDD [x1#48L, x2#49L], false
== Optimized Logical Plan ==
Project [x1#48L AS x3#52L, x2#49L AS x4#55L]
+- LogicalRDD [x1#48L, x2#49L], false
== Physical Plan ==
*(1) Project [x1#48L AS x3#52L, x2#49L AS x4#55L]
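For completeness, DataFrame.toDF also renames every column in one pass and produces a single Project node; only the name-list construction needs care so that columns absent from the mapping keep their original names. A minimal sketch, with the mapping dict mirroring the example above:

```python
mapping = {'x1': 'x3', 'x2': 'x4'}

def renamed(columns, mapping):
    # Fall back to the original name for any column the mapping omits.
    return [mapping.get(c, c) for c in columns]

# renamed_df = df.toDF(*renamed(df.columns, mapping)) renames everything
# in a single Project node, just like the quinn-based version.
print(renamed(['x1', 'x2'], mapping))
# → ['x3', 'x4']
```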
…the withColumnRenamed answer. The withColumnRenamed method should be avoided for the reasons outlined in this blog post. See my answer for more details. - Powers