Given a PySpark DataFrame, is it possible to obtain a list of the source columns that the DataFrame references?
Perhaps a more concrete example will help explain what I'm after. Say I define a DataFrame:
import pyspark.sql.functions as func
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
source_df = spark.createDataFrame(
    [("pru", 23, "finance"), ("paul", 26, "HR"), ("noel", 20, "HR")],
    ["name", "age", "department"],
)
source_df.createOrReplaceTempView("people")
sqlDF = spark.sql("SELECT name, age, department FROM people")
df = sqlDF.groupBy("department").agg(func.max("age").alias("max_age"))
df.show()
It returns:
+----------+--------+
|department|max_age |
+----------+--------+
| finance| 23|
| HR| 26|
+----------+--------+
The columns referenced by df are [department, age]. Is it possible to get that list of referenced columns programmatically?
Thanks to "Capturing the result of explain() in pyspark", I know I can extract the plan as a string:
df._sc._jvm.PythonSQLUtils.explainString(df._jdf.queryExecution(), "formatted")
That call returns:
== Physical Plan ==
AdaptiveSparkPlan (6)
+- HashAggregate (5)
   +- Exchange (4)
      +- HashAggregate (3)
         +- Project (2)
            +- Scan ExistingRDD (1)
(1) Scan ExistingRDD
Output [3]: [name#0, age#1L, department#2]
Arguments: [name#0, age#1L, department#2], MapPartitionsRDD[4] at applySchemaToPythonRDD at NativeMethodAccessorImpl.java:0, ExistingRDD, UnknownPartitioning(0)
(2) Project
Output [2]: [age#1L, department#2]
Input [3]: [name#0, age#1L, department#2]
(3) HashAggregate
Input [2]: [age#1L, department#2]
Keys [1]: [department#2]
Functions [1]: [partial_max(age#1L)]
Aggregate Attributes [1]: [max#22L]
Results [2]: [department#2, max#23L]
(4) Exchange
Input [2]: [department#2, max#23L]
Arguments: hashpartitioning(department#2, 200), ENSURE_REQUIREMENTS, [plan_id=60]
(5) HashAggregate
Input [2]: [department#2, max#23L]
Keys [1]: [department#2]
Functions [1]: [max(age#1L)]
Aggregate Attributes [1]: [max(age#1L)#12L]
Results [2]: [department#2, max(age#1L)#12L AS max_age#13L]
(6) AdaptiveSparkPlan
Output [2]: [department#2, max_age#13L]
Arguments: isFinalPlan=false
This is useful, but it is not what I need. What I need is a list of the referenced columns. Is that possible?
Perhaps another way of phrasing the question: is there a way to get the execution plan as an object that I can traverse/explore?
UPDATE. Thanks to @matt-andruff's reply, I've since gotten this far:
df._jdf.queryExecution().executedPlan().treeString().split("+-")[-2]
which returns:
' Project [age#1L, department#2]\n '
I guess I could parse the information I need out of that, but it's far from an elegant approach, and it's error-prone.
What I'm really after is a reliable, safe, API-supported way of getting this information. I'm starting to think it simply isn't possible.
df.columns - Steven