我一直以为数据集/数据框架API是相同的,唯一的区别是数据集API会在编译时提供安全性。对吗?
所以,我有一个非常简单的案例:
case class Player (playerID: String, birthYear: Int)
val playersDs: Dataset[Player] = session.read
.option("header", "true")
.option("delimiter", ",")
.option("inferSchema", "true")
.csv(PeopleCsv)
.as[Player]
// Let's try to find players born in 1999.
// This will work, you have compile time safety... but it will not use predicate pushdown!!!
playersDs.filter(_.birthYear == 1999).explain()
// This will work as expected and use predicate pushdown!!!
// But you can't have compile time safety with this :(
playersDs.filter('birthYear === 1999).explain()
从第一个示例可以看出,它没有进行谓词下推(注意空的PushedFilters):
== Physical Plan ==
*(1) Filter <function1>.apply
+- *(1) FileScan csv [...] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:People.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<playerID:string,birthYear:int,birthMonth:int,birthDay:int,birthCountry:string,birthState:s...
第二个样本将正确执行(请注意PushedFilters):
== Physical Plan ==
*(1) Project [.....]
+- *(1) Filter (isnotnull(birthYear#11) && (birthYear#11 = 1999))
+- *(1) FileScan csv [...] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:People.csv], PartitionFilters: [], PushedFilters: [IsNotNull(birthYear), EqualTo(birthYear,1999)], ReadSchema: struct<playerID:string,birthYear:int,birthMonth:int,birthDay:int,birthCountry:string,birthState:s...
所以问题是...我如何使用DS Api,并在编译时确保安全,同时让谓词下推按预期工作?是否可能?如果不是...这是否意味着DS api为您提供了编译时安全性...但代价是性能!?(在处理大型parquet文件时,DF将更快)