Converting a Scala list to a DataFrame or Dataset


I am new to Scala. I am trying to convert a Scala list (which holds the results of some calculated data on a source DataFrame) to a DataFrame or a Dataset. I have not found any direct way to do this. I have tried the following process to convert my list to a Dataset, but it does not seem to work. Three situations are shown below.

Can someone please give me a ray of hope on how to do this conversion? Thanks.

import org.apache.spark.sql.{DataFrame, Row, SQLContext, DataFrameReader}
import java.sql.{Connection, DriverManager, ResultSet, Timestamp}
import scala.collection._

case class TestPerson(name: String, age: Long, salary: Double)
var tom = new TestPerson("Tom Hanks",37,35.5)
var sam = new TestPerson("Sam Smith",40,40.5)

val PersonList = mutable.MutableList[TestPerson]()

//Adding data in list
PersonList += tom
PersonList += sam

//Situation 1: Trying to create a Dataset from a list of objects. Result: error
var personDS = Seq(PersonList).toDS()
/*
ERROR:
error: Unable to find encoder for type stored in a Dataset.  Primitive types
   (Int, String, etc) and Product types (case classes) are supported by     
importing sqlContext.implicits._  Support for serializing other types will  
be added in future releases.
     var personDS = Seq(PersonList).toDS()

*/
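A plain-Scala look at the types shows why Situation 1 fails: `Seq(PersonList)` does not flatten the list, it wraps the whole list as a single element, so Spark is asked for an encoder for the list type rather than for `TestPerson`. A minimal sketch (no Spark required; an immutable `List` is used here for brevity):

```scala
case class TestPerson(name: String, age: Long, salary: Double)

val people = List(
  TestPerson("Tom Hanks", 37, 35.5),
  TestPerson("Sam Smith", 40, 40.5)
)

// Seq(people) wraps the list instead of flattening it: the result is a
// one-element Seq whose single element is the whole List[TestPerson].
val wrapped = Seq(people)
println(wrapped.length)      // 1, not 2
println(wrapped.head.length) // 2
```

Calling `toDS()` on `people` itself would instead require an encoder for `TestPerson` (a case class), which the implicits do provide.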
//Situation 2: Trying to add data one by one. Result: not working as desired;
//the last record overwrites any existing data in the DS
var personDS = Seq(tom).toDS()
personDS = Seq(sam).toDS()

personDS += sam //not working. throwing error


//Situation 3: Working. However, I have consolidated data in the list which I
//want to convert to a DS; if I loop over the list to build comma-separated
//values and pass them here it works, but that adds an extra loop to the
//code, which I want to avoid.
var personDS = Seq(tom,sam).toDS()
scala> personDS.show()
+---------+---+------+
|     name|age|salary|
+---------+---+------+
|Tom Hanks| 37|  35.5|
|Sam Smith| 40|  40.5|
+---------+---+------+
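The extra loop feared in Situation 3 can be avoided in plain Scala: a list can be expanded into the varargs that `Seq(...)` expects using `: _*`. A sketch (using an immutable `List` for brevity):

```scala
case class TestPerson(name: String, age: Long, salary: Double)

val personList = List(
  TestPerson("Tom Hanks", 37, 35.5),
  TestPerson("Sam Smith", 40, 40.5)
)

// `: _*` expands the list into varargs, so this is equivalent to writing
// Seq(tom, sam) by hand -- no explicit loop over the elements required.
val asSeq = Seq(personList: _*)
println(asSeq.length) // 2
```

With the Spark implicits in scope, `asSeq.toDS()` then behaves exactly like the working `Seq(tom,sam).toDS()` above.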

What are your Spark and Scala versions? - Ajeet Shah
The Spark version is 1.6.1. - Leo
3 Answers


Try it without using Seq:

import scala.collection.mutable
import sqlContext.implicits._ // spark-shell imports this automatically

case class TestPerson(name: String, age: Long, salary: Double)
val tom = TestPerson("Tom Hanks",37,35.5)
val sam = TestPerson("Sam Smith",40,40.5)
val PersonList = mutable.MutableList[TestPerson]()
PersonList += tom
PersonList += sam

val personDS = PersonList.toDS()
println(personDS.getClass)
personDS.show()

val personDF = PersonList.toDF()
println(personDF.getClass)
personDF.show()
personDF.select("name", "age").show()

Output:

class org.apache.spark.sql.Dataset

+---------+---+------+
|     name|age|salary|
+---------+---+------+
|Tom Hanks| 37|  35.5|
|Sam Smith| 40|  40.5|
+---------+---+------+

class org.apache.spark.sql.DataFrame

+---------+---+------+
|     name|age|salary|
+---------+---+------+
|Tom Hanks| 37|  35.5|
|Sam Smith| 40|  40.5|
+---------+---+------+

+---------+---+
|     name|age|
+---------+---+
|Tom Hanks| 37|
|Sam Smith| 40|
+---------+---+

Also, make sure to move the declaration of the TestPerson case class outside of the object scope.


Thanks for the above solution; it works for the Dataset case. My ultimate goal is to get the data into a DataFrame. I used the command "scala> val RowsDF = sc.parallelize(personDS).toDF()" but got the error "<console>:51: error: type mismatch; found : org.apache.spark.sql.Dataset[TestPerson] required: Seq[?] val RowsDF = sc.parallelize(personDS).toDF() " - Leo
I got it with this: scala> val RowsDF = personDS.toDF() RowsDF: org.apache.spark.sql.DataFrame = [name: string, age: bigint, salary: double] - Leo

Using Seq:
val spark = SparkSession.builder().appName("Spark-SQL").master("local[2]").getOrCreate()

import spark.implicits._

var tom = new TestPerson("Tom Hanks",37,35.5)
var sam = new TestPerson("Sam Smith",40,40.5)

val PersonList = mutable.MutableList[TestPerson]()

//Adding data in list
PersonList += tom
PersonList += sam

//Compiles with spark.implicits._ in scope, but note that Seq(PersonList)
//produces a single-row Dataset containing the whole list; call toDS() on
//PersonList itself for one row per person.
var personDS = Seq(PersonList).toDS()

The toDS/toDF conversions come from SQLContext.implicits (imported here via spark.implicits._).


case class TestPerson(name: String, age: Long, salary: Double)

val spark = SparkSession.builder().appName("List to Dataset").master("local[*]").getOrCreate()

var tom = new TestPerson("Tom Hanks",37,35.5)
var sam = new TestPerson("Sam Smith",40,40.5)
   
// mutable.MutableList[TestPerson]() is not required; I used the approach
// below, which is cleaner
val PersonList =  List(tom,sam)

import spark.implicits._
PersonList.toDS().show
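If the elements really must be accumulated one at a time (as in the original question), a `ListBuffer` is a common plain-Scala alternative to `MutableList`; freeze it to an immutable `List` before handing it to Spark. A sketch (names are illustrative):

```scala
import scala.collection.mutable.ListBuffer

case class TestPerson(name: String, age: Long, salary: Double)

// Append incrementally, then convert to an immutable List; that List can
// then be passed to toDS()/toDF() with the Spark implicits in scope.
val buf = ListBuffer.empty[TestPerson]
buf += TestPerson("Tom Hanks", 37, 35.5)
buf += TestPerson("Sam Smith", 40, 40.5)
val people = buf.toList
println(people.map(_.name)) // List(Tom Hanks, Sam Smith)
```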

Content provided by Stack Overflow; translated from the original English post (original link).