I am new to the PySpark world. I want to join two data frames, df and df_sd, on the column days. The join should also use the Name column from the df data frame: every Name should be paired with every value of days, and where a Name/days combination has no matching value in df, the result should be null. See the code and expected output below for a better understanding.
import findspark
findspark.init("/opt/spark")
from pyspark.sql import Row, SparkSession
from pyspark.sql.types import IntegerType

Mydata = Row("Name", "Number", "days")

spark = SparkSession \
    .builder \
    .appName("DataFrame Learning") \
    .getOrCreate()

mydata1 = Mydata("A", 100, 1)
mydata2 = Mydata("A", 200, 2)
mydata3 = Mydata("B", 300, 1)
mydata4 = Mydata("B", 400, 2)
mydata5 = Mydata("B", 500, 3)
mydata6 = Mydata("C", 600, 1)
myDataAll = [mydata1, mydata2, mydata3, mydata4, mydata5, mydata6]

# df_sd holds the standard days; df holds the actual data
STANDARD_TENORS = [1, 2, 3]
df_sd = spark.createDataFrame(STANDARD_TENORS, IntegerType())
df_sd = df_sd.withColumnRenamed("value", "days")
df_sd.show()

df = spark.createDataFrame(myDataAll)
df.show()
# +----+
# |days|
# +----+
# | 1|
# | 2|
# | 3|
# +----+
#
# +----+------+----+
# |Name|Number|days|
# +----+------+----+
# | A| 100| 1|
# | A| 200| 2|
# | B| 300| 1|
# | B| 400| 2|
# | B| 500| 3|
# | C| 600| 1|
# +----+------+----+
Please see the expected result of the join below.
# +----+------+----+
# |Name|Number|days|
# +----+------+----+
# | A| 100| 1|
# | A| 200| 2|
# | A| null| 3|
# | B| 300| 1|
# | B| 400| 2|
# | B| 500| 3|
# | C| 600| 1|
# | C| null| 2|
# | C| null| 3|
# +----+------+----+
Will df_sd always be a small list of days? Which Spark version are you on? - murtihash