How to change a column's value in Spark SQL


In SQL, I can easily update column values with an UPDATE statement. For example, I have a table named student:

student_id   grade   new_student_id
123          B       234
555          A       null

UPDATE Student
SET student_id = new_student_id
WHERE new_student_id IS NOT NULL

How can I achieve the same thing in Spark with Spark SQL (PySpark)?

Related: https://dev59.com/5VoU5IYBdhLWcg3wYWOx - himanshuIIITian
1 Answer

You can use withColumn to overwrite the existing new_student_id column: keep the original new_student_id value when it is not null, otherwise use the value from the student_id column:
from pyspark.sql.functions import col, when

# Create sample data
students = sc.parallelize([(123, 'B', 234), (555, 'A', None)]) \
    .toDF(['student_id', 'grade', 'new_student_id'])

# Use withColumn to use student_id when new_student_id is not populated
cleaned = students.withColumn(
    "new_student_id",
    when(col("new_student_id").isNull(), col("student_id"))
    .otherwise(col("new_student_id")))
cleaned.show()

With your sample data as input:

+----------+-----+--------------+
|student_id|grade|new_student_id|
+----------+-----+--------------+
|       123|    B|           234|
|       555|    A|          null|
+----------+-----+--------------+

the output is:
+----------+-----+--------------+
|student_id|grade|new_student_id|
+----------+-----+--------------+
|       123|    B|           234|
|       555|    A|           555|
+----------+-----+--------------+
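
Since the question also asks about Spark SQL proper, here is a minimal sketch of the same null-fallback logic written with coalesce and as a plain SQL query. It assumes Spark 2.x or later with a SparkSession named spark available (as in the PySpark shell) and reuses the students DataFrame built in the answer above:

from pyspark.sql.functions import coalesce, col

# coalesce() returns the first non-null argument, so this expresses the same
# "fall back to student_id when new_student_id is null" logic as withColumn + when
cleaned = students.withColumn(
    "new_student_id",
    coalesce(col("new_student_id"), col("student_id")))
cleaned.show()

# The same query expressed in Spark SQL over a temporary view
students.createOrReplaceTempView("student")
spark.sql("""
    SELECT student_id,
           grade,
           COALESCE(new_student_id, student_id) AS new_student_id
    FROM student
""").show()

Note that Spark DataFrames are immutable, so unlike the SQL UPDATE in the question, both variants produce a new DataFrame rather than modifying the original one in place.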
