I have a DataFrame with two columns, account_id and email_address. I want to add a third column, updated_email_address, whose value is computed by calling a function on email_address. Here is my code:
def update_email(email):
    print("== email to be updated: " + email)
    today = datetime.date.today()
    updated = substring(email, -8, 8) + str(today.strftime('%m')) + str(today.strftime('%d')) + "_updated"
    return updated

df.withColumn('updated_email_address', update_email(df.email_address))
But in the result, the updated_email_address column is null:
+---------------+--------------+---------------------+
|account_id     |email_address |updated_email_address|
+---------------+--------------+---------------------+
|123456gd7tuhha |abc@test.com  |null                 |
|djasevneuagsj1 |cde@test.com  |null                 |
+---------------+--------------+---------------------+
Inside the update_email function, the print statement produced the following: Column<b'(email_address + == email to be updated: )'>
It also showed the data types of the DataFrame's columns:
dfData:pyspark.sql.dataframe.DataFrame
account_id:string
email_address:string
updated_email_address:double
Why is the updated_email_address column of type double?
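
A likely explanation, sketched under assumptions: update_email is an ordinary Python function, so it runs once on the driver with the Column object df.email_address rather than once per row. On a Column, `+` is overloaded to build an arithmetic expression, and Spark casts string operands of arithmetic `+` to double, which would give null for non-numeric strings and a double result type. The snippet below imitates only the operator overloading in plain Python; `FakeColumn`, `_name`, and `calls` are hypothetical names made up for illustration and are not part of PySpark.

```python
class FakeColumn:
    """Hypothetical stand-in for a lazy column expression (NOT the real PySpark Column)."""

    def __init__(self, expr):
        self.expr = expr

    def __add__(self, other):
        # column + value: record a new expression; no row data is touched
        return FakeColumn(f"({self.expr} + {_name(other)})")

    def __radd__(self, other):
        # value + column: Python dispatches here for "some string" + column;
        # the column is written first, mirroring the output shown in the question
        return FakeColumn(f"({self.expr} + {_name(other)})")

    def __repr__(self):
        return f"Column<{self.expr}>"


def _name(value):
    # Render either a nested expression or a plain Python value
    return value.expr if isinstance(value, FakeColumn) else str(value)


calls = []  # records every invocation of update_email


def update_email(email):
    calls.append(email)
    # This does not concatenate strings; it builds an expression object
    return "== email to be updated: " + email


result = update_email(FakeColumn("email_address"))
print(result)      # Column<(email_address + == email to be updated: )>
print(len(calls))  # 1, the function ran once, not once per row
```

If this is indeed the cause, the string pieces would typically be joined with `concat` and `lit` from `pyspark.sql.functions` instead of `+`, or the Python function would be wrapped as a UDF so that it receives actual string values.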