Converting a datetime to a date in PySpark


I have a dataframe with two columns, "date" (data type: string) and "modified" (data type: bigint), as shown below:

+-------------------------------------+-------------+
|                                 date|     modified|
+-------------------------------------+-------------+
|Mon, 18 Dec 2017 22:52:37 +0000 (UTC)|1513637587000|
|      Mon, 18 Dec 2017 22:52:23 +0000|1513637587000|
|      Mon, 18 Dec 2017 22:52:03 +0000|1513637587000|
|Mon, 18 Dec 2017 22:51:43 +0000 (UTC)|1513637527000|
|      Mon, 18 Dec 2017 22:51:31 +0000|1513637527000|
|      Mon, 18 Dec 2017 22:51:38 +0000|1513637527000|
|      Mon, 18 Dec 2017 22:51:09 +0000|1513637526000|
|      Mon, 18 Dec 2017 22:50:55 +0000|1513637466000|
|      Mon, 18 Dec 2017 22:50:35 +0000|1513637466000|
|      Mon, 18 Dec 2017 17:49:35 -0500|1513637407000|
+-------------------------------------+-------------+

How can I extract YYYY-mm-dd (2017-12-18) from these two columns? I tried unix_timestamp and to_timestamp, but neither worked; they returned null values.
2 Answers

You can use the from_unixtime function to convert the bigint Unix timestamp to a timestamp type, and then cast it to a date:
import pyspark.sql.functions as F

df2 = df.withColumn('parsed_date', F.from_unixtime(F.col('modified')/1000).cast('date'))

df2.show()
+--------------------+-------------+-----------+
|                date|     modified|parsed_date|
+--------------------+-------------+-----------+
|Mon, 18 Dec 2017 ...|1513637587000| 2017-12-18|
|Mon, 18 Dec 2017 ...|1513637587000| 2017-12-18|
|Mon, 18 Dec 2017 ...|1513637587000| 2017-12-18|
|Mon, 18 Dec 2017 ...|1513637527000| 2017-12-18|
|Mon, 18 Dec 2017 ...|1513637527000| 2017-12-18|
|Mon, 18 Dec 2017 ...|1513637527000| 2017-12-18|
|Mon, 18 Dec 2017 ...|1513637526000| 2017-12-18|
|Mon, 18 Dec 2017 ...|1513637466000| 2017-12-18|
|Mon, 18 Dec 2017 ...|1513637466000| 2017-12-18|
|Mon, 18 Dec 2017 ...|1513637407000| 2017-12-18|
+--------------------+-------------+-----------+
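As a sanity check outside Spark, the same conversion can be reproduced with Python's standard library (the sample value below is taken from the modified column above); like from_unixtime, datetime.fromtimestamp expects seconds, so the millisecond value is divided by 1000 first:

```python
from datetime import datetime, timezone

millis = 1513637587000  # epoch time in milliseconds, from the "modified" column

# divide by 1000 to get seconds, then interpret as a UTC timestamp
dt = datetime.fromtimestamp(millis / 1000, tz=timezone.utc)
print(dt.date())  # 2017-12-18
```

Note that from_unixtime in Spark uses the session time zone rather than UTC, so the resulting date can differ near midnight depending on the spark.sql.session.timeZone setting.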

Many questions have already been posted here about converting strings to dates in Spark (Convert pyspark string to date format, Convert string to date format in Dataframes, ...).
You get null because the modified column holds epoch time in milliseconds; you need to divide it by 1000 to get seconds before converting it to a timestamp:
from pyspark.sql import functions as F

df1 = df.withColumn(
    "modified_as_date",
    F.to_timestamp(F.col("modified") / 1000).cast("date")
).withColumn(
    "date_as_date",
    F.to_date("date", "EEE, dd MMM yyyy HH:mm:ss")
)

df1.show(truncate=False)

#+-------------------------------------+-------------+----------------+------------+
#|date                                 |modified     |modified_as_date|date_as_date|
#+-------------------------------------+-------------+----------------+------------+
#|Mon, 18 Dec 2017 22:52:37 +0000 (UTC)|1513637587000|2017-12-18      |2017-12-18  |
#|Mon, 18 Dec 2017 22:52:23 +0000      |1513637587000|2017-12-18      |2017-12-18  |
#|Mon, 18 Dec 2017 22:52:03 +0000      |1513637587000|2017-12-18      |2017-12-18  |
#|Mon, 18 Dec 2017 22:51:43 +0000 (UTC)|1513637527000|2017-12-18      |2017-12-18  |
#|Mon, 18 Dec 2017 22:51:31 +0000      |1513637527000|2017-12-18      |2017-12-18  |
#|Mon, 18 Dec 2017 22:51:38 +0000      |1513637527000|2017-12-18      |2017-12-18  |
#|Mon, 18 Dec 2017 22:51:09 +0000      |1513637526000|2017-12-18      |2017-12-18  |
#|Mon, 18 Dec 2017 22:50:55 +0000      |1513637466000|2017-12-18      |2017-12-18  |
#|Mon, 18 Dec 2017 22:50:35 +0000      |1513637466000|2017-12-18      |2017-12-18  |
#|Mon, 18 Dec 2017 17:49:35 -0500      |1513637407000|2017-12-18      |2017-12-18  |
#+-------------------------------------+-------------+----------------+------------+
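If you just want to verify the string format outside Spark, the date column can also be parsed with Python's standard library. This is a standalone sketch, not part of the Spark job: the optional "(UTC)" comment is stripped first, since the %z directive only understands numeric offsets like +0000.

```python
from datetime import datetime

samples = [
    "Mon, 18 Dec 2017 22:52:37 +0000 (UTC)",
    "Mon, 18 Dec 2017 17:49:35 -0500",
]

for s in samples:
    # drop the trailing "(UTC)" comment if present, then parse the
    # RFC 2822-style date with an explicit numeric offset
    cleaned = s.split(" (")[0]
    dt = datetime.strptime(cleaned, "%a, %d %b %Y %H:%M:%S %z")
    print(dt.date())  # 2017-12-18 for both samples
```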
