PySpark: how to extract the hour from a timestamp

Question

14

I have a table like the following:

    df

+------------------------------------+-----------------------+
|identifier                          |timestamp              |
+------------------------------------+-----------------------+
|86311425-0890-40a5-8950-54cbaaa60815|2020-03-18 14:41:55 UTC|
|38e121a8-f21f-4d10-bb69-26eb045175b5|2020-03-13 15:19:21 UTC|
|1a69c9b0-283b-4b6d-89ac-66f987280c66|2020-03-16 12:59:51 UTC|
|c7b5c53f-bf40-498f-8302-4b3329322bc9|2020-03-18 22:05:06 UTC|
|0d3d807b-9b3a-466e-907c-c22402240730|2020-03-17 18:40:03 UTC|
+------------------------------------+-----------------------+

df.printSchema()
root
 |-- identifier: string (nullable = true)
 |-- timestamp: string (nullable = true)

I want a column that keeps only the date and the hour of the timestamp.

I am trying the following:

from pyspark.sql.functions import col, hour
df = df.withColumn("hour", hour(col("timestamp")))

But I get the following:

+--------------------+--------------------+----+
|          identifier|           timestamp|hour|
+--------------------+--------------------+----+
|321869c3-71e5-41d...|2020-03-19 03:34:...|null|
|226b8d50-2c6a-471...|2020-03-19 02:59:...|null|
|47818b7c-34b5-43c...|2020-03-19 01:41:...|null|
|f5ca5599-7252-49d...|2020-03-19 04:25:...|null|
|add2ae24-aa7b-4d3...|2020-03-19 01:50:...|null|
+--------------------+--------------------+----+

whereas I would like to have:

+--------------------+--------------------+-------------------+
|          identifier|           timestamp|hour               |
+--------------------+--------------------+-------------------+
|321869c3-71e5-41d...|2020-03-19 03:34:...|2020-03-19 03:00:00|
|226b8d50-2c6a-471...|2020-03-19 02:59:...|2020-03-19 02:00:00|
|47818b7c-34b5-43c...|2020-03-19 01:41:...|2020-03-19 01:00:00|
|f5ca5599-7252-49d...|2020-03-19 04:25:...|2020-03-19 04:00:00|
|add2ae24-aa7b-4d3...|2020-03-19 01:50:...|2020-03-19 01:00:00|
+--------------------+--------------------+-------------------+

- emax

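Please edit the question to include the schema (df.printSchema()) and show the DataFrame with truncate=False. - pault

Please also specify your expected output. - CPak

@pault I just edited it. - emax

5 Answers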

6

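You want the date and the hour. You can extract just those with the functions PySpark provides, as follows.

Three steps:

1. Convert the timestamp column to the timestamp type
2. Use a date function to extract the day from the timestamp
3. Use the hour function to extract the hour from the timestamp

The code: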

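from pyspark.sql.functions import dayofmonth, hour, to_timestamp
# Step 1: transform to the correct col format
# ('UTC' in quotes matches the literal suffix in the sample strings)
df = df.withColumn("timestamp", to_timestamp("timestamp", "yyyy-MM-dd HH:mm:ss 'UTC'"))

# Step 2 & 3: Extract the needed information
df = df.withColumn('Date', dayofmonth(df.timestamp))  # day of month as an integer
df = df.withColumn('Hour', hour(df.timestamp))        # hour of day as an integer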

# Display the result
df.show(1, False)
#+----------+--------------------+-------------------+-------------------+
#|identifier|           timestamp|               Date|               Hour|
#+----------+--------------------+-------------------+-------------------+
#|         1|2020-03-19 03:00:...|                 19|                 03|
#+----------+--------------------+-------------------+-------------------+

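The Hour column is not exactly what you described; that form is already covered in notNull's answer. This is an alternative if you only need the day and the hour as numbers for later grouping or aggregation.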
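
As a rough illustration of why numeric day and hour values are convenient for grouping, the plain-Python sketch below (standard library only, not Spark) counts the question's five sample timestamps per (day of month, hour) pair. The helper name day_and_hour is hypothetical and used only here, and the sketch assumes every value ends in the literal suffix " UTC".

```python
from collections import Counter
from datetime import datetime

# Hypothetical helper, for illustration only (not part of PySpark).
# Assumes the input looks like "2020-03-18 14:41:55 UTC".
def day_and_hour(raw: str) -> tuple:
    # Parse the string, treating the trailing " UTC" as literal text.
    parsed = datetime.strptime(raw, "%Y-%m-%d %H:%M:%S UTC")
    # Day of month and hour of day, both as plain integers.
    return (parsed.day, parsed.hour)

# The five sample values from the question.
timestamps = [
    "2020-03-18 14:41:55 UTC",
    "2020-03-13 15:19:21 UTC",
    "2020-03-16 12:59:51 UTC",
    "2020-03-18 22:05:06 UTC",
    "2020-03-17 18:40:03 UTC",
]

# Count how many rows fall into each (day, hour) bucket.
counts = Counter(day_and_hour(t) for t in timestamps)
for bucket, n in sorted(counts.items()):
    print(bucket, n)
```

In Spark the same idea would be a groupBy on the Date and Hour columns; the sketch only shows the bucketing logic on individual values.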

- Duong Vu

PySpark currently has no "date" function of this form. To get the result shown above you need "dayofmonth". - Pengshe

3

Why not use a custom UDF?

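import pyspark.sql.functions as F
from pyspark.sql.types import IntegerType

# Assumes df has a TimestampType column named "datetime";
# the lambda receives a Python datetime, or None for null rows.
hour = F.udf(lambda x: x.hour if x is not None else None, IntegerType())
hours = df.withColumn("hour", hour("datetime"))

# Pull the first five rows to the driver as a pandas DataFrame
hours.limit(5).toPandas()

This returns the first five rows, including the new integer hour column, as a pandas DataFrame.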

- SamuelNLP

1

A custom UDF is more expensive than the built-in PySpark functions. - aishik roy chaudhury

3

You can use the from_unixtime and unix_timestamp functions to pull the hour out of a timestamp or a string in yyyy-MM-dd HH:mm:ss form; the hour function works on both types as well.

from pyspark.sql.functions import col, from_unixtime, unix_timestamp
#sample data
df.show(truncate=False)
#+----------+-----------------------+
#|identifier|timestamp              |
#+----------+-----------------------+
#|1         |2020-03-18 14:41:55 UTC|
#+----------+-----------------------+
#DataFrame[identifier: string, timestamp: string]

# HH is the 24-hour field (hh is the 12-hour one); 'UTC' matches the literal suffix
df.withColumn("hour", from_unixtime(unix_timestamp(col("timestamp"), "yyyy-MM-dd HH:mm:ss 'UTC'"), "yyyy-MM-dd HH:00:00")).show()
#+----------+--------------------+-------------------+
#|identifier|           timestamp|               hour|
#+----------+--------------------+-------------------+
#|         1|2020-03-18 14:41:...|2020-03-18 14:00:00|
#+----------+--------------------+-------------------+

Using the hour function:

#on string type 
spark.sql("select hour('2020-03-04 12:34:34')").show()
#on timestamp type
spark.sql("select hour(timestamp('2020-03-04 12:34:34'))").show()
#+---+
#|_c0|
#+---+
#| 12|
#+---+

- notNull

2

With Spark 3.3.0, simply using hour and weekofyear does the job. Assumption: timestamp is already in the correct format.

from pyspark.sql import functions as SF

(
  df
  .withColumn("hour"      , SF.hour("timestamp") )
  .withColumn("weekofyear", SF.weekofyear("timestamp") )
  .show(n=2)
)

- Curious Watcher


- murtihash · Accepted Answer

You should use the built-in PySpark function date_trunc to truncate to the hour. You can also truncate to day, month, year, and so on.

from pyspark.sql import functions as F
df.withColumn("hour", F.date_trunc('hour',F.to_timestamp("timestamp","yyyy-MM-dd HH:mm:ss 'UTC'")))\
  .show(truncate=False)


+------------------------------------+-----------------------+-------------------+
|identifier                          |timestamp              |hour               |
+------------------------------------+-----------------------+-------------------+
|86311425-0890-40a5-8950-54cbaaa60815|2020-03-18 14:41:55 UTC|2020-03-18 14:00:00|
|38e121a8-f21f-4d10-bb69-26eb045175b5|2020-03-13 15:19:21 UTC|2020-03-13 15:00:00|
|1a69c9b0-283b-4b6d-89ac-66f987280c66|2020-03-16 12:59:51 UTC|2020-03-16 12:00:00|
|c7b5c53f-bf40-498f-8302-4b3329322bc9|2020-03-18 22:05:06 UTC|2020-03-18 22:00:00|
|0d3d807b-9b3a-466e-907c-c22402240730|2020-03-17 18:40:03 UTC|2020-03-17 18:00:00|
+------------------------------------+-----------------------+-------------------+
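
For readers who want to see what the truncation does to a single value, here is a minimal plain-Python sketch using only the standard library, not Spark. The helper name truncate_to_hour is made up for this illustration, and it assumes every value ends in the literal suffix " UTC", as in the question's sample.

```python
from datetime import datetime

# Hypothetical helper, for illustration only (not part of PySpark).
# Assumes the input looks like "2020-03-18 14:41:55 UTC".
def truncate_to_hour(raw: str) -> datetime:
    # Parse the string, treating the trailing " UTC" as literal text.
    parsed = datetime.strptime(raw, "%Y-%m-%d %H:%M:%S UTC")
    # Zero out everything below the hour, which is the effect
    # date_trunc('hour', ...) has on a Spark timestamp.
    return parsed.replace(minute=0, second=0, microsecond=0)

# Three of the sample values from the question.
samples = [
    "2020-03-18 14:41:55 UTC",
    "2020-03-13 15:19:21 UTC",
    "2020-03-16 12:59:51 UTC",
]

truncated = [truncate_to_hour(s) for s in samples]
for raw, value in zip(samples, truncated):
    print(raw, "->", value)
```

The sketch returns naive datetime objects and ignores time zones entirely; in Spark, the session time zone setting can affect how such strings are interpreted.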