I am running a Spark Structured Streaming example on Spark 3.0.0, using Twitter data. I have pushed the Twitter data into Kafka, and a single record looks like this:
2020-07-21 10:48:19|1265200268284588034|RT @narendramodi: Had an extensive interaction with CEO of @IBM, Mr. @ArvindKrishna. We discussed several subjects relating to technology,…|Hyderabad, India

Here, each field is separated by '|', and the fields are:
- Date-time
- User ID
- Tweet text
- Location

Read from Kafka, the record appears as:
key | value
----+------------------------------------------------------------------------
    | 2020-07-21 10:48:19|1265200268284588034|RT @narendramodi: Had an extensive interaction with CEO of @IBM, Mr. @ArvindKrishna. We discussed several subjects relating to technology,…|Hyderabad, India
Based on this answer, I added the following code block to my application:
import pyspark.sql.functions

split_col = pyspark.sql.functions.split(df['value'], '|')
df = df.withColumn("Tweet Time", split_col.getItem(0))
df = df.withColumn("User ID", split_col.getItem(1))
df = df.withColumn("Tweet Text", split_col.getItem(2))
df = df.withColumn("Location", split_col.getItem(3))
df = df.drop("key")
But the actual output turns out like this:
value | Tweet Time | User ID | Tweet Text | Location
------------------------------------------------------------------------+------------+---------+------------+---------
2020-07-21 10:48:19|1265200268284588034|RT @narendramodi: Had an extensive interaction with CEO of @IBM, Mr. @ArvindKrishna. We discussed several subjects relating to technology,…|Hyderabad, India | 2 | 0 | 2 | 0
But I expect the output to be as follows:
Tweet Time             | User ID                 | Tweet Text                                                                                                                                   | Location          |
-----------------------+-------------------------+----------------------------------------------------------------------------------------------------------------------------------------------+-------------------+
2020-07-21 10:48:19 | 1265200268284588034 | RT @narendramodi: Had an extensive interaction with CEO of @IBM, Mr. @ArvindKrishna. We discussed several subjects relating to technology,… | Hyderabad, India |
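One detail that may be relevant (an assumption on my part): `pyspark.sql.functions.split` treats its second argument as a Java regular expression, in which `|` is the alternation operator and matches the empty string at every position, so the value would be split into single characters rather than four fields. Splitting on a literal pipe requires escaping it. The escaped split can be sketched in plain Python with `re.split` (the record below is a shortened version of the sample data):

```python
import re

# A shortened version of one Kafka record from the question
record = ("2020-07-21 10:48:19|1265200268284588034|"
          "RT @narendramodi: Had an extensive interaction with CEO of @IBM|"
          "Hyderabad, India")

# '|' is a regex alternation operator; escaping it splits on the literal pipe
fields = re.split(r"\|", record)

print(fields[0])  # 2020-07-21 10:48:19
print(fields[1])  # 1265200268284588034
print(fields[3])  # Hyderabad, India
```

If this diagnosis applies, the Spark call would need the same escaping, e.g. `pyspark.sql.functions.split(df['value'], '\\|')`, or the equivalent character-class pattern `'[|]'`.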