Error message when converting a Pandas dataframe to a Spark dataframe


Since Spark has no built-in support for reading Excel files, I first read the Excel file into a Pandas dataframe and then tried to convert the Pandas dataframe to a Spark dataframe, but I ran into the error below (I am using Spark 1.5.1).

import pandas as pd
from pandas import ExcelFile
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import *

# sqlContext is assumed to be an existing SQLContext instance
# (pre-created in the pyspark shell or built from a SparkContext)
pdf = pd.read_excel('/home/testdata/test.xlsx')
df = sqlContext.createDataFrame(pdf)

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/spark/spark-hadoop/python/pyspark/sql/context.py", line 406, in createDataFrame
    rdd, schema = self._createFromLocal(data, schema)
  File "/opt/spark/spark-hadoop/python/pyspark/sql/context.py", line 337, in _createFromLocal
    data = [schema.toInternal(row) for row in data]
  File "/opt/spark/spark-hadoop/python/pyspark/sql/types.py", line 541, in toInternal
    return tuple(f.toInternal(v) for f, v in zip(self.fields, obj))
  File "/opt/spark/spark-hadoop/python/pyspark/sql/types.py", line 541, in <genexpr>
    return tuple(f.toInternal(v) for f, v in zip(self.fields, obj))
  File "/opt/spark/spark-hadoop/python/pyspark/sql/types.py", line 435, in toInternal
    return self.dataType.toInternal(obj)
  File "/opt/spark/spark-hadoop/python/pyspark/sql/types.py", line 191, in toInternal
    else time.mktime(dt.timetuple()))
AttributeError: 'datetime.time' object has no attribute 'timetuple'

Does anyone know how to fix this?

Could you post a link to your test.xlsx? - Sergey Bushmanov
https://drive.google.com/file/d/0B9n_aOz2bmxzVUc2S084dW1KR1E/view?usp=sharing - b4me
1 Answer

My guess is that the problem you are hitting is related to the datetime data being parsed "incorrectly" when the data is read with Pandas. The following code works "fine":
import pandas as pd
from pandas import ExcelFile
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import *

# parsing these columns as full datetimes avoids bare datetime.time values,
# which Spark's type conversion cannot handle
pdf = pd.read_excel('test.xlsx', parse_dates=['Created on', 'Confirmation time'])

sc = SparkContext()
sqlContext = SQLContext(sc)

sqlContext.createDataFrame(data=pdf).collect()

[Row(Customer=1000935702, Country='TW',  ...

Note that you also have another datetime column, 'Confirmation date', which in your sample reads into the RDD without problems because it only contains NaT. But if that column holds real data in the full dataset, you will need to take care of it as well.
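A minimal sketch of one way to handle that column up front, assuming the full dataset's 'Confirmation date' values are plain dates with occasional blanks (the placeholder fill value below is only illustrative):

pdf = pd.read_excel('test.xlsx',
                    parse_dates=['Created on', 'Confirmation time', 'Confirmation date'])

# replace NaT with a placeholder timestamp so every row converts cleanly
pdf['Confirmation date'] = pdf['Confirmation date'].fillna(pd.Timestamp('1970-01-01'))

sqlContext.createDataFrame(pdf).collect()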

The previous error is gone, but now I am hitting a type error. Do I need to fix the type of each column one by one? Thanks >>> df = sqlContext.createDataFrame(pdf) Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/opt/spark/spark-hadoop/python/pyspark/sql/context.py", line 406, in createDataFrame rdd, schema = self._createFromLocal(data, schema) File "/opt/spark/spark-hadoop/python/pyspark/sql/context.py", line 322, in _createFromLocal ... TypeError: Can not merge type <class 'pyspark.sql.types.DoubleType'> and <class 'pyspark.sql.types.StringType'> - b4me
@b4me You might consider accepting the solution to the earlier problem and posting the new one as a separate question. - Sergey Bushmanov
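For the TypeError raised in the comments, here is a minimal sketch of the usual workaround, assuming a hypothetical mixed-type column named 'Order ID' (the thread does not name the offending column): force each such column to a single Pandas dtype, or pass an explicit schema, before calling createDataFrame.

# cast a column that mixes numbers and strings to one dtype so Spark's
# per-row type inference does not produce conflicting DoubleType/StringType
pdf['Order ID'] = pdf['Order ID'].astype(str)
df = sqlContext.createDataFrame(pdf)

# alternatively, supply an explicit all-string schema and skip inference altogether
from pyspark.sql.types import StructType, StructField, StringType
schema = StructType([StructField(c, StringType(), True) for c in pdf.columns])
df = sqlContext.createDataFrame(pdf.astype(str), schema=schema)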
