I normally use spark.createDataFrame(). On earlier versions of Pandas it would only throw me a deprecation warning about the iteritems() call, but on Pandas 2.0.0 it doesn't work at all and fails with the following error:
AttributeError Traceback (most recent call last)
File <command-2209449931455530>:64
61 df_train_test_p.loc[df_train_test_p.is_train=='N','preds']=preds_test
63 # save the original table and predictions into spark dataframe
---> 64 df_test = spark.createDataFrame(df_train_test_p.loc[df_train_test_p.is_train=='N'])
65 df_results = df_results.union(df_test)
67 # saving all relevant data
File /databricks/spark/python/pyspark/instrumentation_utils.py:48, in _wrap_function.<locals>.wrapper(*args, **kwargs)
46 start = time.perf_counter()
47 try:
---> 48 res = func(*args, **kwargs)
49 logger.log_success(
50 module_name, class_name, function_name, time.perf_counter() - start, signature
51 )
52 return res
File /databricks/spark/python/pyspark/sql/session.py:1211, in SparkSession.createDataFrame(self, data, schema, samplingRatio, verifySchema)
1207 data = pd.DataFrame(data, columns=column_names)
1209 if has_pandas and isinstance(data, pd.DataFrame):
1210 # Create a DataFrame from pandas DataFrame.
-> 1211 return super(SparkSession, self).createDataFrame( # type: ignore[call-overload]
1212 data, schema, samplingRatio, verifySchema
1213 )
1214 return self._create_dataframe(
1215 data, schema, samplingRatio, verifySchema # type: ignore[arg-type]
1216 )
File /databricks/spark/python/pyspark/sql/pandas/conversion.py:478, in SparkConversionMixin.createDataFrame(self, data, schema, samplingRatio, verifySchema)
476 warn(msg)
477 raise
--> 478 converted_data = self._convert_from_pandas(data, schema, timezone)
479 return self._create_dataframe(converted_data, schema, samplingRatio, verifySchema)
File /databricks/spark/python/pyspark/sql/pandas/conversion.py:516, in SparkConversionMixin._convert_from_pandas(self, pdf, schema, timezone)
514 else:
515 should_localize = not is_timestamp_ntz_preferred()
--> 516 for column, series in pdf.iteritems():
517 s = series
518 if should_localize and is_datetime64tz_dtype(s.dtype) and s.dt.tz is not None:
File /local_disk0/.ephemeral_nfs/envs/pythonEnv-fefe10af-04b7-4277-b395-2f16b77bd90b/lib/python3.9/site-packages/pandas/core/generic.py:5981, in NDFrame.__getattr__(self, name)
5974 if (
5975 name not in self._internal_names_set
5976 and name not in self._metadata
5977 and name not in self._accessors
5978 and self._info_axis._can_hold_identifiers_and_holds_name(name)
5979 ):
5980 return self[name]
-> 5981 return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'iteritems'
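The root cause appears to be that Pandas 2.0 removed DataFrame.iteritems() (deprecated since 1.5) in favor of items(), while Spark 3.3.2's pandas conversion path still calls the old name. A minimal check of this, assuming only a local pandas install:

```python
import pandas as pd

df = pd.DataFrame({"name": ["John", "Mike"], "age": [25, 30]})

# items() is the replacement API and yields the same (column, Series)
# pairs that iteritems() used to yield
columns = [col for col, series in df.items()]
print(columns)  # ['name', 'age']

# On pandas >= 2.0 the old method is gone entirely, hence the
# AttributeError raised inside pyspark's conversion.py
print(hasattr(pd.DataFrame, "iteritems"))
```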
How can I fix this? I'm on Spark 3.3.2. I found what looks like newer code that has already replaced the problematic call, here: https://github.com/apache/spark/blob/master/python/pyspark/sql/pandas/conversion.py but I'm not sure which release it belongs to or whether it's available yet.
Edit: here is sample code that reproduces the issue:
import pandas as pd
from pyspark.sql import SparkSession
# create a sample pandas dataframe
data = {'name': ['John', 'Mike', 'Sara', 'Adam'], 'age': [25, 30, 18, 40]}
df_pandas = pd.DataFrame(data)
# convert pandas dataframe to PySpark dataframe
spark = SparkSession.builder.appName('pandasToSpark').getOrCreate()
df_spark = spark.createDataFrame(df_pandas)
# show the PySpark dataframe
df_spark.show()
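Until the cluster can move to a Spark release whose conversion.py uses items() (or pandas can be pinned below 2.0), a commonly suggested stopgap is to alias the removed method back onto pandas before calling spark.createDataFrame. This is a sketch with the usual monkey-patching caveats, not an official API:

```python
import pandas as pd

# Restore the method name Spark 3.3.x expects; items() has the same
# (column, Series) iteration behavior that iteritems() had.
if not hasattr(pd.DataFrame, "iteritems"):
    pd.DataFrame.iteritems = pd.DataFrame.items

# With the alias in place, the repro above should convert cleanly:
# df_spark = spark.createDataFrame(df_pandas)
```

Alternatively, pinning pandas below 2.0.0 (e.g. pandas==1.5.3) in the environment avoids the patch entirely.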