PicklingError: Could not serialize object: IndexError: tuple index out of range


I started PySpark from the command line and ran the following to sharpen my skills.

C:\Users\Administrator>SUCCESS: The process with PID 5328 (child process of PID 4476) has been terminated.
SUCCESS: The process with PID 4476 (child process of PID 1092) has been terminated.
SUCCESS: The process with PID 1092 (child process of PID 3952) has been terminated.
pyspark
Python 3.11.1 (tags/v3.11.1:a7a450f, Dec  6 2022, 19:58:39) [MSC v.1934 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/01/08 20:07:53 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.3.1
      /_/

Using Python version 3.11.1 (tags/v3.11.1:a7a450f, Dec  6 2022 19:58:39)
Spark context Web UI available at http://Mohit:4040
Spark context available as 'sc' (master = local[*], app id = local-1673188677388).
SparkSession available as 'spark'.
>>> 23/01/08 20:08:10 WARN ProcfsMetricsGetter: Exception when trying to compute pagesize, as a result reporting of ProcessTree metrics is stopped
a = sc.parallelize([1,2,3,4,5,6,7,8,9,10])

When I execute a.take(1), I get "_pickle.PicklingError: Could not serialize object: IndexError: tuple index out of range", and I cannot figure out why. The same code runs without any error on Google Colab. Below is what I get in the console.
>>> a.take(1)
Traceback (most recent call last):
  File "C:\Spark\python\pyspark\serializers.py", line 458, in dumps
    return cloudpickle.dumps(obj, pickle_protocol)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Spark\python\pyspark\cloudpickle\cloudpickle_fast.py", line 73, in dumps
    cp.dump(obj)
  File "C:\Spark\python\pyspark\cloudpickle\cloudpickle_fast.py", line 602, in dump
    return Pickler.dump(self, obj)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Spark\python\pyspark\cloudpickle\cloudpickle_fast.py", line 692, in reducer_override
    return self._function_reduce(obj)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Spark\python\pyspark\cloudpickle\cloudpickle_fast.py", line 565, in _function_reduce
    return self._dynamic_function_reduce(obj)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Spark\python\pyspark\cloudpickle\cloudpickle_fast.py", line 546, in _dynamic_function_reduce
    state = _function_getstate(func)
            ^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Spark\python\pyspark\cloudpickle\cloudpickle_fast.py", line 157, in _function_getstate
    f_globals_ref = _extract_code_globals(func.__code__)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Spark\python\pyspark\cloudpickle\cloudpickle.py", line 334, in _extract_code_globals
    out_names = {names[oparg]: None for _, oparg in _walk_global_ops(co)}
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Spark\python\pyspark\cloudpickle\cloudpickle.py", line 334, in <dictcomp>
    out_names = {names[oparg]: None for _, oparg in _walk_global_ops(co)}
                 ~~~~~^^^^^^^
IndexError: tuple index out of range
Traceback (most recent call last):
  File "C:\Spark\python\pyspark\serializers.py", line 458, in dumps
    return cloudpickle.dumps(obj, pickle_protocol)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Spark\python\pyspark\cloudpickle\cloudpickle_fast.py", line 73, in dumps
    cp.dump(obj)
  File "C:\Spark\python\pyspark\cloudpickle\cloudpickle_fast.py", line 602, in dump
    return Pickler.dump(self, obj)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Spark\python\pyspark\cloudpickle\cloudpickle_fast.py", line 692, in reducer_override
    return self._function_reduce(obj)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Spark\python\pyspark\cloudpickle\cloudpickle_fast.py", line 565, in _function_reduce
    return self._dynamic_function_reduce(obj)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Spark\python\pyspark\cloudpickle\cloudpickle_fast.py", line 546, in _dynamic_function_reduce
    state = _function_getstate(func)
            ^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Spark\python\pyspark\cloudpickle\cloudpickle_fast.py", line 157, in _function_getstate
    f_globals_ref = _extract_code_globals(func.__code__)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Spark\python\pyspark\cloudpickle\cloudpickle.py", line 334, in _extract_code_globals
    out_names = {names[oparg]: None for _, oparg in _walk_global_ops(co)}
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Spark\python\pyspark\cloudpickle\cloudpickle.py", line 334, in <dictcomp>
    out_names = {names[oparg]: None for _, oparg in _walk_global_ops(co)}
                 ~~~~~^^^^^^^
IndexError: tuple index out of range

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Spark\python\pyspark\rdd.py", line 1883, in take
    res = self.context.runJob(self, takeUpToNumLeft, p)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Spark\python\pyspark\context.py", line 1486, in runJob
    sock_info = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions)
                                                           ^^^^^^^^^^^^^^^
  File "C:\Spark\python\pyspark\rdd.py", line 3505, in _jrdd
    wrapped_func = _wrap_function(
                   ^^^^^^^^^^^^^^^
  File "C:\Spark\python\pyspark\rdd.py", line 3362, in _wrap_function
    pickled_command, broadcast_vars, env, includes = _prepare_for_python_RDD(sc, command)
                                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Spark\python\pyspark\rdd.py", line 3345, in _prepare_for_python_RDD
    pickled_command = ser.dumps(command)
                      ^^^^^^^^^^^^^^^^^^
  File "C:\Spark\python\pyspark\serializers.py", line 468, in dumps
    raise pickle.PicklingError(msg)
_pickle.PicklingError: Could not serialize object: IndexError: tuple index out of range

It should return [1] as the answer, but instead it throws this error. Is it because of an incorrect installation?
Packages used - spark-3.3.1-bin-hadoop3.tgz, Java(TM) SE Runtime Environment (build 1.8.0_351-b10), Python 3.11.1
Could someone please help me resolve this? Thanks in advance.

It could be a Python version incompatibility issue; could you re-check with version 3.8? - Prakash Dahal
I solved it by downgrading to Python 3.9. I then installed pip for 3.9 with python3.9 -m ensurepip, and once that is done you can install it with python3.9 -m pip install pyspark. After installing you will get an error saying that you are running pyspark with Python 3.11 ... it is an environment-variable problem, and you need to change two variables: - Allexj
I use Jupyter Lab in VS Code; to get the variables right in VS Code JupyterLab you have to open the Jupyter extension settings.json and add "jupyter.runStartupCommands": ["import os\nos.environ['PYSPARK_PYTHON']='/bin/python3.9'\nos.environ['PYSPARK_DRIVER_PYTHON']='/bin/python3.9/'\n"] - Allexj
If instead you want to use Python 3.9 to run PySpark system-wide, you can add the following to your .bashrc: export PYSPARK_PYTHON='/bin/python3.9' and export PYSPARK_DRIVER_PYTHON='/bin/python3.9' - Allexj
@Allexj Thanks for the solution. It still doesn't work when I try to run it in a Jupyter Notebook. - Sophia
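
For reference, here is a minimal sketch of the interpreter fix described in the comments above, adapted to the Windows setup in the question. The path C:\Python39\python.exe and the file name take_test.py are only hypothetical examples; the script has to be launched with the compatible interpreter (the driver runs in whatever Python starts it), and PYSPARK_PYTHON points the workers at the same one.

# Sketch only: run this file with a supported interpreter, e.g.
#   C:\Python39\python.exe take_test.py      (hypothetical path and file name)
# so the driver is not Python 3.11, and point the workers at the same interpreter.
import os
from pyspark.sql import SparkSession

os.environ["PYSPARK_PYTHON"] = r"C:\Python39\python.exe"  # hypothetical worker interpreter

spark = SparkSession.builder.master("local[*]").appName("take-test").getOrCreate()
sc = spark.sparkContext

a = sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
print(a.take(1))  # expected: [1] once a supported Python version is used

spark.stop()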
2 Answers


I even tried Python 3.8.5, but the problem persists. Standalone Spark run from cmd works perfectly and gives the correct result. I have both Python versions installed, 3.8.5 and 3.11.1, with 3.8.5 set as the default, yet it still throws the same error. Is there any way to fix this? - Mohit Aswani
@MohitAswani To make absolutely sure you are not using Python 3.11, I would suggest uninstalling it completely and seeing what happens, because you may have other environment variables or Spark configuration still pointing to 3.11 instead of 3.8. - PHPirate

As of 3/2/23 I was having the exact same issue and, as mentioned above, I uninstalled Python 3.11 and installed version 3.10.9, and the problem was solved!

I just installed Python 3.10.11 alongside 3.11.x and made 3.10 the base interpreter of the venv used for PySpark 3.4 against a Spark 3.4.0 server. It also runs seamlessly. - Eugene Savchenko
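
As a follow-up to this answer, a small verification sketch (mine, not the answerer's) can confirm that both the driver and the executors pick up the downgraded interpreter; the parallelize/map job is simply one way to ask a worker task for its sys.version.

# Sketch only: check which Python version the driver and the workers run.
import sys
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
sc = spark.sparkContext

print("driver Python:", sys.version)

# Each task reports its interpreter version; with a supported version
# (e.g. 3.8-3.10 for Spark 3.3.x) the PicklingError no longer occurs.
worker_versions = sc.parallelize(range(2), 2).map(lambda _: sys.version).distinct().collect()
print("worker Python:", worker_versions)

spark.stop()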
