Google turns up plenty of solutions for this problem, but unfortunately, even after trying every one of them, I still cannot get it to work. Please bear with me and see whether anything below sparks an idea.
OS: macOS
Spark: 1.6.3 (Scala 2.10 build)
Jupyter Notebook: 4.4.0
Python: 2.7
Scala: 2.12.1
I installed Jupyter Notebook successfully and it runs fine. Next, I tried to configure it to work with Spark, and for that I installed the Spark interpreter using Apache Toree. Now, whenever I run any RDD action in the notebook, it throws the following error:
Error from python worker:
/usr/bin/python: No module named pyspark
PYTHONPATH was:
/private/tmp/hadoop-xxxx/nm-local-dir/usercache/xxxx/filecache/33/spark-assembly-1.6.3-hadoop2.2.0.jar
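For reference, the failing cell is nothing exotic; a minimal sketch of what triggers the error (assuming the Toree PySpark kernel provides the usual `sc` SparkContext, as mine does):

```python
# Minimal repro in the Toree PySpark notebook; `sc` is the SparkContext
# the kernel creates on startup (assumption).
rdd = sc.parallelize([1, 2, 3, 4])

# Defining the RDD succeeds; the action below is what launches Python
# worker processes on the executors, and that is where
# "No module named pyspark" is raised.
print(rdd.count())
```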
Things I have tried:

1. Set PYTHONPATH in .bash_profile.
2. Verified that the local Python CLI can import pyspark.
3. Updated the interpreter's kernel.json to the following:
{
  "language": "python",
  "display_name": "Apache Toree - PySpark",
  "env": {
    "__TOREE_SPARK_OPTS__": "",
    "SPARK_HOME": "/Users/xxxx/Desktop/utils/spark",
    "__TOREE_OPTS__": "",
    "DEFAULT_INTERPRETER": "PySpark",
    "PYTHONPATH": "/Users/xxxx/Desktop/utils/spark/python:/Users/xxxx/Desktop/utils/spark/python/lib/py4j-0.9-src.zip:/Users/xxxx/Desktop/utils/spark/python/lib/pyspark.zip:/Users/xxxx/Desktop/utils/spark/bin",
    "PYSPARK_SUBMIT_ARGS": "--master local --conf spark.serializer=org.apache.spark.serializer.KryoSerializer",
    "PYTHON_EXEC": "python"
  },
  "argv": [
    "/usr/local/share/jupyter/kernels/apache_toree_pyspark/bin/run.sh",
    "--profile",
    "{connection_file}"
  ]
}
4. Updated the interpreter's run.sh to explicitly load py4j-0.9-src.zip and pyspark.zip. When I open a PySpark notebook and create a SparkContext, I can see spark-assembly, py4j, and pyspark being uploaded from the local machine, yet when an action is invoked, pyspark somehow still cannot be found.
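For completeness, here is a minimal diagnostic sketch (again assuming the kernel exposes the usual `sc`) that compares the driver's Python environment with what the executors' workers actually see. If the probe job itself dies with the same error, then the PYTHONPATH printed in the worker traceback is exactly what the worker process received:

```python
import os
import sys

# Driver side: which interpreter and PYTHONPATH the notebook kernel uses.
print("driver python:", sys.executable)
print("driver PYTHONPATH:", os.environ.get("PYTHONPATH"))
# The executable Spark will launch for Python workers (PYSPARK_PYTHON,
# defaulting to plain "python").
print("worker python:", sc.pythonExec)

# Executor side: ship a tiny function to one partition and report back.
# If this action fails with "No module named pyspark", the worker never
# received a usable PYTHONPATH, regardless of what the driver shows.
def probe(_):
    import os, sys
    yield (sys.executable, os.environ.get("PYTHONPATH"))

print(sc.parallelize([0], 1).mapPartitions(probe).collect())
```

In my case the error message shows the worker running /usr/bin/python with a PYTHONPATH that contains only the uploaded spark-assembly jar, which is why I suspect the kernel.json env settings are not reaching the worker processes.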