How to integrate Jupyter notebook and pyspark on Ubuntu 12.04?

4

I'm new to PySpark. I installed Anaconda ("bash Anaconda2-4.0.0-Linux-x86_64.sh") and pyspark on Ubuntu, and everything works fine in the terminal. Now I want to use it from Jupyter. I created a profile in my Ubuntu terminal as follows:

wanderer@wanderer-VirtualBox:~$ ipython profile create pyspark
[ProfileCreate] Generating default config file: u'/home/wanderer/.ipython/profile_pyspark/ipython_config.py'
[ProfileCreate] Generating default config file: u'/home/wanderer/.ipython/profile_pyspark/ipython_kernel_config.py'

wanderer@wanderer-VirtualBox:~$ export ANACONDA_ROOT=~/anaconda2
wanderer@wanderer-VirtualBox:~$ export PYSPARK_DRIVER_PYTHON=$ANACONDA_ROOT/bin/ipython
wanderer@wanderer-VirtualBox:~$ export PYSPARK_PYTHON=$ANACONDA_ROOT/bin/python

wanderer@wanderer-VirtualBox:~$ cd spark-1.5.2-bin-hadoop2.6/
wanderer@wanderer-VirtualBox:~/spark-1.5.2-bin-hadoop2.6$ PYTHON_OPTS="notebook" ./bin/pyspark
Python 2.7.11 |Anaconda 4.0.0 (64-bit)| (default, Dec  6 2015, 18:08:32) 
Type "copyright", "credits" or "license" for more information.

IPython 4.1.2 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/04/24 15:27:42 INFO SparkContext: Running Spark version 1.5.2
16/04/24 15:27:43 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

16/04/24 15:27:53 INFO BlockManagerMasterEndpoint: Registering block manager localhost:33514 with 530.3 MB RAM, BlockManagerId(driver, localhost, 33514)
16/04/24 15:27:53 INFO BlockManagerMaster: Registered BlockManager
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.5.2
      /_/

Using Python version 2.7.11 (default, Dec  6 2015 18:08:32)
SparkContext available as sc, HiveContext available as sqlContext.

In [1]: sc
Out[1]: <pyspark.context.SparkContext at 0x7fc96cc6fd10>

In [2]: print sc.version
1.5.2

In [3]: 

Below are my Jupyter and IPython versions:

wanderer@wanderer-VirtualBox:~$ jupyter --version
4.1.0

wanderer@wanderer-VirtualBox:~$ ipython --version
4.1.2

I have tried to integrate Jupyter Notebook with PySpark, but every attempt failed. I want to use PySpark inside Jupyter, but I don't understand how to wire them together.

Can someone show how to integrate these components?


3
Take a look at this link: linking Jupyter with PySpark - Alberto Bonsanto
@AlbertoBonsanto... great... the problem is finally solved and I have started practicing with pyspark. The link you gave cleared my roadblock!!! - Wanderer
3 Answers

13

Just run the following command:

PYSPARK_DRIVER_PYTHON="jupyter" PYSPARK_DRIVER_PYTHON_OPTS="notebook" pyspark
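With those two variables set, pyspark starts the driver inside a Jupyter notebook instead of the plain shell. A minimal sanity check for the first notebook cell, assuming the notebook was launched this way so the launcher has already created sc:

# `sc` is pre-created by the pyspark launcher; no import needed in this cell
print(sc.version)                         # e.g. 1.5.2 in the setup above
print(sc.parallelize(range(100)).sum())   # 4950 - confirms the executors respond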

10

Using nano or vim, add the following two lines to pyspark:

PYSPARK_DRIVER_PYTHON="jupyter"
PYSPARK_DRIVER_PYTHON_OPTS="notebook"

4

Edit, October 2017

With Spark 2.2 and findspark, this works fine without those environment variables:

import findspark
findspark.init('/opt/spark')   # point findspark at the local Spark installation
import pyspark                 # importable now that findspark has added it to sys.path
sc = pyspark.SparkContext()
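From there the context behaves like any other SparkContext; a quick check from a notebook cell, with hypothetical example data:

rdd = sc.parallelize([1, 2, 3, 4])
print(rdd.map(lambda x: x * x).collect())   # [1, 4, 9, 16]
sc.stop()                                   # release the context when finished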

Old answer

The fastest way I found was to run:

export PYSPARK_DRIVER_PYTHON=ipython
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
pyspark

or the jupyter equivalent. This opens an IPython notebook with pyspark enabled. You could also take a look at Beaker notebook.


An even simpler way is to run IPYTHON_OPTS="notebook" $SPARK_HOME/bin/pyspark from the command line. Found here: http://npatta01.github.io/2015/08/01/pyspark_jupyter/ - citynorman
IPYTHON_OPTS="notebook" $SPARK_HOME/bin/pyspark appears to have been removed in Spark 2.0 and above - Neal
