I went down a long and painful road to find a solution that works here.
I'm using a local Jupyter server in VS Code. There, I created a .env file:
SPARK_HOME=/home/adam/projects/graph-algorithms-book/spark-3.2.0-bin-hadoop3.2
JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
PYSPARK_SUBMIT_ARGS="--driver-memory 2g --executor-memory 6g --packages graphframes:graphframes:0.8.2-spark3.2-s_2.12 pyspark-shell"
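If you want to confirm that the VS Code Jupyter kernel actually picked up the .env file, a quick sanity check from a notebook cell (my own check, not strictly required) is:

import os
# Both should print the values from the .env file above; None means the file wasn't loaded.
print(os.environ.get("SPARK_HOME"))
print(os.environ.get("PYSPARK_SUBMIT_ARGS"))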
Then in my Python notebook, I have code like the following:
from pyspark.sql.types import *
from graphframes import *
from pyspark.sql.session import SparkSession
spark = SparkSession.builder.appName('GraphFrames').getOrCreate()
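As far as I know, the --packages flag in PYSPARK_SUBMIT_ARGS ends up in the spark.jars.packages config key, so as a rough sanity check (not part of the original setup) you can inspect the running session:

# Confirm the Spark version and that the graphframes package was passed through.
print(spark.version)
print(spark.sparkContext.getConf().get("spark.jars.packages", "not set"))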
When the getOrCreate() cell first runs, you should see Spark fetch the dependencies and print out the resolution log accordingly. Like so:
:: loading settings :: url = jar:file:/home/adam/projects/graph-algorithms-book/spark-3.2.0-bin-hadoop3.2/jars/ivy-2.5.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
Ivy Default Cache set to: /home/adam/.ivy2/cache
The jars for the packages stored in: /home/adam/.ivy2/jars
graphframes#graphframes added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-96a3a1f1-4ea4-4433-856b-042d0269ec1a;1.0
confs: [default]
found graphframes#graphframes;0.8.2-spark3.2-s_2.12 in spark-packages
found org.slf4j#slf4j-api;1.7.16 in central
:: resolution report :: resolve 174ms :: artifacts dl 8ms
:: modules in use:
graphframes#graphframes;0.8.2-spark3.2-s_2.12 from spark-packages in [default]
org.slf4j#slf4j-api;1.7.16 from central in [default]
---------------------------------------------------------------------
| | modules || artifacts |
| conf | number| search|dwnlded|evicted|| number|dwnlded|
---------------------------------------------------------------------
| default | 2 | 0 | 0 | 0 || 2 | 0 |
---------------------------------------------------------------------
After that, I was able to put together some code with the relationships:
v = spark.createDataFrame([
("a", "Alice", 34),
("b", "Bob", 36),
("c", "Charlie", 30),
], ["id", "name", "age"])
It should work fine. Just remember to keep all the PySpark versions aligned. I had to install the proper version of graphframes from a forked repo. The PyPI version lags behind, so I had to use the PHPirates repo to get a correct install. The graphframes there is compiled against PySpark 3.2.0:
pip install "git+https://github.com/PHPirates/graphframes.git@add-setup.py#egg=graphframes&subdirectory=python"
pip install pyspark==3.2.0
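Once both installs finish, a quick way to double-check that the versions actually line up (again, just my own sanity check) is:

import pyspark
print(pyspark.__version__)  # expect 3.2.0, matching the spark-3.2.0-bin-hadoop3.2 build above

import graphframes          # should import cleanly after the git-based install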