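I am trying to filter a DataFrame in PySpark using a list. I want to either filter out records based on the list, or include only the records whose value is in the list. My code below does not work: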
# define a dataframe
rdd = sc.parallelize([(0,1), (0,1), (0,2), (1,2), (1,10), (1,20), (3,18), (3,18), (3,18)])
df = sqlContext.createDataFrame(rdd, ["id", "score"])
# define a list of scores
l = [10,18,20]
# filter out records by scores by list l
records = df.filter(df.score in l)
# expected: (0,1), (0,1), (0,2), (1,2)
# include only records with these scores in list l
records = df.where(df.score in l)
# expected: (1,10), (1,20), (3,18), (3,18), (3,18)
It gives the following error:
ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
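The error comes from Python's `in` operator, not from Spark's filtering itself: `x in some_list` compares `x` with each element and then asks for the truth value of each comparison, and a column expression has no single truth value. The sketch below reproduces that mechanism in plain Python. `FakeExpr` is a hypothetical stand-in written for illustration, not the PySpark `Column` class, and the expression strings it builds are made up for readability.

```python
# Pure-Python stand-in (hypothetical class, NOT the PySpark API) showing why
# `column in some_list` raises ValueError while an isin-style method does not.

class FakeExpr:
    """Stands in for a lazily evaluated column expression."""

    def __init__(self, text):
        self.text = text

    def __eq__(self, other):
        # Comparing builds a new expression instead of returning True/False.
        return FakeExpr(f"({self.text} = {other!r})")

    def __invert__(self):
        # `~expr` builds a negated expression.
        return FakeExpr(f"(NOT {self.text})")

    def __bool__(self):
        # An unevaluated expression has no single truth value.
        raise ValueError("Cannot convert column into bool")

    def isin(self, values):
        # Membership is built as one expression, so bool() is never called.
        return FakeExpr(f"({self.text} IN {tuple(values)!r})")


score = FakeExpr("score")
l = [10, 18, 20]

# `score in l` compares score with each element, then calls bool() on the
# resulting expression, which is where the ValueError is raised.
try:
    score in l
    raised = False
except ValueError:
    raised = True

keep = score.isin(l)    # expression meaning "score is one of the list values"
drop = ~score.isin(l)   # negated expression, for excluding those values

print(raised)
print(keep.text)
print(drop.text)
```

With a real PySpark DataFrame, the corresponding calls would be `df.filter(df.score.isin(l))` to keep only the listed scores and `df.filter(~df.score.isin(l))` to exclude them, using the `~` operator that the error message mentions.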
l_bc = sc.broadcast(l), then df.where(df.score.isin(l_bc.value)). - Alex_Gidiotis