PySpark: select rows by the IDs in another dataframe's column

Question

python apache-spark pyspark

3

I want to:

filter df1 by the condition time_create==last_timestamp,
filter df2 by the store_product_id values selected from df1.

Here I use only df1 as the example.

Selecting by time_create works fine:

df1[df1.time_create==last_timestamp].show()

However, filtering the original dataframe df1 by the selected store_product_id gave me far too many rows:

df1[df1.store_product_id.isin(df1[df1.time_create==last_timestamp].store_product_id)].show()

I also tried collecting the list of store_product_id values that match time_create==last_timestamp:

ids = df1[df1.time_create==last_timestamp].select('store_product_id').collect()
df1[df1.store_product_id.isin(ids)].show()

But that raised an error:

Py4JJavaError: An error occurred while calling z:org.apache.spark.sql.functions.lit.
: java.lang.RuntimeException: Unsupported literal type class java.util.ArrayList [01e8f3c0-3ad5-4b69-b46d-f5feb3cadd5f]
    at org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:78)
    at org.apache.spark.sql.catalyst.expressions.Literal$$anonfun$create$2.apply(literals.scala:164)
    at org.apache.spark.sql.catalyst.expressions.Literal$$anonfun$create$2.apply(literals.scala:164)
    at scala.util.Try.getOrElse(Try.scala:79)
    at org.apache.spark.sql.catalyst.expressions.Literal$.create(literals.scala:163)
    at org.apache.spark.sql.functions$.typedLit(functions.scala:127)
    at org.apache.spark.sql.functions$.lit(functions.scala:110)
    at org.apache.spark.sql.functions.lit(functions.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)

What is the right way to do this?

- Mithril

Replace .select('store_product_id') with .select(['store_product_id'])? - ma3oun

@ma3oun .select('store_product_id') already works fine. The error comes from df1[df1.store_product_id.isin(ids)]. It looks like isin only accepts a Python list or tuple. But the earlier code did not even fail, which is very strange. - Mithril

1

The collected ids are probably a list of Row objects. I guess you need a list of values, so use toPandas instead of collect and then extract the list of values. - ags29

2 Answers

0

As @ags29 said,

the result of df1[df1.time_create==last_timestamp].select(['store_product_id']).collect() is a list of Row objects:

[Row(store_product_id=u'01e8f3c0-3ad5-4b69-b46d-f5feb3cadd5f')]

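I need to convert the rows to strings. The correct way is:

ids = df1[df1.time_create==last_timestamp].select('store_product_id').collect()
# Build a real list: in Python 3, map() returns an iterator, which isin() does not unpack
ids = [row.store_product_id for row in ids]
df1[df1.store_product_id.isin(ids)].show()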

This is very different from pandas.

- Mithril

- ma3oun · Accepted Answer

The function you are looking for is join. Here is a simple example based on the data you provided:

import pyspark as sp
from pyspark.sql import SparkSession

samples = [{'store_product_id':1,'time_create':2,'last_timestamp':3},{'store_product_id':2,'time_create':2,'last_timestamp':2},{'store_product_id':3,'time_create':4,'last_timestamp':4},{'store_product_id':4,'time_create':2,'last_timestamp':5}]

spark = SparkSession \
        .builder \
        .appName('test') \
        .getOrCreate()

df1 = spark.createDataFrame(samples)
df1.show()

This gives the following result:

+--------------+----------------+-----------+
|last_timestamp|store_product_id|time_create|
+--------------+----------------+-----------+
|             3|               1|          2|
|             2|               2|          2|
|             4|               3|          4|
|             5|               4|          2|
+--------------+----------------+-----------+

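Let's filter by time and create another dataframe from the result:

df2 = df1.filter(df1.time_create==df1.last_timestamp)
# show() only prints and returns None, so there is nothing useful to assign
df2.select('store_product_id').show()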

+----------------+
|store_product_id|
+----------------+
|               2|
|               3|
+----------------+

This is where we join the two datasets on store_product_id:

# Keep the joined dataframe in df3; show() itself returns None
df3 = df1.join(df2,'store_product_id','inner')
df3.show()

+----------------+--------------+-----------+--------------+-----------+
|store_product_id|last_timestamp|time_create|last_timestamp|time_create|
+----------------+--------------+-----------+--------------+-----------+
|               3|             4|          4|             4|          4|
|               2|             2|          2|             2|          2|
+----------------+--------------+-----------+--------------+-----------+

The inner join returns the intersection of df1 and df2 based on store_product_id.