How do I get the max count in a Spark query?

Question

I'm a Spark beginner and I want to write a query that retrieves the most frequently visited web page.

My query is as follows:

mostPopularWebPageDF = logDF.groupBy("webPage").agg(functions.count("webPage").alias("cntWebPage")).agg(functions.max("cntWebPage")).show()

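With this query I only get a DataFrame containing the maximum count, but I want a DataFrame that contains both that count and the web page that has it.

Something like this: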

webPage            max(cntWebPage)
google.com         2

How can I solve this?

Thanks a lot.

- JackR

1 Answer

- T. Gawęda · Accepted Answer

In pyspark + SQL:

logDF.registerTempTable("logDF")

mostPopularWebPageDF = sqlContext.sql("""select webPage, cntWebPage from (
                                            select webPage, count(*) as cntWebPage, max(count(*)) over () as maxcnt 
                                            from logDF 
                                            group by webPage) as tmp
                                            where tmp.cntWebPage = tmp.maxcnt""")

It could probably be made more concise, but it works. I'll try to optimize it.

My result:

webPage      cntWebPage
google.com   2

For the dataset:

webPage    usersid
google.com 1
google.com 3
bing.com   10

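Explanation: the per-page counts are computed with GROUP BY and the count(*) function. The maximum of all those counts is computed with a window function, so for the dataset above the intermediate DataFrame (before the maxCount column is dropped) looks like this, where count and maxCount correspond to cntWebPage and maxcnt in the query: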

webPage    count  maxCount
google.com 2      2
bing.com   1      2

Then we select the rows whose count equals maxCount.
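
To make the three steps concrete, here is a minimal plain-Python sketch of the same logic (count per page, take the maximum of the counts, keep the rows that match it). It is only an illustration of the idea, not Spark code: the names rows, counts, max_count and most_popular are made up for this example, and the data is the small sample dataset shown above.

```python
from collections import Counter

# Sample rows mirroring the dataset above: (webPage, usersid).
rows = [("google.com", 1), ("google.com", 3), ("bing.com", 10)]

# Step 1: count rows per webPage (the GROUP BY + count(*) part).
counts = Counter(web_page for web_page, _ in rows)

# Step 2: take the maximum over all counts (the max(count(*)) over () part).
max_count = max(counts.values())

# Step 3: keep only the pages whose count equals the maximum.
most_popular = [(page, cnt) for page, cnt in counts.items() if cnt == max_count]

print(most_popular)  # [('google.com', 2)]
```

Because the filter compares against the maximum instead of sorting and taking the first row, every page tied for the top count is kept, which matches what the where clause in the SQL version does.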

Edit: I have removed the DSL version. It did not support window over (), and ordering changed the result. I'm very sorry for that mistake. The SQL version is correct.