Here is why your current approach does not work. First, you are trying to get an integer out of a Row type; the output of your collect looks like this:
>>> mvv_list = mvv_count_df.select('mvv').collect()
>>> mvv_list[0]
Out: Row(mvv=1)
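If you access the field on that Row like this:
>>> firstvalue = mvv_list[0].mvv
>>> firstvalue
Out: 1
you get the value of mvv. If you want every value of the column as a list, you can do this: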
>>> mvv_array = [int(row.mvv) for row in mvv_list]
>>> mvv_array
Out: [1,2,3,4]
But if you try the same thing for the other column, you get:
>>> mvv_count = [int(row.count) for row in mvv_count_df.collect()]
Out: TypeError: int() argument must be a string or a number, not 'builtin_function_or_method'
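This happens because count is a built-in method of Row (inherited from tuple) and the column has the same name, so row.count returns the method rather than the column value. One workaround is to rename the count column to _count: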
>>> renamed_df = mvv_count_df.selectExpr("mvv as mvv", "count as _count")
>>> mvv_count = [int(row._count) for row in renamed_df.collect()]
However, this workaround is not needed, because you can access the column with dictionary syntax:
>>> mvv_array = [int(row['mvv']) for row in mvv_count_df.collect()]
>>> mvv_count = [int(row['count']) for row in mvv_count_df.collect()]
And it will finally work!
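To see why attribute access fails for count while bracket access works, here is a minimal sketch. MiniRow is a hypothetical stand-in, not PySpark's actual Row class. It assumes only what matters here: a tuple subclass whose field names are resolved in __getattr__, which Python calls only after normal attribute lookup has failed.

```python
class MiniRow(tuple):
    """Hypothetical, simplified stand-in for pyspark.sql.Row (illustration only)."""

    def __new__(cls, **fields):
        row = super().__new__(cls, fields.values())
        row._names = list(fields)  # remember the field names in order
        return row

    def __getattr__(self, name):
        # Python calls __getattr__ only when normal lookup fails, so inherited
        # tuple methods such as count() and index() are found first.
        names = self.__dict__.get("_names", [])
        if name in names:
            return tuple.__getitem__(self, names.index(name))
        raise AttributeError(name)

    def __getitem__(self, key):
        # Bracket access with a string key goes straight to the field names.
        if isinstance(key, str):
            return tuple.__getitem__(self, self._names.index(key))
        return tuple.__getitem__(self, key)


row = MiniRow(mvv=1, count=5)
print(row.mvv)              # field value, resolved by __getattr__
print(callable(row.count))  # True: this is the inherited tuple.count method
print(row["count"])         # field value, resolved by __getitem__
```

The same shadowing applies to any column named after a tuple method, such as index, so bracket access is the safer habit when column names are not under your control.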
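list(df.select('mvv').toPandas()['mvv']). Arrow has been integrated into PySpark, which sped up toPandas significantly. If you are using Spark 2.3+, do not use the other approaches. See my answer for more benchmarking details. - Powers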
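The toPandas one-liner in this comment can be wrapped in a small helper. This is a sketch under assumptions: column_to_list is a hypothetical name, df is assumed to be a PySpark DataFrame, and the commented usage lines assume an existing SparkSession called spark. Arrow-based conversion is controlled by a configuration flag (spark.sql.execution.arrow.enabled in Spark 2.3/2.4, spark.sql.execution.arrow.pyspark.enabled in 3.0+), so it is worth checking that it is enabled before expecting the speedup.

```python
def column_to_list(df, name):
    """Return the values of column `name` as a Python list via toPandas().

    Hypothetical helper; `df` is assumed to be a pyspark.sql.DataFrame.
    """
    # Select only the needed column so a single column is sent to the driver.
    return list(df.select(name).toPandas()[name])


# Usage, assuming an existing SparkSession named `spark` (not created here):
# spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")  # Spark 3.0+
# mvv_array = column_to_list(mvv_count_df, "mvv")
```

Like collect(), this pulls the whole column onto the driver, so it only suits columns that fit in driver memory.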