Creating a DataFrame from lists

3

I am trying to create a Spark DataFrame in which each list is turned into a column.

Code:

import random
import string

def create_id(n):
    # Build a random ID of n lowercase letters and digits
    return ''.join(random.choice(string.ascii_lowercase + string.digits) for _ in range(n))

list_a = [create_id(25) for x in range(100)]
list_b = [create_id(25) for x in range(100)]

df = sc.parallelize([["a", list_a], ["b", list_b]]).toDF()

This results in:
    _1                                                _2
0   a   [dv2vtdl3sobadlw1svs39emp2n9ogwzzek8b6gvug7xkp...
1   b   [kdv6b9ehqx1t8kbxd77ha8435bhduyxp0ilv6e09wpejx..

This creates 100 columns instead of 100 rows:
df = sc.parallelize([list_a, list_b]).toDF()
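
The shape can be checked without Spark. The sketch below is plain Python and uses placeholder strings in place of the random IDs (an assumption made only for illustration). It shows that when each inner list is treated as one row, a list of two 100-element lists describes 2 rows by 100 columns, and that transposing it gives 100 rows of 2 values.

```python
# Plain-Python sketch; the values are placeholders, not the real random IDs.
list_a = [f"a{i}" for i in range(100)]
list_b = [f"b{i}" for i in range(100)]

# Each inner list is one row, so this layout is 2 rows x 100 columns.
wide = [list_a, list_b]
wide_shape = (len(wide), len(wide[0]))

# zip(*wide) transposes: one tuple per position, i.e. 100 rows x 2 columns.
tall = list(zip(*wide))
tall_shape = (len(tall), len(tall[0]))

print(wide_shape)  # (2, 100)
print(tall_shape)  # (100, 2)
print(tall[0])     # ('a0', 'b0')
```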

Does anyone know how to create a DataFrame with two columns and 100 rows?


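Does this answer your question? Manually create a pyspark dataframe - Steven
I have already looked at that, but it does not quite fit my case, because it uses tuples, and the position within the tuple determines which column a value ends up in. - Data Mastery
Then you do not understand how it works, because that is exactly how it should be done. - Steven
1 Answer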

4

Following the post "Manually create a pyspark dataframe":

import random
import string

# Assumes an existing SparkSession named `spark`
def create_id(n):
    return ''.join(random.choice(string.ascii_lowercase + string.digits) for _ in range(n))

list_a = [create_id(25) for _ in range(100)]
list_b = [create_id(25) for _ in range(100)]

df = spark.createDataFrame(zip(list_a,list_b), ['a', 'b'])

# OR

list_a_b = [(create_id(25), create_id(25)) for _ in range(100)]
df = spark.createDataFrame(list_a_b, ['a', 'b'])

df.show()
+--------------------+--------------------+
|                   a|                   b|
+--------------------+--------------------+
|68blfnltq9fh81c4y...|3fl1wb5h2euy3sgd7...|
|ac37fb7qif71zzjpr...|xbqzzgiq9s6t5jiqm...|
|72rk28znzr6jjsi69...|5wvl528eg5y3p1lsk...|
|fioqnla3ijvl5769s...|1xvs2592uaxadv1o4...|
|7der8ld8fd6vl6g9d...|lrup85xitjz1uhsfl...|
|gycdap4hodaxxggw8...|h2oz370tzo6fnpke3...|
|ccvqcyzeynuks63pq...|iut82y2k1irfdvep1...|
|ngq29fnq2usghspgh...|z6j4mibrrjznoc9s8...|
|3qb6xyk5c1kbg0xq1...|l10ouv4w24d66e0ak...|
|u6dcvzede90xa7zz2...|hnh571t9szy0pwjrp...|
|3122g38k47jm24t7f...|tzbxlua574l88qtw1...|
|6pnva6ow83yxexqp1...|0nfj3v59b8jh0qv1g...|
|kl7xyftax3z32ot8o...|0sf6iyiyxpyvyd5kj...|
|36qwiiifgbzba4n8c...|xt4lpkjle8qynnlpo...|
|owsgb02rnov8qrhvw...|1zu4oisit25y2g14i...|
|bcmg0flh4d9tnvnjc...|7lfwx9kf7qens70p8...|
|6sdy1e8i3y1w0rtpr...|gw79bsrx8jlse6ixu...|
|83h5iq10clte1gcpr...|kblufuhlwabu7sv3u...|
|7g20ga0m756f0qsj7...|1fzo40vwtrp0kud8j...|
|07tw66i7dpcphczz1...|9a8c9ditp9dzomxh4...|
+--------------------+--------------------+
only showing top 20 rows
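
One caveat with the zip approach: Python's zip() stops at the shortest input, so lists of unequal length would silently lose values. The sketch below is plain Python with made-up values; `columns_to_rows` is a hypothetical helper name, not part of PySpark. It checks the lengths before building the row tuples.

```python
def columns_to_rows(*columns):
    """Hypothetical helper: turn equal-length column lists into row tuples."""
    lengths = {len(col) for col in columns}
    if len(lengths) > 1:
        # zip() would silently stop at the shortest list, dropping values.
        raise ValueError(f"column lengths differ: {sorted(lengths)}")
    return list(zip(*columns))

rows = columns_to_rows(["x1", "x2", "x3"], ["y1", "y2", "y3"])
print(rows)  # [('x1', 'y1'), ('x2', 'y2'), ('x3', 'y3')]

# Plain zip() truncates instead of failing:
truncated = list(zip(["x1", "x2", "x3"], ["y1"]))
print(truncated)  # [('x1', 'y1')]
```

The resulting rows have the same list-of-tuples shape as list_a_b in the answer, so they could be passed to spark.createDataFrame(rows, ['a', 'b']) in the same way.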


1
Thanks, zip is a great fit for this use case :-) - Data Mastery
