I have a source system that gives me data like this:
Name   |Hobbies
----------------------------------
"Han"  |"Art;Soccer;Writing"
"Leia" |"Art;Baking;Golf;Singing"
"Luke" |"Baking;Writing"
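Each list of hobbies is semicolon-delimited. I want to convert this into a table-like structure, with one column per hobby and a flag indicating whether the person selected that hobby: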
Name   |Art |Baking |Golf |Singing |Soccer |Writing
--------------------------------------------------------------
"Han"  |1   |0      |0    |0       |1      |1
"Leia" |1   |1      |1    |1       |0      |0
"Luke" |0   |1      |0    |0       |0      |1
Here is some code to generate the sample data in a pandas DataFrame:
>>> import pandas as pd
>>> df = pd.DataFrame(
...     [
...         {'name': 'Han', 'hobbies': 'Art;Soccer;Writing'},
...         {'name': 'Leia', 'hobbies': 'Art;Baking;Golf;Singing'},
...         {'name': 'Luke', 'hobbies': 'Baking;Writing'},
...     ]
... )
>>> df
                   hobbies  name
0       Art;Soccer;Writing   Han
1  Art;Baking;Golf;Singing  Leia
2           Baking;Writing  Luke
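Currently I'm using the following code to get the data into a DataFrame with the desired structure, but it is really slow (my actual data set has about 1.5 million rows):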
>>> df2 = pd.DataFrame(columns=['name', 'value'])
>>>
>>> for index, row in df.iterrows():
...     for value in str(row['hobbies']).split(';'):
...         d = {'name': row['name'], 'value': value}
...         df2 = df2.append(d, ignore_index=True)
...
>>> df2 = df2.groupby('name')['value'].value_counts()
>>> df2 = df2.unstack(level=-1).fillna(0)
>>>
>>> df2
value  Art  Baking  Golf  Singing  Soccer  Writing
name
Han    1.0     0.0   0.0      0.0     1.0      1.0
Leia   1.0     1.0   1.0      1.0     0.0      0.0
Luke   0.0     1.0   0.0      0.0     0.0      1.0
Is there a more efficient way to do this?
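For reference, one direction that might be worth trying is the vectorized string method `Series.str.get_dummies`, which splits each string on a separator and builds the 0/1 indicator columns in one call. The sketch below applies it to the sample data; it assumes each name appears only once (with duplicate names, the rows would need to be combined afterwards, for example with `groupby(level=0).max()`), and I have not benchmarked it on 1.5 million rows. The variable name `indicators` is just for illustration.

```python
import pandas as pd

# Same sample data as in the question.
df = pd.DataFrame(
    [
        {'name': 'Han', 'hobbies': 'Art;Soccer;Writing'},
        {'name': 'Leia', 'hobbies': 'Art;Baking;Golf;Singing'},
        {'name': 'Luke', 'hobbies': 'Baking;Writing'},
    ]
)

# Use the name as the index, then split each hobbies string on ';'
# and build one integer 0/1 indicator column per distinct hobby.
indicators = df.set_index('name')['hobbies'].str.get_dummies(sep=';')

print(indicators)
```

This avoids the row-by-row loop and the repeated `append` calls, which copy the whole DataFrame on every iteration, so it should scale considerably better than the loop above.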