Split a column into several columns based on commas

3

I want to split an address column into specific columns, such as city and province.

I have a DataFrame that looks like this:

df:
+----------------------------------------------------------------------------------------------------------+
|location                                                                                          
+----------------------------------------------------------------------------------------------------------+
| Jl. Raya Pasir Putih No.6, RT.1/RW.6, Pasir Putih, Kec. Sawangan, Kota Depok, Jawa Barat 16519, Indonesia   
| Jl. Legenda Wisata, Wanaherang, Kec. Gn. Putri, Bogor, Jawa Barat 16965, Indonesia                 
| Jl. Blk. C7 No.17, Rangkapan Jaya Baru, Kec. Pancoran Mas, Kota Depok, Jawa Barat 16434, Indonesia 
| Jl. Cibuntu Sayuran No.12, Wr. Muncang, Kec. Bandung Kulon, Kota Bandung, Jawa Barat 40211, Indonesia
| 1 KOMP, Jl. Tirtawening No.10, Cisurupan, Kec. Cibiru, Kota Bandung, Jawa Barat 40614, Indonesia
+----------------------------------------------------------------------------------------------------------+

I want to extract the city and the province from it into separate columns.
The output might look like this:
df:

+-------------+-------------------+------------+
| location    |  Cities           |  province  | 
+-------------+-------------------+------------+
|  .....      |  Kota Depok       | Jawa Barat |    
|  .....      |  Bogor            | Jawa Barat |      
|  .....      |  Kota Depok       | Jawa Barat |     
|  .....      |  Kota Bandung     | Jawa Barat |    
|  .....      |  Kota Bandung     | Jawa Barat |   
+-------------+-------------------+------------+

I tried this approach:

def extract_city_state(a):
    asplit = a.split(",")
    city = asplit[-3].split()
    state = asplit[-2].split()[0:1]
    return city, state

df.join(
    df['location'].apply(
        lambda x: pd.Series(extract_city_state(x), index=["City", "State"])
    )
)

But it returns:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-29-64a945be5d02> in <module>
      1 df.join(
      2     df['location'].apply(
----> 3         lambda x: pd.Series(extract_city_state(x), index=["City", "State"])
      4     )
      5 )

~\anaconda3\lib\site-packages\pandas\core\series.py in apply(self, func, convert_dtype, args, **kwds)
   4043             else:
   4044                 values = self.astype(object).values
-> 4045                 mapped = lib.map_infer(values, f, convert=convert_dtype)
   4046 
   4047         if len(mapped) and isinstance(mapped[0], Series):

pandas/_libs/lib.pyx in pandas._libs.lib.map_infer()

<ipython-input-29-64a945be5d02> in <lambda>(x)
      1 df.join(
      2     df['location'].apply(
----> 3         lambda x: pd.Series(extract_city_state(x), index=["City", "State"])
      4     )
      5 )

<ipython-input-22-f1d63ccd82dc> in extract_city_state(a)
      1 def extract_city_state(a):
      2     asplit = a.split(",")
----> 3     city = asplit[-3].split()
      4     state = asplit[-2].split()[0:1]
      5     return city, state

IndexError: list index out of range

How can I overcome this?

2 Answers

1

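To avoid the error, index with the pandas .str accessor instead of plain Python list indexing; it returns NaN rather than raising when an element is missing. First create a Series of lists with Series.str.split, then take the province with Series.str.rsplit, which splits only on the last whitespace because n=1: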

s = df['location'].str.split(',')

df['city'] = s.str[-3]
df['province'] = s.str[-2].str.rsplit(n=1).str[0]
print (df)
                                            location           city  \
0  Jl. Raya Pasir Putih No.6, RT.1/RW.6, Pasir Pu...     Kota Depok   
1  Jl. Legenda Wisata, Wanaherang, Kec. Gn. Putri...          Bogor   
2  Jl. Blk. C7 No.17, Rangkapan Jaya Baru, Kec. P...     Kota Depok   
3  Jl. Cibuntu Sayuran No.12, Wr. Muncang, Kec. B...   Kota Bandung   
4  1 KOMP, Jl. Tirtawening No.10, Cisurupan, Kec....   Kota Bandung   

      province  
0   Jawa Barat  
1   Jawa Barat  
2   Jawa Barat  
3   Jawa Barat  
4   Jawa Barat  
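
A self-contained sketch of the same approach, for anyone who wants to try it without the original data. It assumes addresses shaped like those in the question (city third from the end, province plus postal code second from the end), adds str.strip() to drop the space left behind by splitting on ",", and includes one hypothetical short address to show that .str[-3] gives NaN instead of raising:

```python
import pandas as pd

df = pd.DataFrame({
    "location": [
        "Jl. Legenda Wisata, Wanaherang, Kec. Gn. Putri, Bogor, Jawa Barat 16965, Indonesia",
        "Jl. Cibuntu Sayuran No.12, Wr. Muncang, Kec. Bandung Kulon, Kota Bandung, Jawa Barat 40211, Indonesia",
        "Indonesia",  # hypothetical malformed row with no commas
    ]
})

# Series of lists, one list of comma-separated parts per address
parts = df["location"].str.split(",")

# Third part from the end is the city; strip the leading space
df["city"] = parts.str[-3].str.strip()

# Second part from the end is "<province> <postal code>";
# rsplit(n=1) splits once from the right, so element 0 is the province
df["province"] = parts.str[-2].str.strip().str.rsplit(n=1).str[0]

print(df[["city", "province"]])
```

The malformed row ends up with missing values in both new columns, which can then be located with df[df["city"].isna()].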

1
If you want to keep this function, just store the result of the lambda in a variable before joining it with df:
city_state_split = df['location'].apply(
        lambda x: pd.Series(extract_city_state(x), index=["City", "State"])
    )
df.join(city_state_split)
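
Since asplit[-3] can only raise IndexError when an address has fewer than three comma-separated parts, a guarded variant of the function may also be needed. The following is a sketch, not the asker's exact code: it returns plain strings rather than lists of words, assumes the province is followed by a postal code, and uses a hypothetical malformed row for illustration:

```python
import pandas as pd

def extract_city_state(address):
    """Return (city, province), or (None, None) if the address is too short."""
    parts = [p.strip() for p in address.split(",")]
    if len(parts) < 3:
        return None, None
    city = parts[-3]
    # Drop the trailing postal code: "Jawa Barat 16434" -> "Jawa Barat"
    province_parts = parts[-2].rsplit(maxsplit=1)
    province = province_parts[0] if province_parts else None
    return city, province

df = pd.DataFrame({
    "location": [
        "Jl. Blk. C7 No.17, Rangkapan Jaya Baru, Kec. Pancoran Mas, Kota Depok, Jawa Barat 16434, Indonesia",
        "Indonesia",  # hypothetical malformed row
    ]
})

city_state_split = df["location"].apply(
    lambda x: pd.Series(extract_city_state(x), index=["City", "State"])
)

# join returns a new DataFrame, so keep the result
result = df.join(city_state_split)
print(result[["City", "State"]])
```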
