Split a Pandas column into multiple columns using multiple string delimiters

Question


5

I have a dataframe:

id    info
1     Name: John Age: 12 Sex: Male
2     Name: Sara Age: 22 Sex: Female
3     Name: Mac Donald Age: 32 Sex: Male

I want to split the info column into 3 columns so that I get this final output:

id  Name        Age  Sex
1   John        12   Male
2   Sara        22   Female
3   Mac Donald  32   Male

I tried using the pandas split function.

df[['Name','Age','Sex']] = df.info.split(['Name'])

I would probably have to do this several times to get the result I want.

Is there a better way to do this?

PS: the info column also contains NaN values.

- Shubham R

4 Answers

2

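Regular expressions can be hard to write and read, so you can instead replace the places where you want to split with a comma (,), then call str.split() with expand=True. Assign the result back to three new columns, which you create with df[['Name', 'Age', 'Sex']]: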

df[['Name', 'Age', 'Sex']] = (df['info'].replace(['Name: ', ' Age: ', ' Sex: '], ['',',',','], regex=True)
                              .str.split(',', expand=True))
df

Out[1]: 
   id                                info        Name Age     Sex
0   1        Name: John Age: 12 Sex: Male        John  12    Male
1   2      Name: Sara Age: 22 Sex: Female        Sara  22  Female
2   3  Name: Mac Donald Age: 32 Sex: Male  Mac Donald  32    Male
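
As a self-contained sketch of the same replace-then-split idea, the snippet below assumes a separator that never occurs in the data (| is a choice made here for illustration) and adds one NaN row to mirror the PS in the question. The names sep, marked, parts and result are hypothetical.

```python
import numpy as np
import pandas as pd

# Sample data from the question, plus one NaN row (an assumption for illustration)
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "info": [
        "Name: John Age: 12 Sex: Male",
        "Name: Sara Age: 22 Sex: Female",
        "Name: Mac Donald Age: 32 Sex: Male",
        np.nan,
    ],
})

sep = "|"  # hypothetical separator; pick any character absent from the data

# Drop the leading label and turn the other two labels into the separator
marked = df["info"].replace(["Name: ", " Age: ", " Sex: "], ["", sep, sep], regex=True)

# A one-character pattern is treated as a literal string by str.split
parts = marked.str.split(sep, expand=True)
parts.columns = ["Name", "Age", "Sex"]

# NaN rows stay NaN in every new column
result = df[["id"]].join(parts)
print(result)
```

Using a separator other than a comma only matters if a value could itself contain a comma; for the data shown in the question either choice should behave the same.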

- David Erickson

2

A quick one-liner could be:

df[['Name', 'Age', 'Sex']] = df['info'].str.split(r'\s?\w+:\s?', expand=True).iloc[:, 1:]

This splits on every "someword:" label and then assigns the pieces to the new columns.
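
A minimal sketch of how the pieces fit together, assuming the sample data from the question plus one NaN row. It shows why the first column of the split is discarded and applies str.strip() as a safeguard against stray whitespace. The names parts, fields and result are hypothetical.

```python
import numpy as np
import pandas as pd

# Sample data from the question, plus one NaN row (an assumption for illustration)
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "info": [
        "Name: John Age: 12 Sex: Male",
        "Name: Sara Age: 22 Sex: Female",
        "Name: Mac Donald Age: 32 Sex: Male",
        np.nan,
    ],
})

# A multi-character pattern is interpreted as a regular expression
parts = df["info"].str.split(r"\s?\w+:\s?", expand=True)

# Column 0 holds the empty string before the leading "Name:" label,
# so only columns 1..3 carry data
fields = parts.iloc[:, 1:].set_axis(["Name", "Age", "Sex"], axis=1)

# Strip any leftover whitespace; NaN rows are left untouched
fields = fields.apply(lambda col: col.str.strip())

result = df[["id"]].join(fields)
print(result)
```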

- Vishnudev Krishnadas

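Ah, I forgot about \w+. Nice! But this leaves whitespace, so you need to use str.strip() or improve the regex. For a start, you could use \w+: (with a trailing space) instead of \w+:, but that only removes the last space. - David Erickson

Strip won't help, and this doesn't create spaces either. The extra column comes from the fact that the string starts with the split pattern. @DavidErickson - Vishnudev Krishnadas

If you run df['Name'].to_list() or df['Age'].to_list(), you will see a ' ' before and after each string. - David Erickson

Oh! You mean after I get the result. Yes, strip should fix that. - Vishnudev Krishnadas

Is there a way to assign the column names automatically? I have a similar problem where the "info" column is very large and I want to spread it over many columns. For example, a record might look like "a:1 b:2 c:3 d:4 ........... pop:500", and I would like the column names to be generated automatically. - Regressor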
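
One possible way to approach this, sketched under the assumption that every record is a run of key:value pairs whose values contain no whitespace (as in the a:1 b:2 example), is to collect the pairs with str.findall and let pandas build the columns from the keys. The sample data and the names pairs, records, wide and result are made up for illustration.

```python
import pandas as pd

# Hypothetical records in the "key:value key:value ..." shape, with one missing row
df = pd.DataFrame({"info": ["a:1 b:2 c:3", "a:4 c:6 pop:500", None]})

# Each cell becomes a list of (key, value) tuples; missing cells stay missing
pairs = df["info"].str.findall(r"(\w+):\s*(\S+)")

# Turn each list into a dict; a missing cell becomes an empty dict
records = [dict(p) if isinstance(p, list) else {} for p in pairs]

# The dict keys become the column names; absent keys are filled with NaN
wide = pd.DataFrame(records, index=df.index)

result = df.join(wide)
print(result)
```

Values that contain spaces (such as "Mac Donald") would need a different value pattern than \S+, so this sketch does not apply unchanged to the data in the question.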


0

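import re
import pandas as pd

def process_row(row):
    # Leave the new columns empty when info is missing (NaN)
    if pd.isna(row['info']):
        return row
    # Split on the labels, not on spaces, so "Mac Donald" stays in one piece
    items = re.split(r'\s*(?:Name|Age|Sex):\s*', row['info'])
    row['Name'] = items[1].strip()
    row['Age'] = items[2].strip()
    row['Sex'] = items[3].strip()
    return row

df = pd.DataFrame({"info": ['Name: John Age: 12 Sex: Male', 'Name: Sara Age: 22 Sex: Female',
                            'Name: Mac Donald Age: 32 Sex: Male']})
df['Name'] = pd.NA  # empty cell
df['Age'] = pd.NA  # empty cell
df['Sex'] = pd.NA  # empty cell

df[['info', 'Name', 'Age', 'Sex']] = df.apply(process_row, axis=1, result_type="expand")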

- paytam

Content provided by Stack Overflow.

- Rakesh · Accepted Answer

Use a regular expression with named groups.

Example:

import pandas as pd

df = pd.DataFrame({"Col": ['Name: John Age: 12 Sex: Male', 'Name: Sara Age: 22 Sex: Female', 'Name: Mac Donald Age: 32 Sex: Male']})
# The lazy +? keeps trailing whitespace out of the Name group
df = df['Col'].str.extract(r"Name:\s*(?P<Name>[A-Za-z\s]+?)\s*Age:\s*(?P<Age>\d+)\s*Sex:\s*(?P<Sex>Male|Female)")
# Or, if the spacing is standard, use:
# df['Col'].str.extract(r"Name: (?P<Name>[A-Za-z\s]+) Age: (?P<Age>\d+) Sex: (?P<Sex>Male|Female)")
print(df)

Output:

         Name Age     Sex
0        John  12    Male
1        Sara  22  Female
2  Mac Donald  32    Male
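
To tie the named-group approach back to the original question, here is a hedged sketch that keeps the id column, tolerates NaN in info, and converts Age to a nullable integer. The pattern is a looser variant chosen for illustration (.+? for the name, \w+ for the sex), and the names pattern, extracted and result are hypothetical.

```python
import numpy as np
import pandas as pd

# Sample data from the question, plus one NaN row (an assumption for illustration)
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "info": [
        "Name: John Age: 12 Sex: Male",
        "Name: Sara Age: 22 Sex: Female",
        "Name: Mac Donald Age: 32 Sex: Male",
        np.nan,
    ],
})

# Named groups become the column names of the extracted frame
pattern = r"Name:\s*(?P<Name>.+?)\s*Age:\s*(?P<Age>\d+)\s*Sex:\s*(?P<Sex>\w+)"

# Rows that are NaN or do not match produce NaN in every group
extracted = df["info"].str.extract(pattern)

# Int64 is the nullable integer dtype, so missing ages stay missing
extracted["Age"] = pd.to_numeric(extracted["Age"]).astype("Int64")

result = df[["id"]].join(extracted)
print(result)
```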