How to split strings in a Pandas DataFrame column based on multiple conditions?

4

I have a pandas DataFrame that looks like this:

    ID       Col.A

28654      This is a dark chocolate which is sweet 
39876      Sky is blue 1234 Sky is cloudy 3423
88776      Stars can be seen in the dark sky
35491      Schools are closed 4568 but shops are open

I am trying to split Col.A before the word dark or before digits. My expected result is as follows.

     ID             Col.A                             Col.B
    
    28654      This is a                  dark chocolate which is sweet 
    39876      Sky is blue                1234 Sky is cloudy 3423
    88776      Stars can be seen in the   dark sky
    35491      Schools are closed         4568 but shops are open

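I tried grouping the rows that contain the word dark into one DataFrame and the rows that contain digits into another, then splitting each accordingly. After that I can concatenate the resulting DataFrames to get the expected result. The code looks like this: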
import pandas as pd

df = pd.DataFrame({'ID': [28654, 39876, 88776, 35491],
                   'Col.A': ['This is a dark chocolate which is sweet',
                             'Sky is blue 1234 Sky is cloudy 3423',
                             'Stars can be seen in the dark sky',
                             'Schools are closed 4568 but shops are open']})

# Rows that contain the word "dark"
df1 = df[df['Col.A'].str.contains(' dark ')]
# Remaining rows: left merge with an indicator, then drop the rows found in df1
df2 = df.merge(df1, indicator=True, how='left').loc[lambda x: x['_merge'] != 'both']
# Split each group on its own delimiter
df1 = df1['Col.A'].str.split(' dark ', expand=True)
df2 = df2['Col.A'].str.split(r'\d+', expand=True)
pd.concat([df1, df2], axis=0)

The result does not match what I expected:
      0                              1
0   This is a                   chocolate which is sweet
2   Stars can be seen in the     sky    
1   Sky is blue                  Sky is cloudy  
3   Schools are closed           but shops are open

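The digits in the strings and the word dark are missing from the result.

So how can I solve this and get the result without losing the word and the digits I split on?

Is there a way to "slice before the desired word or digits" without removing them?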

3 Answers

7

Series.str.split

s = df['Col.A'].str.split(r'\s+(?=\b(?:dark|\d+)\b)', n=1, expand=True)
df[['ID']].join(s.set_axis(['Col.A', 'Col.B'], axis=1))

      ID                     Col.A                          Col.B
0  28654                 This is a  dark chocolate which is sweet
1  39876               Sky is blue        1234 Sky is cloudy 3423
2  88776  Stars can be seen in the                       dark sky
3  35491        Schools are closed        4568 but shops are open

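Regex details:

  • \s+ : matches one or more whitespace characters
  • (?=\b(?:dark|\d+)\b) : positive lookahead
    • \b : word boundary, prevents a partial match
    • (?:dark|\d+) : non-capturing group
      • dark : first alternative, matches the string "dark"
      • \d+ : second alternative, matches one or more digits
    • \b : word boundary, prevents a partial match

See the regex demo for an online demonstration.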
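To see why the lookahead keeps the word and the digits, here is a minimal sketch that uses only Python's standard `re` module on the four sample strings from the question. The names `PATTERN`, `rows` and `parts` are just for illustration. As far as I know, pandas' `str.split` on ordinary object-dtype strings hands the pattern to `re`, so the per-row behaviour should be the same, but treat that as an assumption.

```python
import re

# Same pattern as in the answer: the whitespace is consumed by the split,
# while the lookahead only checks what follows and consumes nothing.
PATTERN = re.compile(r'\s+(?=\b(?:dark|\d+)\b)')

rows = [
    'This is a dark chocolate which is sweet',
    'Sky is blue 1234 Sky is cloudy 3423',
    'Stars can be seen in the dark sky',
    'Schools are closed 4568 but shops are open',
]

# maxsplit=1 plays the role of n=1 in Series.str.split
parts = [PATTERN.split(text, maxsplit=1) for text in rows]

for left, right in parts:
    print(repr(left), '|', repr(right))
```

Because only the whitespace is part of the match, "dark" and the digits stay at the start of the second piece.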


Cool. If I have both "dark" and "darkest" in the same row and I need to split before "dark", is there any way to do that? - Athul R T
1
@AthulRT Yes, we can do that. I have edited the answer. - Shubham Sharma
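A small sketch of the "dark" versus "darkest" case from this comment, again with the standard `re` module. The sample sentence and the variable names are made up for illustration and are not from the question.

```python
import re

text = 'The darkest night has a dark sky'  # hypothetical sample row

# With \b on both sides, "dark" must be a whole word.
with_boundary = re.split(r'\s+(?=\b(?:dark|\d+)\b)', text, maxsplit=1)

# Without \b, the lookahead also accepts the "dark" inside "darkest".
without_boundary = re.split(r'\s+(?=(?:dark|\d+))', text, maxsplit=1)

print(with_boundary)
print(without_boundary)
```

With the word boundaries the split lands before the standalone "dark"; without them it lands before "darkest".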

4
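With your shown samples, please try the following, which uses pandas' str.extract function. In short: the regex's first capture group uses a non-greedy match, and the second capture group starts at either digits or the string dark and runs to the end of the line. The two groups are then saved into the Col.A and Col.B columns.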
df[["Col.A","Col.B"]] = df['Col.A'].str.extract(r'(.*?)((?:dark|\d+).*)', expand=True)
df

With the shown samples, the output looks like this:
    ID      Col.A                       Col.B
0   28654   This is a                   dark chocolate which is sweet
1   39876   Sky is blue                 1234 Sky is cloudy 3423
2   88776   Stars can be seen in the    dark sky
3   35491   Schools are closed          4568 but shops are open
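One thing to watch with the extract approach: a row that contains neither "dark" nor digits does not match at all, so `str.extract` returns NaN for both groups, and assigning the result back would wipe that row's Col.A. Below is a hedged sketch of one way to handle it. The second sample row, the added `\s*` and `\b` in the pattern, and the `fillna` fallback are my own assumptions, not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({'Col.A': ['This is a dark chocolate',
                             'No keyword here']})  # second row is hypothetical

# Variation on the answer's pattern: \s* drops the space before the keyword,
# \b keeps "dark" from matching inside a longer word.
parts = df['Col.A'].str.extract(r'(.*?)\s*\b((?:dark|\d+)\b.*)', expand=True)

# Rows without a match come back as NaN in both columns;
# fall back to the original text for the first part.
parts[0] = parts[0].fillna(df['Col.A'])

print(parts)
```

Whether an unmatched row should keep its full text in Col.A or be left empty depends on your data, so adjust the fallback as needed.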

3
df[["Col.A", "Col.B"]] = df["Col.A"].str.split(
    r"\s*(dark.*|\d.*)", n=1, expand=True
)[[0, 1]]
print(df)

Prints:

      ID                     Col.A                          Col.B
0  28654                 This is a  dark chocolate which is sweet
1  39876               Sky is blue        1234 Sky is cloudy 3423
2  88776  Stars can be seen in the                       dark sky
3  35491        Schools are closed        4568 but shops are open
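For anyone wondering why this works and why `[[0, 1]]` is needed: when the split pattern contains a capture group, Python's `re.split` returns the captured text as well. Here the group runs to the end of the string, so a third, empty piece is left over. A minimal sketch with the standard `re` module follows (variable names are illustrative; I am assuming pandas behaves the same way for ordinary object-dtype strings).

```python
import re

pattern = re.compile(r'\s*(dark.*|\d.*)')

# The captured group is kept in the result; what follows the match is empty.
pieces = pattern.split('Sky is blue 1234 Sky is cloudy 3423', maxsplit=1)

print(pieces)
```

Selecting the first two pieces discards that trailing empty string. Note that this pattern has no word boundaries, so it would also split inside or before words that merely start with "dark".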
