Python: How to remove all text after the second comma from the left of a string

3

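For strings in a DataFrame that contain "County, Texas", I want to remove all text from the second comma (counting from the left) to the end of the string. For example: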

Before:

  1. "Jack Smith, Bank, Wilber, Lincoln County, Texas"
  2. "Jack Smith, Bank, Credit, Bank, Wilber, Lincoln County, Texas"
  3. "Jack Smith, Bank, Union, Credit, Bank, Wilber, Lincoln County, Texas, Branch, Landing, Services"
  4. "Jack Smith, Bank, Credit, Bank, Wilber, Branch, Landing, Services"

After:

  1. "Jack Smith, Bank"
  2. "Jack Smith, Bank"
  3. "Jack Smith, Bank, Union"
  4. "Jack Smith, Bank, Credit, Bank, Wilber, Branch, Landing, Services"

Thanks for your help!


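Could you provide more information about the DataFrame and create a minimal reproducible example? - Celius Stingher
@CeliusStingher It is pretty clear from the Before/After. What else did you have in mind? - user1717828
1
I'm trying to understand why rows 2 and 3 show "Jack Smith, Bank" instead of "Jack Smith, Union". Providing a minimal, verifiable example (MVE) would help resolve this. - Celius Stingher
One idea: search with the regex ^([^,]*,[^,]*),.*County, Texas.* and replace with \1, the captured group(1). - bobble bubble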
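
The regex idea in the last comment can be sketched with pandas' Series.str.replace. This is only an illustration under assumptions: the column name Col and the two sample rows are chosen here for brevity, and the pattern applies the original "keep the first two fields" rule, so it would not keep "Union" in the third case of the updated question.

```python
import pandas as pd

# Sample data (assumed for illustration; the column name 'Col' is arbitrary).
df = pd.DataFrame({'Col': [
    'Jack Smith, Bank, Wilber, Lincoln County, Texas',
    'Jack Smith, Bank, Credit, Bank, Wilber, Branch, Landing, Services',
]})

# Group 1 captures everything before the second comma. The rest of the
# pattern only matches when "County, Texas" appears later in the string,
# so rows without it are left unchanged.
pattern = r'^([^,]*,[^,]*),.*County, Texas.*'
df['Col'] = df['Col'].str.replace(pattern, r'\1', regex=True)

print(df['Col'].tolist())
```

Because the condition is part of the pattern itself, no separate str.contains() check is needed in this variant.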
3 Answers

4

Use mask together with str.contains() to change only the rows that match the condition, replacing them with the result of .str.split(', ').str[0:2].agg(', '.join):

df['Col'] = df['Col'].mask(df['Col'].str.contains('County, Texas'),
                           df['Col'].str.split(', ').str[0:2].agg(', '.join))

Full code:

import pandas as pd
df = pd.DataFrame({'Col': {0: 'Jack Smith, Bank, Wilber, Lincoln County, Texas',
  1: 'Jack Smith, Union, Credit, Bank, Wilber, Lincoln County, Texas',
  2: 'Jack Smith, Union, Credit, Bank, Wilber, Lincoln County, Texas, Branch, Landing, Services',
  3: 'Jack Smith, Union, Credit, Bank, Wilber, Branch, Landing, Services'}})
df['Col'] = df['Col'].mask(df['Col'].str.contains('County, Texas'),
                           df['Col'].str.split(', ').str[0:2].agg(', '.join))                            
df
Out[1]: 
                                                 Col
0                                   Jack Smith, Bank
1                                  Jack Smith, Union
2                                  Jack Smith, Union
3  Jack Smith, Union, Credit, Bank, Wilber, Branc...
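
As a hedged variant of the same idea, the join step can also be written with the string accessor's own .str.join, which avoids passing a plain function to Series.agg. The names has_county and first_two below are introduced here for readability and are not part of the answer above.

```python
import pandas as pd

s = pd.Series([
    'Jack Smith, Bank, Wilber, Lincoln County, Texas',
    'Jack Smith, Union, Credit, Bank, Wilber, Lincoln County, Texas',
    'Jack Smith, Union, Credit, Bank, Wilber, Lincoln County, Texas, Branch, Landing, Services',
    'Jack Smith, Union, Credit, Bank, Wilber, Branch, Landing, Services',
])

# Boolean flag per row: does the literal text "County, Texas" occur?
has_county = s.str.contains('County, Texas', regex=False)

# Split on ', ', keep the first two pieces, and join them back together.
first_two = s.str.split(', ').str[:2].str.join(', ')

# mask() substitutes first_two wherever has_county is True
# and keeps the original value everywhere else.
result = s.mask(has_county, first_two)
print(result.tolist())
```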

Based on the updated question, you can use np.select:
import numpy as np
import pandas as pd
df = pd.DataFrame({'Col': {0: 'Jack Smith, Bank, Wilber, Lincoln County, Texas',
  1: 'Jack Smith, Bank, Credit, Bank, Wilber, Lincoln County, Texas',
  2: 'Jack Smith, Bank, Union, Credit, Bank, Wilber, Lincoln County, Texas, Branch, Landing, Services',
  3: 'Jack Smith, Bank, Credit, Bank, Wilber, Branch, Landing, Services'}})
df['Col'] = np.select([df['Col'].str.contains('County, Texas') & ~df['Col'].str.contains('Union'),
                       df['Col'].str.contains('County, Texas') & df['Col'].str.contains('Union')],
                      [df['Col'].str.split(', ').str[0:2].agg(', '.join),
                       df['Col'].str.split(', ').str[0:3].agg(', '.join)],
                       df['Col'])                            
df
Out[2]: 
                                                 Col
0                                   Jack Smith, Bank
1                                   Jack Smith, Bank
2                            Jack Smith, Bank, Union
3  Jack Smith, Bank, Credit, Bank, Wilber, Branch...
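
The np.select call above repeats the split/slice/join expression twice. A minimal sketch of the same logic with that expression factored out follows; first_n_fields is a hypothetical helper introduced only for this illustration, and the rule "keep three fields when the row mentions Union" is taken from the answer, not derived independently.

```python
import numpy as np
import pandas as pd

def first_n_fields(col: pd.Series, n: int) -> pd.Series:
    """Keep the first n ', '-separated fields of each string (hypothetical helper)."""
    return col.str.split(', ').str[:n].str.join(', ')

col = pd.Series([
    'Jack Smith, Bank, Credit, Bank, Wilber, Lincoln County, Texas',
    'Jack Smith, Bank, Union, Credit, Bank, Wilber, Lincoln County, Texas, Branch',
    'Jack Smith, Bank, Credit, Bank, Wilber, Branch, Landing, Services',
])

in_texas = col.str.contains('County, Texas', regex=False)
has_union = col.str.contains('Union', regex=False)

# np.select checks the conditions in order and takes the choice belonging
# to the first condition that is True; rows matching none get the default.
result = pd.Series(
    np.select(
        [in_texas & ~has_union, in_texas & has_union],
        [first_n_fields(col, 2), first_n_fields(col, 3)],
        default=col,
    ),
    index=col.index,
)
print(result.tolist())
```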

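Thank you, David. But in my updated third case, how do I keep what comes after the second comma? (That is, the word "Union" should be kept.) - Andrew
@Andrew, please see the updated answer. In future, please ask a new question that references this one, because changing the original question can completely change the solution. Thanks! - David Erickson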

2

You can simply use a combination of map, lambda, split, and join:

df['Example'] = df['Example'].map(lambda x: ','.join(x.split(',')[0:2]) if 'County, Texas' in x else x)

In this case:
import pandas as pd
df = pd.DataFrame({'Example':["Jack Smith, Bank, Wilber, Lincoln County, Texas","Jack Smith, Union, Credit, Bank, Wilber, Lincoln County, Texas",
                              "Jack Smith, Union, Credit, Bank, Wilber, Lincoln County, Texas, Branch, Landing, Services",
                              "Jack Smith, Union, Credit, Bank, Wilber, Branch, Landing, Services"]})
df['Example'] = df['Example'].map(lambda x: ','.join(x.split(',')[0:2]) if 'County, Texas' in x else x)

We get the following output:
                                             Example
0                                   Jack Smith, Bank
1                                  Jack Smith, Union
2                                  Jack Smith, Union
3  Jack Smith, Union, Credit, Bank, Wilber, Branc...
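
The lambda can also be written as a small named function, which may be easier to test on its own. This is a sketch only: the function name keep_first_two and its marker parameter are assumptions made for the example. Passing maxsplit=2 to str.split stops splitting after the second comma, so the rest of the string is never broken into pieces.

```python
def keep_first_two(text: str, marker: str = 'County, Texas') -> str:
    """Return the text before the second comma if marker occurs, else text unchanged."""
    if marker not in text:
        return text
    # maxsplit=2 yields at most three pieces; the third (the remainder) is dropped.
    return ','.join(text.split(',', 2)[:2])

examples = [
    'Jack Smith, Bank, Wilber, Lincoln County, Texas',
    'Jack Smith, Union, Credit, Bank, Wilber, Lincoln County, Texas',
    'Jack Smith, Union, Credit, Bank, Wilber, Branch, Landing, Services',
]
shortened = [keep_first_two(e) for e in examples]
print(shortened)
```

With a DataFrame, the same function could be passed directly to map, for example df['Example'].map(keep_first_two).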

1
Data:
import pandas as pd
df = pd.DataFrame({'text':["Jack Smith, Bank, Wilber, Lincoln County, Texas","Jack Smith, Union, Credit, Bank, Wilber, Lincoln County, Texas",
                              "Jack Smith, Union, Credit, Bank, Wilber, Lincoln County, Texas, Branch, Landing, Services",
                              "Jack Smith, Union, Credit, Bank, Wilber, Branch, Landing, Services"]})

Solution: use .str.extract
df['newtext'] = df.text.str.extract(r'(^\w+\s\w+\,\s\w+)')



                                           text            newtext
0    Jack Smith, Bank, Wilber, Lincoln County, Texas   Jack Smith, Bank
1  Jack Smith, Union, Credit, Bank, Wilber, Linco...  Jack Smith, Union
2  Jack Smith, Union, Credit, Bank, Wilber, Linco...  Jack Smith, Union
3  Jack Smith, Union, Credit, Bank, Wilber, Branc...  Jack Smith, Union
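
Note that in the output above, row 3 is shortened as well, although it does not contain "County, Texas" and the question expects it to stay unchanged. One possible way to keep such rows intact is to combine the extraction with mask, sketched below. The pattern ^([^,]*,[^,]*) is a simplification chosen for this example (it takes everything before the second comma instead of matching word characters), and has_county and first_two are names introduced here.

```python
import pandas as pd

df = pd.DataFrame({'text': [
    'Jack Smith, Bank, Wilber, Lincoln County, Texas',
    'Jack Smith, Union, Credit, Bank, Wilber, Lincoln County, Texas',
    'Jack Smith, Union, Credit, Bank, Wilber, Branch, Landing, Services',
]})

has_county = df['text'].str.contains('County, Texas', regex=False)

# expand=False makes str.extract return a Series for a single capture group.
first_two = df['text'].str.extract(r'^([^,]*,[^,]*)', expand=False)

# Use the extracted prefix only for rows containing "County, Texas".
df['newtext'] = df['text'].mask(has_county, first_two)
print(df['newtext'].tolist())
```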
