Pandas: extract all regex matches from a column and join them with a delimiter

Question

6

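I need to extract all matches from the strings in one column and use them to populate a second column. The matches should be separated by commas.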

import re
import pandas as pd

df2 = pd.DataFrame([[1000, 'Jerry', 'string of text BR1001_BR1003_BR9009 more string','BR1003',''],
                [1001, '', 'BR1010_BR1011 random text', 'BR1010',''], 
                ['', '', 'test to discardBR3009', 'BR2002',''],
                [1003, 'Perry','BR4009 pure gibberish','BR1001',''],
                [1004, 'Perry2','','BR1001','']],
               columns=['ID', 'Name', 'REGEX string', 'Member of','Status'])

The pattern for the codes to extract:

BR_pat = re.compile(r'(BR[0-9]{4})', re.IGNORECASE)

Expected output in the column:

BR1001, BR1003, BR9009
BR1010, BR1011
BR3009
BR4009

My attempt:

df2['REGEX string'].str.extractall(BR_pat).unstack().fillna('').apply(lambda x: ", ".join(x))

Output:

 match
0  0        BR1001, BR1010, BR3009, BR4009
   1                    BR1003, BR1011, , 
   2                          BR9009, , ,

There are extra commas and missing rows. What am I doing wrong?

- EA Bubnoff

2 Answers

1

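You can also:

Add axis=1 to apply, so that the function is applied to each row of the unstacked frame rather than to each column.
Add filter(None, x) to drop the empty strings before joining.

This gives:

df2['REGEX string'].str.extractall(BR_pat).unstack().fillna('').apply(lambda x : ",".join(filter(None,x)), axis=1)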

- Klaus78
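
To illustrate the two fixes above, here is a minimal sketch that rebuilds only the 'REGEX string' column as a Series named s. Two things in it are assumptions and not part of the answer: the pattern is passed as a string with flags=re.IGNORECASE instead of the compiled BR_pat, and a final reindex is added because extractall drops rows that have no match at all (the empty string in the last row).

```python
import re
import pandas as pd

# Only the column that is searched, copied from the question's data.
s = pd.Series([
    'string of text BR1001_BR1003_BR9009 more string',
    'BR1010_BR1011 random text',
    'test to discardBR3009',
    'BR4009 pure gibberish',
    '',
], name='REGEX string')

# One row per match, indexed by (original row, match number).
matches = s.str.extractall(r'(BR[0-9]{4})', flags=re.IGNORECASE)

# One column per match number; rows with fewer matches are padded with ''.
wide = matches.unstack().fillna('')

# axis=1 joins across each row; filter(None, ...) skips the '' padding.
joined = wide.apply(lambda row: ", ".join(filter(None, row)), axis=1)

# Assumption: restore rows that had no match, as empty strings.
joined = joined.reindex(s.index, fill_value='')

print(joined)
```

Without axis=1 the lambda receives each match-number column in turn, which is why the original attempt produced three rows of codes taken from different records, padded with stray commas.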

- Wiktor Stribiżew · Accepted Answer

You can use:

>>> df2['REGEX string'].str.findall(r'BR\d{4}').str.join(", ")
0    BR1001, BR1003, BR9009
1            BR1010, BR1011
2                    BR3009
3                    BR4009
4                          
Name: REGEX string, dtype: object

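With Series.str.findall, you extract all occurrences of the pattern in each string value, which returns a "Series/Index of lists of strings". To merge each list into a single string, use Series.str.join().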
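
As a rough sketch of how this could populate the second column the question asks about, the snippet below writes the joined codes into 'Status'. Choosing 'Status' as the target column and passing flags=re.IGNORECASE (to mirror the question's BR_pat) are assumptions, not something stated in the answer.

```python
import re
import pandas as pd

df2 = pd.DataFrame(
    [[1000, 'Jerry', 'string of text BR1001_BR1003_BR9009 more string', 'BR1003', ''],
     [1001, '', 'BR1010_BR1011 random text', 'BR1010', ''],
     ['', '', 'test to discardBR3009', 'BR2002', ''],
     [1003, 'Perry', 'BR4009 pure gibberish', 'BR1001', ''],
     [1004, 'Perry2', '', 'BR1001', '']],
    columns=['ID', 'Name', 'REGEX string', 'Member of', 'Status'])

# Each cell becomes a list of every match in that row's string.
codes = df2['REGEX string'].str.findall(r'BR[0-9]{4}', flags=re.IGNORECASE)

# Join each list into one string; an empty list becomes ''.
# Assumption: 'Status' is the column to fill.
df2['Status'] = codes.str.join(", ")

print(df2[['REGEX string', 'Status']])
```

Because findall returns an empty list for rows without a match, every original row is kept and no extra filtering or reindexing is needed.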