Splitting a string while excluding specific substrings

Question

3

I am splitting strings into rows, using a comma as the delimiter.

for col in [col for col in df.loc[:,df.columns.str.contains(">")]]: #only on colnames containing ">"
    df[col] = df[col].str.split(", ")
    df = df.explode(col).reset_index(drop=True)

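However, three substrings contain a "natural" comma that should not cause a split:

Data related to sexual preferences, sex life, and/or sexual orientation
Contract, salary and benefits
Procurement, subcontracting and vendor management

Since there are only these three cases, I was wondering whether there is a way to make exceptions for them, for example with something like "preferences,", "sex life,", "Contract," and "Procurement,". Or is there a more elegant solution?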

Here is a sample df:

df = pd.DataFrame({"col > 1": ["Personals, Financials, Data related to sexual preferences, sex life, and/or sexual orientation", "Personals, Financials", "Vendors, Procurement, subcontracting and vendor management"]})

This is the expected output:

+-------------------------------------------------------------------------+
|                                 col > 1                                 |
+-------------------------------------------------------------------------+
| Personals                                                               |
| Financials                                                              |
| Data related to sexual preferences, sex life, and/or sexual orientation |
| Personals                                                               |
| Financials                                                              |
| Vendors                                                                 |
| Procurement, subcontracting and vendor management                       |
+-------------------------------------------------------------------------+

- torkestativ

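I have a similar problem, but I would like to use "" to indicate that the commas inside the quotes should be ignored. The answers below do not seem to take "" into account. - Maddenker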
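For the quoted variant raised in this comment, one possible approach (a sketch that does not come from the answers below) is to let Python's standard csv module do the splitting, since it already treats commas inside double quotes as part of the field. The helper name split_quoted and the sample string are made up for illustration.

```python
import csv

def split_quoted(text):
    # Hypothetical helper: parse one string as a single CSV record.
    # skipinitialspace=True ignores the space after each delimiter, so a
    # quote that follows ", " is still recognised as opening a quoted field.
    return next(csv.reader([text], skipinitialspace=True))

parts = split_quoted('Personals, "Contract, salary and benefits", Financials')
print(parts)
```

With pandas, the same helper could be applied per cell via df[col].apply(split_quoted) before calling explode.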

2 Answers

1

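You can pass a regex pattern with several negative lookbehind assertions to Series.str.split() to get, in effect, "split the row on , unless the , is preceded by ...".

In Python this is best done with several separate negative lookbehinds: Python's re module requires lookbehinds to be fixed-width, so a single negative lookbehind whose alternatives are separated by | does not work when the phrases differ in length.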
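The fixed-width restriction can be checked directly with the re module. The snippet below is only an illustration with a made-up test string: it shows that the single-lookbehind form is rejected at compile time, while chained lookbehinds compile and protect the listed phrases.

```python
import re

# A single lookbehind whose alternatives have different lengths is rejected.
try:
    re.compile(r"(?<!preferences|sex life),")
    single_lookbehind_ok = True
except re.error as exc:
    single_lookbehind_ok = False
    print("rejected:", exc)

# One lookbehind per phrase: each is fixed-width on its own, so this compiles.
chained = re.compile(r"(?<!preferences)(?<!sex life),")

# Only the comma after "a" is a split point; the other two are protected.
pieces = chained.split("a, sexual preferences, sex life, b")
print(pieces)
```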

To split on , unless it is preceded by one of the phrases from your example, you can use:

r"(?<!preferences)(?<!sex life)(?<!Contract)(?<!Procurement),"
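Two things are worth noting about this pattern. It splits on the bare comma, so every piece after the first keeps its leading space, and the phrases are hard-coded. As a sketch that is not part of the original answer, the pattern could be assembled from a list of phrases, with \s* appended so the whitespace after each comma is consumed as well. The names protected and pattern are chosen here for illustration.

```python
import re
import pandas as pd

# Phrases after which a comma must NOT split (taken from the question).
protected = ["preferences", "sex life", "Contract", "Procurement"]

# One fixed-width negative lookbehind per phrase, then the comma and any
# whitespace that follows it.
pattern = "".join(f"(?<!{re.escape(p)})" for p in protected) + r",\s*"

df = pd.DataFrame({"col > 1": [
    "Personals, Financials, Data related to sexual preferences, sex life, and/or sexual orientation",
    "Personals, Financials",
    "Vendors, Procurement, subcontracting and vendor management",
]})

# A multi-character pattern is treated as a regular expression by str.split.
df["col > 1"] = df["col > 1"].str.split(pattern)
df = df.explode("col > 1").reset_index(drop=True)
print(df["col > 1"].tolist())
```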

Full code example:

import pandas as pd

df = pd.DataFrame({"col > 1": ["Personals, Financials, Data related to sexual preferences, sex life, and/or sexual orientation", "Personals, Financials", "Vendors, Procurement, subcontracting and vendor management"]})

df["col > 1"] = df["col > 1"].str.split(r"(?<!preferences)(?<!sex life)(?<!Contract)(?<!Procurement),")

df = df.explode("col > 1").reset_index(drop=True)

This gives you a df with the desired ["col > 1"] values, as outlined in your question, along with a fresh 0...n index.

i.e.

                                             col > 1
0                                          Personals
1                                         Financials
2   Data related to sexual preferences, sex life,...
3                                          Personals
4                                         Financials
5                                            Vendors
6   Procurement, subcontracting and vendor manage...

- JPI93

- David Erickson · Accepted Answer

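You can temporarily replace the commas in these exceptions with another character (for example a semicolon ;).
Split each string into a list, using the comma as the delimiter.
Explode the DataFrame.
Replace the semicolons with commas again.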

import pandas as pd

df = pd.DataFrame({"col > 1": ["Personals, Financials, Data related to sexual preferences, sex life, and/or sexual orientation", "Personals, Financials", "Vendors, Procurement, subcontracting and vendor management"]})
r1 = ['Data related to sexual preferences, sex life, and/or sexual orientation',
      'Contract, salary and benefits',
      'Procurement, subcontracting and vendor management']
r2 = ['Data related to sexual preferences; sex life; and/or sexual orientation',
      'Contract; salary and benefits',
      'Procurement; subcontracting and vendor management']
df = df.replace(r1,r2, regex=True)
df['col > 1'] = df['col > 1'].str.split(',')
df = df.explode('col > 1').replace(r2,r1,regex=True)
df
Out[1]: 
                                             col > 1
0                                          Personals
0                                         Financials
0   Data related to sexual preferences, sex life,...
1                                          Personals
1                                         Financials
2                                            Vendors
2   Procurement, subcontracting and vendor manage...
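In the output above the original row index is repeated, and because the split is on a bare comma the pieces keep their leading spaces. A possible variation, sketched here under the assumption that the placeholder character never occurs in the data, derives the masked phrases automatically instead of maintaining a second list by hand. The function name split_protected and its parameters are hypothetical.

```python
import pandas as pd

# Phrases whose internal commas must survive the split (from the question).
exceptions = [
    "Data related to sexual preferences, sex life, and/or sexual orientation",
    "Contract, salary and benefits",
    "Procurement, subcontracting and vendor management",
]

def split_protected(series, phrases, placeholder=";"):
    # Assumption: `placeholder` does not appear anywhere in the data.
    masked = series
    for phrase in phrases:
        # Literal (non-regex) replacement hides the phrase's commas.
        masked = masked.str.replace(
            phrase, phrase.replace(",", placeholder), regex=False
        )
    # Split on the remaining commas, one piece per row, trimmed of spaces.
    parts = masked.str.split(",").explode().str.strip()
    # Restore the hidden commas and renumber the rows.
    return parts.str.replace(placeholder, ",", regex=False).reset_index(drop=True)

df = pd.DataFrame({"col > 1": [
    "Personals, Financials, Data related to sexual preferences, sex life, and/or sexual orientation",
    "Personals, Financials",
    "Vendors, Procurement, subcontracting and vendor management",
]})
result = split_protected(df["col > 1"], exceptions)
print(result.tolist())
```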