String split, excluding certain characters

3
I am splitting strings into rows, using a comma as the delimiter.
for col in [col for col in df.loc[:,df.columns.str.contains(">")]]: #only on colnames containing ">"
    df[col] = df[col].str.split(", ")
    df = df.explode(col).reset_index(drop=True)

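However, three substrings contain a "natural" comma that should not cause a split:

  1. Data related to sexual preferences, sex life, and/or sexual orientation
  2. Contract, salary and benefits
  3. Procurement, subcontracting and vendor management

Since there are only these three cases, I was wondering whether there is a way to make exceptions using something like: "preferences,", "sex life,", "Contract,", "Procurement,". Or is there a more elegant solution?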

Here is a sample df:

df = pd.DataFrame({"col > 1": ["Personals, Financials, Data related to sexual preferences, sex life, and/or sexual orientation", "Personals, Financials", "Vendors, Procurement, subcontracting and vendor management"]})

This is what the output should look like:

+-------------------------------------------------------------------------+
|                                 col > 1                                 |
+-------------------------------------------------------------------------+
| Personals                                                               |
| Financials                                                              |
| Data related to sexual preferences, sex life, and/or sexual orientation |
| Personals                                                               |
| Financials                                                              |
| Vendors                                                                 |
| Procurement, subcontracting and vendor management                       |
+-------------------------------------------------------------------------+

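I have a similar problem, but I would like to use "" to indicate that the commas inside should be ignored. The answers below do not seem to take "" into account. - Maddenker
2 Answers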

1
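  1. Temporarily replace the commas inside these exceptions with another character (for example, a semicolon ;).
  2. Split each string into a list, using the comma as the delimiter.
  3. Explode the dataframe.
  4. Replace the semicolons with commas again.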

import pandas as pd

df = pd.DataFrame({"col > 1": ["Personals, Financials, Data related to sexual preferences, sex life, and/or sexual orientation", "Personals, Financials", "Vendors, Procurement, subcontracting and vendor management"]})
r1 = ['Data related to sexual preferences, sex life, and/or sexual orientation',
      'Contract, salary and benefits',
      'Procurement, subcontracting and vendor management']
r2 = ['Data related to sexual preferences; sex life; and/or sexual orientation',
      'Contract; salary and benefits',
      'Procurement; subcontracting and vendor management']
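# Temporarily hide the "natural" commas behind semicolons.
df = df.replace(r1, r2, regex=True)
# Split on ", " so the pieces carry no leading space.
df['col > 1'] = df['col > 1'].str.split(', ')
# One row per piece, then restore the commas.
df = df.explode('col > 1').replace(r2, r1, regex=True)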
df
Out[1]: 
                                             col > 1
0                                          Personals
0                                         Financials
0   Data related to sexual preferences, sex life,...
1                                          Personals
1                                         Financials
2                                            Vendors
2   Procurement, subcontracting and vendor manage...
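
The four steps above can also be wrapped in a small helper. This is only a sketch: the name split_keep_phrases and its parameters are made up for illustration, and it assumes the placeholder character (a semicolon by default) never occurs in the data, because every placeholder is turned back into a comma at the end.

```python
import pandas as pd

def split_keep_phrases(series, phrases, placeholder=";"):
    """Split on ', ' but leave the commas inside `phrases` untouched."""
    masked = series
    for phrase in phrases:
        # Hide the commas of each protected phrase behind the placeholder.
        masked = masked.str.replace(
            phrase, phrase.replace(",", placeholder), regex=False
        )
    # One list per row, then one row per list element.
    exploded = masked.str.split(", ").explode()
    # Restore the hidden commas after the split.
    return exploded.str.replace(placeholder, ",", regex=False).reset_index(drop=True)

phrases = [
    "Data related to sexual preferences, sex life, and/or sexual orientation",
    "Contract, salary and benefits",
    "Procurement, subcontracting and vendor management",
]
s = pd.Series([
    "Personals, Financials, Data related to sexual preferences, sex life, and/or sexual orientation",
    "Personals, Financials",
    "Vendors, Procurement, subcontracting and vendor management",
])
result = split_keep_phrases(s, phrases)
```

Here result is a Series with one value per row and a fresh 0...n index; to use it on the dataframe you would apply the same idea to each column whose name contains ">".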

1
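You can pass a regex pattern with multiple negative lookbehind assertions to Series.str.split() to get, in essence, "split the row on , unless the , is preceded by ...".

In Python this is best done with several separate negative lookbehinds. Python's re module requires lookbehinds to be fixed-width, so a single negative lookbehind containing |-separated alternatives of different lengths will not compile.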
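
The fixed-width restriction can be checked directly with the standard re module. The following is a minimal sketch using two of the phrases from the question; the variable names and the sample string are illustrative only.

```python
import re

# A single lookbehind whose alternatives differ in length is rejected by `re`.
try:
    re.compile(r"(?<!preferences|sex life),")
    single_lookbehind_ok = True
except re.error:
    single_lookbehind_ok = False

# Chaining one fixed-width lookbehind per phrase compiles fine.
chained = re.compile(r"(?<!preferences)(?<!sex life),")

# Only the comma after "Personals" is a split point here.
parts = chained.split("Personals, sexual preferences, sex life, and more")
```

Each chained lookbehind has its own fixed width, which is why the second pattern is accepted while the first is not.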

To split on , unless it is preceded by any of the phrases from your example, you can use the following pattern (the trailing \s* also consumes the space after each separating comma, so the pieces carry no leading space):

r"(?<!preferences)(?<!sex life)(?<!Contract)(?<!Procurement),\s*"

Full code example:
import pandas as pd

df = pd.DataFrame({"col > 1": ["Personals, Financials, Data related to sexual preferences, sex life, and/or sexual orientation", "Personals, Financials", "Vendors, Procurement, subcontracting and vendor management"]})

df["col > 1"] = df["col > 1"].str.split(r"(?<!preferences)(?<!sex life)(?<!Contract)(?<!Procurement),\s*")

df = df.explode("col > 1").reset_index(drop=True)

This gives you a df with the desired ["col > 1"] values, as outlined in your question, along with a new index 0...n,

i.e.

                                             col > 1
0                                          Personals
1                                         Financials
2   Data related to sexual preferences, sex life,...
3                                          Personals
4                                         Financials
5                                            Vendors
6   Procurement, subcontracting and vendor manage...
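
If the list of exceptions grows, the lookbehinds can be generated rather than typed by hand. The sketch below uses a hypothetical helper, build_split_pattern, which is not part of pandas or re; it escapes each phrase and makes the pattern consume any whitespace after the comma. It is shown with re.split on a plain string, but the resulting pattern string could equally be passed to Series.str.split.

```python
import re

def build_split_pattern(phrases):
    """Return a regex matching a comma (plus trailing whitespace)
    that is not directly preceded by any of the given phrases."""
    # One fixed-width negative lookbehind per phrase, escaped to match literally.
    lookbehinds = "".join(f"(?<!{re.escape(p)})" for p in phrases)
    return lookbehinds + r",\s*"

pattern = build_split_pattern(["preferences", "sex life", "Contract", "Procurement"])

row = "Vendors, Procurement, subcontracting and vendor management"
parts = re.split(pattern, row)
```

Note that the lookbehinds only inspect the text immediately before each comma, so any other occurrence of, say, "Contract," in the data would be protected as well.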
