How to remove rows based on a column value when some rows' values are a subset of another row's?

Question

Suppose I have a DataFrame named df, as shown below:

index  company  url                  address
0      A        www.abc.contact.com  16D Bayberry Rd, New Bedford, MA, 02740, USA
1      A        www.abc.contact.com  MA, USA
2      A        www.abc.about.com    USA
3      B        www.pqr.com          New Bedford, MA, USA
4      B        www.pqr.com/about    MA, USA

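I want to remove every row in the DataFrame whose address is a subset of another address for the same company. For example, out of the five rows above I want to keep only these two: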

index  company  url                  address
0      A        www.abc.contact.com  16D Bayberry Rd, New Bedford, MA, 02740, USA
3      B        www.pqr.com          New Bedford, MA, USA

- Hari_pb

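What defines a "subset"? The string 'MA, USA' is not a substring of anything in company='A'. The first row does contain both words separately, but do you want each address component split on commas and checked individually? - ALollz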

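@ALollz, by "subset" I mean that, after removing punctuation, we should end up with one address string that contains all the other listed addresses (like string subset matching). - Hari_pb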

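@Harry_pb This is not a simple case. It may take a while to run, because you need to remove punctuation, then split the strings, then check whether all of the substrings appear in the "address" column for that company, and repeat this for every row. That's crazy! Can you simplify it somehow? - Maksim Terpilowski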
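
The steps discussed in these comments (remove punctuation, split, check containment) can be sketched in plain Python. This is only one possible reading of "subset": `address_tokens` is a hypothetical helper name, and treating every whitespace-separated word as a token is an assumption, so "New Bedford" becomes the two tokens "New" and "Bedford".

```python
import string


def address_tokens(address):
    """Strip ASCII punctuation, then split on whitespace into a set of words.

    Hypothetical helper: word-level tokens are an assumption, not something
    the thread specifies.
    """
    cleaned = address.translate(str.maketrans("", "", string.punctuation))
    return set(cleaned.split())


full = address_tokens("16D Bayberry Rd, New Bedford, MA, 02740, USA")
short = address_tokens("MA, USA")

# `<=` on sets is the subset test: every token of `short` also occurs in `full`.
is_subset = short <= full
```

Under this reading 'MA, USA' counts as a subset of the first address even though it is not a substring of it, which is the distinction raised in the first comment.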

1 Answer

- Teoretic · Accepted Answer

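Maybe this is not the optimal solution, but it does the job on this small dataframe. Edit: added a check on the company name, assuming punctuation has already been removed.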

import pandas as pd

df = pd.DataFrame({"company": ['A', 'A', 'A', 'B', 'B'],
                   "address": ['16D Bayberry Rd, New Bedford, MA, 02740, USA',
                               'MA, USA',
                               'USA',
                               'New Bedford, MA, USA',
                               'MA, USA']})
# Splitting each address on ', ' and making a set from every address to use "issubset" later
addresses = list(df['address'].apply(lambda x: set(x.split(', '))).values)
companies = list(df['company'].values)

rows_to_drop = []  # Storing row indexes to drop here
# Iterating by every address
for i, (address, company) in enumerate(zip(addresses, companies)):
    # Iterating over the remaining addresses
    rem_addr = addresses[:i] + addresses[(i + 1):]
    rem_comp = companies[:i] + companies[(i + 1):]

    for other_addr, other_comp in zip(rem_addr, rem_comp):
        # If address is a subset of another address, add it to drop
        if address.issubset(other_addr) and company == other_comp:
            rows_to_drop.append(i)
            break

df = df.drop(rows_to_drop)
print(df)

   company                                       address
0        A  16D Bayberry Rd, New Bedford, MA, 02740, USA
3        B                          New Bedford, MA, USA
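
One edge case worth noting: `issubset` is also true for equal sets, so if a company has two rows with the same address, the loop above marks both of them for dropping. The sketch below is a possible variant, not part of the original answer. It groups rows by company, drops a row only when its address is a strict subset of another row or a duplicate of an earlier one, and returns the positions to keep. The name `rows_to_keep` and the extra duplicate row are assumptions added for illustration.

```python
from collections import defaultdict


def rows_to_keep(companies, addresses):
    """Return positions of rows not covered by another row of the same company.

    Hypothetical helper. Addresses are split on ', ' as in the answer above;
    exact duplicates keep only their first occurrence.
    """
    # Collect (position, token set) pairs per company.
    groups = defaultdict(list)
    for pos, (company, address) in enumerate(zip(companies, addresses)):
        groups[company].append((pos, frozenset(address.split(", "))))

    keep = []
    for members in groups.values():
        for pos, tokens in members:
            covered = any(
                tokens < other  # strict subset of another row
                or (tokens == other and other_pos < pos)  # duplicate of an earlier row
                for other_pos, other in members
                if other_pos != pos
            )
            if not covered:
                keep.append(pos)
    return sorted(keep)


companies = ["A", "A", "A", "B", "B", "B"]
addresses = [
    "16D Bayberry Rd, New Bedford, MA, 02740, USA",
    "MA, USA",
    "USA",
    "New Bedford, MA, USA",
    "MA, USA",
    "New Bedford, MA, USA",  # duplicate row, added only for illustration
]
kept = rows_to_keep(companies, addresses)
```

The function takes plain sequences, so with a dataframe like the one above it could be applied as `df.iloc[rows_to_keep(df["company"], df["address"])]`. Comparing only within each company also avoids checking every row against rows of unrelated companies.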