Filtering a DataFrame with a list of nested dictionaries

Question

4

Sample data: I have a DataFrame called 'df':

id	a	b
1	HH	DOG
2	HH	CAT
3	W	DOG

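I also have a variable holding a list of nested dictionaries: filter_dict = [{'a': 'HH'}, {'a': 'W','b':'DOG'}].

How can I filter the data using DataFrame functions directly, without a loop?

Expected output: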

id	a	b
1	HH	DOG
2	HH	CAT
3	W	DOG

Now I want a second filter that excludes rows, using the same logic as the merge: remove_dict = [{'a': 'HH', 'b':'CAT'}].

Expected output:

id	a	b
1	HH	DOG
3	W	DOG

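The requirement is that I have a huge DataFrame and need to first include rows based on the values in one dictionary (both the values and the columns are dynamic), and then exclude rows based on another dictionary.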

- NITIN KOTHARI

1

What is your expected output? - tomjn

@tomjn I have just added it to the question. - NITIN KOTHARI

1

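Shouldn't the dictionary be a = [{'a': 'HH','b': 'DOG'}, {'a': 'W','b':'DOG'}] to get this output? - Hamza usman ghani

@pythonic833 I think the OP wants to filter on the values of columns 'a' and 'b'. - Peter Curran

@NITINKOTHARI: As Hamza usman ghani pointed out, please update your dictionary. - pythonic833

Thanks everyone. I have modified my question slightly, because I also want to remove rows, @Hamzausmanghani. - NITIN KOTHARI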

2 Answers

2

If you want a generic solution that does not even need to know which columns are specified in filter_dict, you can use a double reduce:

from functools import reduce
from operator import invert

import pandas as pd

def filter_df(df, filter_dict, option='keep'):
    # For each dictionary, AND its column == value conditions into one boolean mask.
    masks = [reduce(lambda x, y: x & y, [df[col] == val for col, val in el.items()])
             for el in filter_dict]
    # OR the per-dictionary masks into the final filter vector.
    slice_vector = reduce(lambda x, y: x | y, masks)
    if option == 'keep':
        return df.loc[slice_vector]
    elif option == 'exclude':
        return df.loc[invert(slice_vector)]
    else:
        raise NotImplementedError(f"Option {option} not implemented. Please choose between 'keep' and 'exclude'.")

Let's apply it to a few test cases:

data = {"id": [1,2,3], "a": ["HH", "HH", "W"], "b": ["DOG", "CAT", "DOG"]}
df = pd.DataFrame(data)

# test case 1
filter_dict_1 = [{'a': 'HH'}, {'a': 'W','b':'DOG'}]
df1 = filter_df(df, filter_dict_1, "keep")
print(df1)
#    id   a    b
# 0   1  HH  DOG
# 1   2  HH  CAT
# 2   3   W  DOG


# test case 2
filter_dict_2 = [{'a': 'HH', 'b': 'CAT'}]
df2 = filter_df(df, filter_dict_2, "exclude")
print(df2)
#   id   a    b
#0   1  HH  DOG
#2   3   W  DOG


# test case 3
filter_dict_3 = [{'a': 'HH', 'b':'CAT'}, {"a": 'HH'}]
df3 = filter_df(df, filter_dict_3, "exclude")
print(df3)
#   id  a    b
#2   3  W  DOG

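The idea is to first build one boolean vector per dictionary by combining that dictionary's individual conditions with &, and then to combine those vectors with | to produce the final filter vector.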
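To make the two levels of combination concrete, here is a minimal sketch that writes out the masks by hand for the sample data and the first filter list. The variable names mask_first, mask_second and final_mask are illustrative only and do not appear in the answer.

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3], "a": ["HH", "HH", "W"], "b": ["DOG", "CAT", "DOG"]})

# Dictionary {'a': 'HH'}: a single condition, so the mask is that condition.
mask_first = df["a"] == "HH"

# Dictionary {'a': 'W', 'b': 'DOG'}: both conditions must hold, so they are ANDed.
mask_second = (df["a"] == "W") & (df["b"] == "DOG")

# A row is kept if it matches any dictionary, so the masks are ORed.
final_mask = mask_first | mask_second

print(mask_first.tolist())   # [True, True, False]
print(mask_second.tolist())  # [False, False, True]
print(final_mask.tolist())   # [True, True, True]
```

The double reduce in filter_df builds the same final mask without knowing the column names in advance.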

- pythonic833

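@NITINKOTHARI, since the number of keys in the dictionaries can vary in your updated question, I recommend using this solution. Changing the logical operator switches the logic from keep to remove. (I will update the examples in this answer for you.) - Peter Curran

Many thanks @PeterCurran and @pythonic833, this is exactly what I needed for the code I am submitting. However, when I tried to submit it I hit a pylint issue: E1130: bad operand type for unary ~: object (invalid-unary-operand-type), for the line "return df.loc[~slice_vector]". I do not want to disable the check; could you tell me how to resolve it? - NITIN KOTHARI

@NITINKOTHARI Not sure what causes that, but you can try replacing ~slice_vector with operator.invert(slice_vector). - Peter Curran

@NITINKOTHARI: This appears to be a known issue with numpy and pandas structures, where pylint raises unnecessary warnings. It is safe to disable them here. I have also updated the answer following Peter Curran's suggestion, so it should no longer raise any warnings. - pythonic833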
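As a small illustration of the suggestion in these comments, the following sketch checks that operator.invert gives the same result as the unary ~ operator on a boolean Series. The Series used here is made-up example data, not taken from the question.

```python
from operator import invert

import pandas as pd

# Illustrative boolean mask (example data only).
mask = pd.Series([True, False, True])

inverted_with_operator = ~mask        # unary operator form
inverted_with_function = invert(mask)  # function form of the same operation

print(inverted_with_function.tolist())  # [False, True, False]
print(inverted_with_operator.equals(inverted_with_function))  # True
```

Both forms call the Series' own inversion, so the choice between them only affects how static checkers read the line, not the result.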


- Peter Curran · Accepted Answer

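If I understand correctly, you want to build a DataFrame from your "filter dictionaries" and then use pd.merge on the columns of interest to get the intersection. To get the difference, repeat the same steps and then use the id column to drop the intersecting ids from the original DataFrame.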

import pandas as pd


def filter_df(df, filter_dict, option='keep'):
    # Inner-merge df with a one-row frame per dictionary, then stack the matches.
    x = pd.concat([pd.merge(df, pd.DataFrame([dic]),
                            how='inner',
                            on=list(dic.keys()))
                   for dic in filter_dict], ignore_index=True).drop_duplicates()
    if option == "keep":
        return x
    elif option == "exclude":
        # Drop the rows of df whose id appears among the matched rows.
        return df[~df["id"].isin(x["id"])]
    else:
        raise NotImplementedError(f"Option {option} not implemented. Please choose between 'keep' and 'exclude'.")

Here are the test cases:

data = {"id": [1,2,3], "a": ["HH", "HH", "W"], "b": ["DOG", "CAT", "DOG"]}
df = pd.DataFrame(data)

# test case 1
filter_dict_1 = [{'a': 'HH'}, {'a': 'W','b':'DOG'}]
df1 = filter_df(df, filter_dict_1, "keep")
print(df1)
#    id   a    b
# 0   1  HH  DOG
# 1   2  HH  CAT
# 2   3   W  DOG


# test case 2
filter_dict_2 = [{'a': 'HH', 'b': 'CAT'}]
df2 = filter_df(df, filter_dict_2, "exclude")
print(df2)
#   id   a    b
#0   1  HH  DOG
#2   3   W  DOG


# test case 3
filter_dict_3 = [{'a': 'HH', 'b':'CAT'}, {"a": 'HH'}]
df3 = filter_df(df, filter_dict_3, "exclude")
print(df3)
#   id  a    b
#2   3  W  DOG
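The exclude branch above relies on the DataFrame having an id column. As a sketch only, and not part of the original answer, the same exclusion can be expressed as an anti-join using the indicator=True option of merge, which removes that assumption. The helper name exclude_rows is hypothetical, and it uses a short loop over the dictionaries.

```python
import pandas as pd


def exclude_rows(df, remove_dict):
    """Hypothetical helper: drop rows of df that match any dictionary in remove_dict."""
    keep_mask = pd.Series(True, index=df.index)
    for dic in remove_dict:
        # Left-merge against a one-row frame: df's row order and count are preserved,
        # and the _merge column records whether each row found a match.
        merged = df.merge(pd.DataFrame([dic]), how="left",
                          on=list(dic.keys()), indicator=True)
        # Rows marked "left_only" did not match this dictionary, so they stay.
        keep_mask = keep_mask & (merged["_merge"] == "left_only").to_numpy()
    return df[keep_mask]


df = pd.DataFrame({"id": [1, 2, 3], "a": ["HH", "HH", "W"], "b": ["DOG", "CAT", "DOG"]})

print(exclude_rows(df, [{"a": "HH", "b": "CAT"}])["id"].tolist())               # [1, 3]
print(exclude_rows(df, [{"a": "HH", "b": "CAT"}, {"a": "HH"}])["id"].tolist())  # [3]
```

This sketch assumes df has no column already named _merge and that the dictionary values have dtypes compatible with the columns they are merged on.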