Filter pandas dataframe rows if any value on a list inside the dataframe is in another list

Question

7

I have a pandas DataFrame whose split_categories column contains lists:

df.head()

      album_id categories split_categories
    0    66562    480.494       [480, 494]
    1   114582        128            [128]
    2     4846          5              [5]
    3     1709          9              [9]
    4    59239    105.104       [105, 104]

I want to select all rows where at least one category is in a specific list, [480, 9, 104].

Expected output:

  album_id categories split_categories
0    66562    480.494       [480, 494]
3     1709          9              [9]
4    59239    105.104       [105, 104]

I managed to do it with apply:

def match_categories(row):
    selected_categories =  [480, 9, 104]
    result = [int(i) for i in row['split_categories'] if i in selected_categories]
    return result

df['matched_categories'] = df.apply(match_categories, axis=1)

But this code takes too long to run in production (I have to run it on several columns that contain lists).

Is there a way to run something like:

df[~(df['split_categories'].anyvalue.isin([480, 9, 104]))]

Thanks

- Ary Jazz

What is the maximum size of the lists in df['split_categories']? For example, is it always 1 or 2 items? - jpp

5 Answers

2

You can expand the inner lists and check whether any of their items is in [480, 9, 104]:

l = [480, 9, 104]
df[df.categories.str.split('.', expand=True).isin(map(str,l)).any(axis=1)]

   album_id  categories split_categories
0     66562     480.494        [480,494]
3      1709       9.000              [9]
4     59239     105.104        [105,104]
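To make the intermediate step visible, here is a minimal sketch of what the expanded frame looks like. It assumes that categories holds dot-separated strings (the sample frame is rebuilt by hand for illustration), and it passes a plain list of strings to isin instead of a map object:

```python
import pandas as pd

# Hand-built sample data, assuming categories is a column of strings
df = pd.DataFrame({'album_id': [66562, 114582, 4846, 1709, 59239],
                   'categories': ['480.494', '128', '5', '9', '105.104']})
l = [480, 9, 104]

# One column per category; shorter rows are padded with missing values
expanded = df['categories'].str.split('.', expand=True)

# The split pieces are strings, so compare against strings
mask = expanded.isin([str(x) for x in l]).any(axis=1)
result = df[mask]
print(result)
```

Because the comparison is done on whole cells, a category such as 19 would not be mistaken for 9.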

- yatu

1

df.split_categories.str.strip('[]') returns an array of NaN (the values in split_categories are already lists, not strings). I used df[df.categories.str.split('.', expand=True).isin(map(str,l)).any(axis=1)] and it worked. Thanks. - Ary Jazz

1

Oh, I see, you have to use split_categories; I've updated the answer. - yatu

2

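Avoid a Series of lists

You can split the column into several numeric Series and then use vectorized Boolean operations. Python-level loops that work row by row are generally less efficient.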

import pandas as pd

df = pd.DataFrame({'album_id': [66562, 114582, 4846, 1709, 59239],
                   'categories': ['480.494', '128', '5', '9', '105.104']})

split = df['categories'].str.split('.', expand=True).add_prefix('split_').astype(float)
df = df.join(split)

print(df)
#    album_id categories  split_0  split_1
# 0     66562    480.494    480.0    494.0
# 1    114582        128    128.0      NaN
# 2      4846          5      5.0      NaN
# 3      1709          9      9.0      NaN
# 4     59239    105.104    105.0    104.0

L = [480, 9, 104]
# Keep rows where any split_ column holds one of the wanted categories
res = df[df.filter(regex='^split_').isin(L).any(axis=1)]

print(res)
#    album_id categories  split_0  split_1
# 0     66562    480.494    480.0    494.0
# 3      1709          9      9.0      NaN
# 4     59239    105.104    105.0    104.0

- jpp

2

Another approach:

my_list = [480, 9, 104]
pat = r'({})'.format('|'.join(str(i) for i in my_list))
# '(480|9|104)' <-- this is what pat looks like
# Note: without word boundaries, 9 also matches inside values such as 19 or 494
df.loc[df.split_categories.astype(str).str.extract(pat, expand=False).dropna().index]

Or:

pat = '|'.join(r"\b{}\b".format(x) for x in my_list)
df[df.split_categories.astype(str).str.contains(pat,na=False)]

    album_id    categories  split_categories
0   66562       480.494     [480, 494]
3   1709        9.000       [9]
4   59239       105.104     [105, 104]

This works for both the split_categories and categories columns.
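The difference between the two patterns only shows up on values that contain one of the wanted numbers as a substring. The sketch below uses a small made-up Series (the [19] row is not in the question's data) to compare the pattern without boundaries against the \b version:

```python
import pandas as pd

# Made-up data: [19] is added to show a substring false positive
s = pd.Series([[480, 494], [128], [19], [9]]).astype(str)
my_list = [480, 9, 104]

# Plain alternation: '480|9|104'
loose = '|'.join(str(i) for i in my_list)
# Alternation with word boundaries around every number
strict = '|'.join(r"\b{}\b".format(i) for i in my_list)

loose_mask = s.str.contains(loose)
strict_mask = s.str.contains(strict)

print(loose_mask.tolist())   # [True, False, True, True]
print(strict_mask.tolist())  # [True, False, False, True]
```

The pattern without boundaries flags [19] because it contains the character 9, while the \b version only matches whole numbers.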

- anky

1

str.contains is better - jezrael

@jezrael I tried that, but got the warning UserWarning: This pattern has match groups. To actually get the groups, use str.extract. :( - anky

1

Try this: pat = '|'.join(r"\b{}\b".format(x) for x in L) - jezrael

@jezrael That works, thanks. :) I'll edit it in. :) Still learning string formatting. :D - anky

1

Use:

selected = {480, 9, 104}
print(df[df['split_categories'].apply(lambda cats: any(int(c) in selected for c in cats))])

Output:

  album_id categories split_categories
0    66562    480.494       [480, 494]
3     1709          9              [9]
4    59239    105.104       [105, 104]

- U13-Forward

- jezrael · Accepted Answer

You can convert each list to a set, take the intersection, and convert the result to a Boolean:

import numpy as np
L = [480, 9, 104]
mask = np.array([bool(set(map(int, x)) & set(L)) for x in df['split_categories']])

Or convert the column of lists to a DataFrame, cast it to float, and compare with isin:

df1 = pd.DataFrame(df['split_categories'].values.tolist(), index=df.index)
mask = df1.astype(float).isin(L).any(axis=1)

df = df[mask]
print (df)
  album_id categories split_categories
0    66562    480.494       [480, 494]
3     1709          9              [9]
4    59239    105.104       [105, 104]
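For anyone who wants to try both masks side by side, here is a self-contained sketch. The sample frame is rebuilt by hand and assumes the lists hold integers (the real column may hold strings, which is why the int and float conversions are kept); the names wanted, wide, mask_sets and mask_wide are only illustrative:

```python
import numpy as np
import pandas as pd

# Hand-built sample data, assuming the inner lists hold integers
df = pd.DataFrame({
    'album_id': [66562, 114582, 4846, 1709, 59239],
    'split_categories': [[480, 494], [128], [5], [9], [105, 104]],
})
L = [480, 9, 104]
wanted = set(L)

# Approach 1: a non-empty set intersection means at least one match
mask_sets = np.array([bool(wanted & set(map(int, x)))
                      for x in df['split_categories']])

# Approach 2: one column per list position, padded with NaN
wide = pd.DataFrame(df['split_categories'].tolist(), index=df.index)
mask_wide = wide.astype(float).isin(L).any(axis=1)

filtered = df[mask_sets]
print(filtered)
```

Both masks select rows 0, 3 and 4 on this data. Which one is faster probably depends on how long and how ragged the lists are, so it is worth timing both on the real columns.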