Filtering a DataFrame in pandas: using a list of conditions

12

I have a pandas DataFrame with two columns, 'col1' and 'col2'.

I can filter on specific values of these two columns using:

df[ (df["col1"]=='foo') & (df["col2"]=='bar')]

Is there a way to filter on both columns at once?

I naively tried restricting the DataFrame to just the two columns, but my best guess for the right-hand side of the equality does not work:

df[df[["col1","col2"]]==['foo','bar']]

It produces this error:

ValueError: Invalid broadcasting comparison [['foo', 'bar']] with block values

I need this because both the column names and the number of columns the condition applies to will vary.

5 Answers

8
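As far as I know, pandas has no built-in way to do exactly what you want. However, although the following solution may not be the prettiest, you can zip a pair of parallel lists as follows: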
cols = ['col1', 'col2']
conditions = ['foo', 'bar']

df[eval(" & ".join(["(df['{0}'] == '{1}')".format(col, cond) 
   for col, cond in zip(cols, conditions)]))]

The string join produces:
>>> " & ".join(["(df['{0}'] == '{1}')".format(col, cond) 
    for col, cond in zip(cols, conditions)])

"(df['col1'] == 'foo') & (df['col2'] == 'bar')"

You then evaluate that string with eval, which is effectively:
df[eval("(df['col1'] == 'foo') & (df['col2'] == 'bar')")]

For example:

df = pd.DataFrame({'col1': ['foo', 'bar', 'baz'], 'col2': ['bar', 'spam', 'ham']})

>>> df
  col1  col2
0  foo   bar
1  bar  spam
2  baz   ham

>>> df[eval(" & ".join(["(df['{0}'] == {1})".format(col, repr(cond)) 
            for col, cond in zip(cols, conditions)]))]
  col1 col2
0  foo  bar

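A nice workaround that fits my needs perfectly. I'm accepting it, since there seems to be no pure "inside pandas" answer. - WNG
One of my columns holds numeric values, which the code could not handle, so I used repr to make it a bit more general. - WNG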
1
df[(df[["col1","col2"]].values==['foo','bar']).all(1)] - Shiang Hoo
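The one-liner in the comment above can be spelled out step by step. This is a sketch under the assumption that the conditions are listed in the same order as the columns. The object-dtype array is my own precaution to keep the comparison element-wise and is not part of the comment:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': ['foo', 'bar', 'baz'], 'col2': ['bar', 'spam', 'ham']})
cols = ['col1', 'col2']
conditions = ['foo', 'bar']

# Compare the 2-D array of the selected columns against the condition row;
# NumPy broadcasts the 1-D conditions across every row
matches = df[cols].to_numpy() == np.array(conditions, dtype=object)

# A row qualifies only if every selected column matched
row_mask = matches.all(axis=1)
result = df[row_mask]
```

Since cols and conditions are ordinary lists, this handles a varying number of columns without building any strings.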

7

I'd like to point out an alternative to the accepted answer, since eval is not necessary to solve this problem.

from functools import reduce

import pandas as pd

df = pd.DataFrame({'col1': ['foo', 'bar', 'baz'], 'col2': ['bar', 'spam', 'ham']})
cols = ['col1', 'col2']
values = ['foo', 'bar']
# Materialize the pairs: a bare zip object has no len() in Python 3
conditions = list(zip(cols, values))

def apply_conditions(df, conditions):
    assert len(conditions) > 0
    comps = [df[c] == v for c, v in conditions]
    result = comps[0]
    for comp in comps[1:]:
        result &= comp
    return result

# Equivalent, more compact version using functools.reduce
def apply_conditions(df, conditions):
    assert len(conditions) > 0
    comps = [df[c] == v for c, v in conditions]
    return reduce(lambda c1, c2: c1 & c2, comps[1:], comps[0])

df[apply_conditions(df, conditions)]
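For comparison, the same reduction can be handed to NumPy. This is a sketch, not part of the answer above, and it assumes every comparison yields a boolean Series of the same length:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': ['foo', 'bar', 'baz'], 'col2': ['bar', 'spam', 'ham']})
conditions = [('col1', 'foo'), ('col2', 'bar')]

# One boolean Series per (column, value) pair
comps = [df[c] == v for c, v in conditions]

# Stack the comparisons and AND them together along the first axis;
# the result is a plain boolean NumPy array, one entry per row
mask = np.logical_and.reduce(comps)
result = df[mask]
```

Swapping in np.logical_or.reduce would give the "any condition matches" variant.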

0

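I know I'm late to the party on this one, but if you know that all of your conditions are combined with the same operator, you can use functools.reduce. I have a CSV with about 64 columns, and I have no desire whatsoever to copy and paste them. Here is how I solved it: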

from functools import reduce

import pandas as pd

players = pd.read_csv('players.csv')

# I only want players who have any of the outfield stats over 0.
# That means they have to be an outfielder.
column_named_outfield = lambda x: x.startswith('outfield')

# If a column name starts with outfield, then it is an outfield stat. 
# So only include those columns
outfield_columns = filter(column_named_outfield, players.columns)

# Column must have a positive value
has_positive_value = lambda c:players[c] > 0
# We're looking to create a series of filters, so use "map"
list_of_positive_outfield_columns = map(has_positive_value, outfield_columns)

# Given two DF filters, this returns a third representing the "or" condition.
concat_or = lambda x, y: x | y
# Apply the filters through reduce to create a primary filter
is_outfielder_filter = reduce(concat_or, list_of_positive_outfield_columns)
outfielders = players[is_outfielder_filter]
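The same "any outfield stat above 0" filter can also be written as a single frame-level comparison. Below is a minimal sketch on made-up data. The column names and values are hypothetical stand-ins, not the contents of players.csv:

```python
import pandas as pd

# Hypothetical stand-in for players.csv
players = pd.DataFrame({
    'name': ['a', 'b', 'c'],
    'outfield_catches': [3, 0, 0],
    'outfield_assists': [0, 0, 2],
    'pitching_wins': [0, 5, 0],
})

# Columns whose name starts with "outfield"
outfield_columns = [c for c in players.columns if c.startswith('outfield')]

# Compare all outfield columns at once, then OR across each row
is_outfielder = (players[outfield_columns] > 0).any(axis=1)
outfielders = players[is_outfielder]
```

Using .all(axis=1) instead of .any(axis=1) gives the "and" version of the same filter.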

0
This is a fairly clean solution if your combining operator (& or |) is the same for all filters.
cols = ['col1', 'col2']
conditions = ['foo', 'bar']

filtered_rows = True
for col, condition in zip(cols, conditions):
    # update filtered_rows with each filter condition
    current_filter = (df[col] == condition)
    filtered_rows &= current_filter

df = df[filtered_rows]
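One edge case in the loop above: if cols is empty, filtered_rows stays the plain Python value True and the final indexing step no longer receives a row mask. Here is a sketch of a variant that starts from an explicit all-True Series instead. The helper name filter_rows is hypothetical:

```python
import pandas as pd

def filter_rows(df, cols, conditions):
    """Return the rows where every listed column equals its paired condition."""
    # Start from a mask that keeps every row, aligned to the frame's index
    filtered_rows = pd.Series(True, index=df.index)
    for col, condition in zip(cols, conditions):
        filtered_rows &= (df[col] == condition)
    return df[filtered_rows]

df = pd.DataFrame({'col1': ['foo', 'bar', 'baz'], 'col2': ['bar', 'spam', 'ham']})
result = filter_rows(df, ['col1', 'col2'], ['foo', 'bar'])
unfiltered = filter_rows(df, [], [])  # no conditions: all rows are kept
```

For an | version, start from pd.Series(False, index=df.index) and use |= inside the loop.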

0

Posting because I ran into a similar problem and found a solution that does it in one line, albeit somewhat inefficiently.

cols, vals = ["col1","col2"],['foo','bar']
pd.concat([df.loc[df[cols[i]] == vals[i]] for i in range(len(cols))], join='inner')

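This is meant to act as an & across columns. Be aware, though, that with the default axis=0 the join='inner' argument applies to the column labels, not the rows, so the call stacks the rows matching each condition, which amounts to an | with repeated rows. For a proper | across columns, omit join='inner' and append drop_duplicates() at the end. A true & requires intersecting the row indexes of the individual selections.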
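For reference, here is a sketch of how the per-condition selections can be combined explicitly, assuming the frame has a unique index. The & case intersects the row indexes, and the | case stacks the selections and drops repeated index labels. The sample data is hypothetical:

```python
from functools import reduce

import pandas as pd

df = pd.DataFrame({'col1': ['foo', 'bar', 'foo'], 'col2': ['bar', 'spam', 'ham']})
cols, vals = ['col1', 'col2'], ['foo', 'bar']

# One sub-frame per condition
selections = [df.loc[df[c] == v] for c, v in zip(cols, vals)]

# & across columns: keep only index labels present in every selection
common = reduce(lambda a, b: a.intersection(b), [s.index for s in selections])
and_rows = df.loc[common]

# | across columns: stack the selections, then drop repeated index labels
stacked = pd.concat(selections)
or_rows = stacked[~stacked.index.duplicated()]
```

Deduplicating on the index rather than on row values keeps distinct rows that happen to hold identical data.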

