Pandas DataFrame: write values to a new column based on a check of an existing column's values

Question

4

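I want to add a column to a pd.DataFrame whose values are determined by checking an existing column.

I want to check the values against a dictionary. Say I have the following dictionary:

{"<=4":[0,4], "(4,10]":[4,10], ">10":[10,inf]}

Now I want to check a column of the DataFrame, and if a value in that column falls into any of the intervals in the dictionary, write the matching dictionary key into a second column of the same DataFrame.

So a DataFrame that holds only col_1 will become: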

     col_1   col_2
  a    3     "<=4"
  b    15    ">10"
  c    8     "(4,10]"
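
For anyone who wants to reproduce this, a minimal sketch of the setup might look like the following. The variable names `df` and `ranges` are assumptions chosen for illustration, and `np.inf` stands in for the bare `inf` used in the dictionary above.

```python
import numpy as np
import pandas as pd

# Input frame from the question: one numeric column, indexed a/b/c.
df = pd.DataFrame({"col_1": [3, 15, 8]}, index=["a", "b", "c"])

# Label -> [lower, upper] bounds, as given in the question.
ranges = {"<=4": [0, 4], "(4,10]": [4, 10], ">10": [10, np.inf]}

print(df)
```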

- farnold

I hope the following helps. - Colonel Beauvel

4 Answers

1

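You can use this approach:

import pandas as pd

# Lookup table: one row per label, column 0 = lower bound, column 1 = upper bound
dico = pd.DataFrame({"<=4":[0,4], "(4,10]":[4,10], ">10":[10,float('inf')]}).transpose()

# Return the label whose half-open range [lower, upper) contains x
foo = lambda x: dico.index[(dico[1]>x) & (dico[0]<=x)][0]

df['col_1'].map(foo)

#0       <=4
#1       >10
#2    (4,10]
#Name: col_1, dtype: object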

- Colonel Beauvel

1

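This solution creates a function named extract_str and applies it to col_1. It uses a conditional list comprehension to iterate over the dictionary's keys and values, checking whether the value is greater than or equal to the lower bound and less than the upper bound. It then checks the resulting list to make sure it does not contain more than one match. If the list contains a value, that value is returned; otherwise the function returns None by default.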

from numpy import inf

d = {"<=4": [0, 4], "(4,10]": [4, 10], ">10": [10, inf]}

def extract_str(val):
    results = [key for key, value_range in d.items()
               if value_range[0] <= val < value_range[1]]
    if len(results) > 1:
        raise ValueError('Multiple ranges satisfied.')
    if results:
        return results[0]

df['col_2'] = df.col_1.apply(extract_str)

>>> df
   col_1   col_2
a      3     <=4
b     15     >10
c      8  (4,10]
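
As a sketch of how the half-open check above behaves at the edges, the snippet below re-declares the same dictionary and function in self-contained form (using `dict.items()`, and unpacking the bounds into the illustrative names `low` and `high`). It is only meant to show the boundary cases, not to replace the answer's code.

```python
import numpy as np

d = {"<=4": [0, 4], "(4,10]": [4, 10], ">10": [10, np.inf]}

def extract_str(val):
    # Each range is tested as [low, high): lower bound included, upper excluded.
    results = [key for key, (low, high) in d.items() if low <= val < high]
    if len(results) > 1:
        raise ValueError("Multiple ranges satisfied.")
    return results[0] if results else None

print(extract_str(3))   # <=4
print(extract_str(4))   # (4,10]  (4 is excluded from the first range)
print(extract_str(-1))  # None    (no range matches)
```

Because the upper bound is exclusive here, a value of exactly 4 is labelled "(4,10]" even though the label text suggests it belongs to "<=4"; swapping the comparison to `low < val <= high` would match the labels instead.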

On this small DataFrame, this solution is much faster than the one provided by @ColonelBeauvel.

%timeit df['col_2'] = df.col_1.apply(extract_str)
1000 loops, best of 3: 220 µs per loop

%timeit df['col_2'] = df['col_1'].map(foo)
1000 loops, best of 3: 1.46 ms per loop

- Alexander

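Thanks for your answer! I find @Nader Hisham's answer a more elegant solution to the original question. However, your answer helped me a lot with another problem, namely comparing DataFrame columns against dict(-like) objects! - farnold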

0

You can map with a function, as in the example below. I hope it helps.

import pandas as pd
from numpy import inf

d = {'col_1':[3,15,8]}
test = pd.DataFrame(d,index=['a','b','c'])
newdict = {"<=4":[0,4], "(4,10]":[4,10], ">10":[10,inf]}

def mapDict(num):
    # 0 sits on the lower edge of the first range, so handle it up front
    if num == 0:
        return "<=4"
    # Each range is treated as (lower, upper]
    for key, (lower, upper) in newdict.items():
        if lower < num <= upper:
            return key

test['col_2']=test.col_1.map(mapDict)

test then becomes:

  col_1 col_2
a   3   <=4
b   15  >10
c   8   (4,10]

P.S. I would like to know how to write code quickly on Stack Overflow. Can anyone share the trick?

- lai_bluejay

- Nader Hisham · Accepted Answer

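pd.cut() converts a continuous variable into a categorical one. Here the bins argument is [0 , 4 , 10 , np.inf], which defines three categories: (0 , 4], (4 , 10] and (10 , inf]. Bins are closed on the right by default, so any value greater than 0 and up to 4 is assigned to (0 , 4], any value greater than 4 and up to 10 is assigned to (4 , 10], and so on.

The categories are then named in the same order through the labels parameter. With the three categories above, we simply pass ['<=4' , '(4,10]' , '>10'] as labels, so (0 , 4] is named <=4, (4 , 10] is named (4,10], and so on.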

In [83]:
import numpy as np
import pandas as pd
df['col_2'] = pd.cut(df.col_1 , [0 , 4 , 10 , np.inf] , labels = ['<=4' , '(4,10]' , '>10'] )
df
Out[83]:
   col_1    col_2
0   3       <=4
1   15      >10
2   8       (4,10]
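
To make the bin edges concrete, here is a small self-contained sketch. The extra rows (4 and 0) and the column name `col_2_incl` are assumptions added for illustration; they are not part of the original question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col_1": [3, 15, 8, 4, 0]})
bins = [0, 4, 10, np.inf]
labels = ["<=4", "(4,10]", ">10"]

# Default: right-closed bins, so 4 lands in "<=4" and 0 falls outside every bin (NaN).
df["col_2"] = pd.cut(df["col_1"], bins=bins, labels=labels)

# include_lowest=True also closes the first bin on the left, so 0 is labelled "<=4".
df["col_2_incl"] = pd.cut(df["col_1"], bins=bins, labels=labels, include_lowest=True)

print(df)
```

If the column can contain 0 (or negative numbers), it may be worth either passing include_lowest=True or starting the bins at -np.inf, depending on how those values should be labelled.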