Assigning values to a subset of rows in a Pandas DataFrame

Question

I would like to assign values in a Pandas DataFrame based on a condition on the index.

import numpy as np
import pandas as pd

class test():
    def __init__(self):
        self.l = 1396633637830123000
        self.dfa = pd.DataFrame(np.arange(20).reshape(10,2), columns = ['A', 'B'], index = np.arange(self.l,self.l+10))
        self.dfb = pd.DataFrame([[self.l+1,self.l+3], [self.l+6,self.l+9]], columns = ['beg', 'end'])

    def update(self):
        self.dfa['true'] = False
        self.dfa['idx'] = np.nan
        for i, beg, end in zip(self.dfb.index, self.dfb['beg'], self.dfb['end']):
            self.dfa.ix[beg:end]['true'] = True
            self.dfa.ix[beg:end]['idx'] = i

    def do(self):
        self.update()
        print(self.dfa)

t = test()
t.do()

Result:

                      A   B   true  idx
1396633637830123000   0   1  False  NaN
1396633637830123001   2   3   True  NaN
1396633637830123002   4   5   True  NaN
1396633637830123003   6   7   True  NaN
1396633637830123004   8   9  False  NaN
1396633637830123005  10  11  False  NaN
1396633637830123006  12  13   True  NaN
1396633637830123007  14  15   True  NaN
1396633637830123008  16  17   True  NaN
1396633637830123009  18  19   True  NaN

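The true column is assigned correctly, while the idx column is not. Furthermore, this seems to depend on how the columns are initialized, because if I do: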

    def update(self):
        self.dfa['true'] = False
        self.dfa['idx'] = False

then the true column is not assigned correctly either.

What am I doing wrong?

P.S. The expected result is:

                      A   B   true  idx
1396633637830123000   0   1  False  NaN
1396633637830123001   2   3   True  0
1396633637830123002   4   5   True  0
1396633637830123003   6   7   True  0
1396633637830123004   8   9  False  NaN
1396633637830123005  10  11  False  NaN
1396633637830123006  12  13   True  1
1396633637830123007  14  15   True  1
1396633637830123008  16  17   True  1
1396633637830123009  18  19   True  1

Edit: I tried assigning with loc and iloc, but that does not seem to work either:

self.dfa.loc[beg:end]['true'] = True
self.dfa.loc[beg:end]['idx'] = i

iloc:

self.dfa.loc[self.dfa.index.get_loc(beg):self.dfa.index.get_loc(end)]['true'] = True
self.dfa.loc[self.dfa.index.get_loc(beg):self.dfa.index.get_loc(end)]['idx'] = i

- Fra

You are using chained indexing, see http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy, which will not work on a frame with multiple dtypes. Try df.loc[row_indexer,col_indexer] = value instead. - Jeff

Yes, I have seen that, but I don't know how to solve it. If dfb holds index label values, how do I get row_indexer and col_indexer? Found it: self.dfa.index.get_loc(beg) - Fra

Also, if I use pd.set_option('mode.chained_assignment','warn'), I don't get any warning. - Fra

1 Answer

- Jeff · Accepted Answer

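You are chain indexing; see the pandas documentation on views versus copies (linked in the comment above). The warning is not guaranteed to appear.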
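To make the difference concrete, here is a minimal sketch that writes the chained form out as the two separate steps it consists of, next to the single-step form. The explicit `.copy()` and the names `subset`, `chained_hits` and `direct_hits` are assumptions added for illustration only; without the explicit copy, whether the intermediate object is a view or a copy depends on the frame and the pandas version, which is exactly why the chained form is unreliable.

```python
import numpy as np
import pandas as pd

l = 1396633637830123000
dfa = pd.DataFrame(np.arange(20).reshape(10, 2), columns=['A', 'B'],
                   index=np.arange(l, l + 10))
dfa['true'] = False

# Chained form, spelled out as two steps.
# .copy() is used here only to make the "intermediate is a copy" case explicit.
subset = dfa.loc[l + 1:l + 3].copy()   # step 1: select rows into a separate object
subset['true'] = True                  # step 2: assign on that object only
chained_hits = int(dfa['true'].sum())  # dfa itself has not been modified

# Single-step form: one .loc call with both the row and the column indexer.
dfa.loc[l + 1:l + 3, 'true'] = True
direct_hits = int(dfa['true'].sum())

print(chained_hits, direct_hits)
```

With the single-step form, pandas receives the row slice and the column label in one call, so it always writes into `dfa` itself.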

You can do what you want directly; there is no need to actually track the index in b.

In [44]: dfa = pd.DataFrame(np.arange(20).reshape(10,2), columns = ['A', 'B'], index = np.arange(l,l+10))

In [45]: dfb = pd.DataFrame([[l+1,l+3], [l+6,l+9]], columns = ['beg', 'end'])

In [46]: dfa['in_b'] = False

In [47]: for i, s in dfb.iterrows():
   ....:     dfa.loc[s['beg']:s['end'],'in_b'] = True
   ....:

If you have a non-integer dtype, you can use a boolean mask instead:

In [36]: for i, s in dfb.iterrows():
   ....:     dfa.loc[(dfa.index>=s['beg']) & (dfa.index<=s['end']),'in_b'] = True
   ....:


In [48]: dfa
Out[48]: 
                      A   B  in_b
1396633637830123000   0   1  False
1396633637830123001   2   3  True
1396633637830123002   4   5  True
1396633637830123003   6   7  True
1396633637830123004   8   9  False
1396633637830123005  10  11  False
1396633637830123006  12  13  True
1396633637830123007  14  15  True
1396633637830123008  16  17  True
1396633637830123009  18  19  True

[10 rows x 3 columns]
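The session above only fills a single boolean column. As a rough sketch of how the same single-step `.loc` assignment could produce both columns from the question (`true` and `idx`), assuming the same data as in the question:

```python
import numpy as np
import pandas as pd

l = 1396633637830123000
dfa = pd.DataFrame(np.arange(20).reshape(10, 2), columns=['A', 'B'],
                   index=np.arange(l, l + 10))
dfb = pd.DataFrame([[l + 1, l + 3], [l + 6, l + 9]], columns=['beg', 'end'])

dfa['true'] = False
dfa['idx'] = np.nan

for i, beg, end in zip(dfb.index, dfb['beg'], dfb['end']):
    # Row slice and column label go into the same .loc call (no chaining).
    # Label slices with .loc include both end points.
    dfa.loc[beg:end, 'true'] = True
    dfa.loc[beg:end, 'idx'] = i

print(dfa)
```

The `idx` column stays a float column because it starts out as NaN, so the row numbers of dfb show up as 0.0 and 1.0.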

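If b is huge, this might not be very performant. As an aside, these look like nanosecond timestamps. They can be made friendlier by converting them.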

In [49]: pd.to_datetime(dfa.index)
Out[49]: 
<class 'pandas.tseries.index.DatetimeIndex'>
[2014-04-04 17:47:17.830123, ..., 2014-04-04 17:47:17.830123009]
Length: 10, Freq: None, Timezone: None
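On the performance remark for a large b: one possible way to avoid the Python-level loop is to look up all index labels at once with `np.searchsorted`. This is only a sketch, not part of the original answer, and it assumes that the intervals in dfb are sorted by `beg` and do not overlap. The names `pos`, `safe_pos` and `inside` are made up for the example. Everything stays in int64, so the large nanosecond values are not rounded through floats.

```python
import numpy as np
import pandas as pd

l = 1396633637830123000
dfa = pd.DataFrame(np.arange(20).reshape(10, 2), columns=['A', 'B'],
                   index=np.arange(l, l + 10))
dfb = pd.DataFrame([[l + 1, l + 3], [l + 6, l + 9]], columns=['beg', 'end'])

# Assumption: intervals are sorted by 'beg' and do not overlap.
beg = dfb['beg'].to_numpy()
end = dfb['end'].to_numpy()
labels = dfa.index.to_numpy()

# For each label: position of the last interval that starts at or before it
# (-1 if the label lies before the first interval).
pos = np.searchsorted(beg, labels, side='right') - 1
safe_pos = np.clip(pos, 0, None)   # avoid indexing with -1
inside = (pos >= 0) & (labels <= end[safe_pos])

dfa['true'] = inside
# Row label of the matching interval in dfb, NaN where no interval matches.
dfa['idx'] = np.where(inside, dfb.index.to_numpy()[safe_pos], np.nan)

print(dfa)
```

If the intervals may overlap or are unsorted, this lookup no longer applies as written and the loop from the answer is the safer choice.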