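New to pandas. A (trivial) problem: hosts, operations, execution times. I want to group by host, then by host + operation, and compute the standard deviation of the execution time per host, and then per host + operation pair. Seems simple?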
Grouping by a single column works:
df
Out[360]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 132564 entries, 0 to 132563
Data columns (total 9 columns):
datespecial 132564 non-null values
host 132564 non-null values
idnum 132564 non-null values
operation 132564 non-null values
time 132564 non-null values
...
dtypes: float32(1), int64(2), object(6)
byhost = df.groupby('host')
byhost.std()
Out[362]:
datespecial idnum time
host
ahost1.test 11946.961952 40367.033852 0.003699
host1.test 15484.975077 38206.578115 0.008800
host10.test NaN 37644.137631 0.018001
...
So far so good. Now:
byhostandop = df.groupby(['host', 'operation'])
byhostandop.std()
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-364-2c2566b866c4> in <module>()
----> 1 byhostandop.std()
/home/username/anaconda/lib/python2.7/site-packages/pandas/core/groupby.pyc in std(self, ddof)
386 # todo, implement at cython level?
387 if ddof == 1:
--> 388 return self._cython_agg_general('std')
389 else:
390 f = lambda x: x.std(ddof=ddof)
/home/username/anaconda/lib/python2.7/site-packages/pandas/core/groupby.pyc in _cython_agg_general(self, how, numeric_only)
1615
1616 def _cython_agg_general(self, how, numeric_only=True):
-> 1617 new_blocks = self._cython_agg_blocks(how, numeric_only=numeric_only)
1618 return self._wrap_agged_blocks(new_blocks)
1619
/home/username/anaconda/lib/python2.7/site-packages/pandas/core/groupby.pyc in _cython_agg_blocks(self, how, numeric_only)
1653 values = com.ensure_float(values)
1654
-> 1655 result, _ = self.grouper.aggregate(values, how, axis=agg_axis)
1656
1657 # see if we can cast the block back to the original dtype
/home/username/anaconda/lib/python2.7/site-packages/pandas/core/groupby.pyc in aggregate(self, values, how, axis)
838 if is_numeric:
839 result = lib.row_bool_subset(result,
--> 840 (counts > 0).view(np.uint8))
841 else:
842 result = lib.row_bool_subset_object(result,
/home/username/anaconda/lib/python2.7/site-packages/pandas/lib.so in pandas.lib.row_bool_subset (pandas/lib.c:16540)()
ValueError: Buffer dtype mismatch, expected 'float64_t' but got 'float'
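Huh? Why does this exception occur?
More questions:
How do I compute the standard deviation on dataframe.groupby([several columns])?
How do I restrict the computation to selected columns? For example, computing the standard deviation of dates/timestamps obviously makes no sense here.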
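Does byhostandop['time'].apply(lambda x:x.std()) work? Just curious. - Roman Pekar
Regarding the df.astype reply: do you mean I should cast the column type explicitly? This works: df['time'] = df['time'].astype('float64'); byhostandop = df.groupby(['host', 'operation']); byhostandop['time'].std(). But I'm not sure whether that is the idiomatic way to do it in pandas, or whether I should do something else to get a correctly typed (float64) column for the standard deviation calculation. - LetMeSOThat4U
With np.float32 data, the error ValueError: Buffer dtype mismatch, expected 'float64_t' but got 'float' still occurs. A temporary workaround is: grouped = df.astype(np.float64).groupby(...) (assuming all the data are floats). - eldad-a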
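Pulling the comments together, here is a minimal sketch of the workaround, not a definitive fix. It assumes the float32(1) column in the dtypes summary is time; the toy rows and the operation names ("read", "write") are made up for illustration. It casts only that column to float64, then selects the column before aggregating, so dates and ids never enter the calculation. Newer pandas releases may not need the cast at all.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real frame; values are made up.
df = pd.DataFrame({
    "host": ["host1.test", "host1.test", "host1.test",
             "host10.test", "host10.test"],
    "operation": ["read", "read", "write", "read", "read"],
    "idnum": [1, 2, 3, 4, 5],
    # Assumption: 'time' is the float32 column from the dtypes summary.
    "time": np.array([0.25, 0.75, 0.5, 1.0, 2.0], dtype=np.float32),
})

# Cast only the float32 column to float64 (the workaround from the comments),
# leaving the other columns untouched.
df["time"] = df["time"].astype(np.float64)

# Select the column of interest before aggregating, so the standard
# deviation is computed for 'time' only.
std_by_host = df.groupby("host")["time"].std()
std_by_host_op = df.groupby(["host", "operation"])["time"].std()

# Several columns can be selected by passing a list instead.
std_selected = df.groupby(["host", "operation"])[["idnum", "time"]].std()

print(std_by_host_op)
```

Groups with a single row come out as NaN, because the default sample standard deviation (ddof=1) is undefined for one observation.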