Remove NaN from a Pandas DataFrame by group

3

I have a DataFrame with NaN values in certain columns. It looks like the following (the actual DataFrame is much larger than what I show here):

    source  battery   Temperature  time                      Distance
0   83512   98.0         NaN       2019-10-26T00:00:06.494Z   NaN
1   83512   NaN          23.0      2019-10-26T00:00:06.538Z   NaN
2   83512   NaN          NaN       2019-10-26T00:00:06.577Z   21.0
3   83512   98.0         NaN       2019-10-26T00:30:06.702Z   NaN
4   83512   NaN          23.0      2019-10-26T00:30:06.743Z   NaN
5   83512   NaN          NaN       2019-10-26T00:30:06.781Z   21.0
6   83512   98.0         NaN       2019-10-26T01:00:08.955Z   NaN
7   83512   NaN          23.0      2019-10-26T01:00:08.998Z   NaN
8   83512   NaN          NaN       2019-10-26T01:00:09.039Z   21.0

I am looking for a way to shrink the frame so that it looks more like this:
    source  battery   Temperature  time                      Distance
0   83512   98.0         23.0      2019-10-26T00:00:06.494Z  21.0     
1   83512   98.0         23.0      2019-10-26T00:30:06.702Z  21.0
2   83512   98.0         23.0      2019-10-26T01:00:08.955Z  21.0

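In other words, I am trying to remove the NaN values from the battery, Temperature, and Distance columns: when the time readings are nearly identical (for example, time = 2019-10-26T00:00:06.494Z, 2019-10-26T00:00:06.538Z, 2019-10-26T00:00:06.577Z), I want to collect all the corresponding values (source, battery, Temperature, time, and Distance) into a single row. This is what I have so far.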
import json
import pandas as pd
import requests

URL = 'https://xxxxx.com'
req = requests.get(URL, auth=('xxx', 'xxx'))
text_data = req.text
json_dict = json.loads(text_data)
# pd.json_normalize replaces the deprecated pandas.io.json.json_normalize (pandas >= 1.0)
df = pd.json_normalize(json_dict['measurements'])
df = df.rename(columns={'source.id': 'source', 'battery.percent.value': 'battery', 'c8y_TemperatureMeasurement.T.value': 'Temperature Or T', 'c8y_DistanceMeasurement.distance.value': 'Distance'})
cols_to_keep = ['source', 'battery', 'Temperature Or T', 'time', 'Distance']
df_final = df[cols_to_keep]
# this line doesn't give me the expected output
df1 = df_final.apply(lambda x: pd.Series(x.dropna().values))
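
For reference, here is a minimal sketch of what that last line appears to do, using a small made-up frame (the `sample` data is hypothetical and stands in for one burst of readings, not the real API response). Each column is compacted on its own, so the non-null values are pushed to the top and the result keeps as many rows as its longest column. That is likely why columns with no NaN, such as source and time, stop the frame from shrinking.

```python
import pandas as pd

# Hypothetical three-row sample standing in for one burst of readings.
sample = pd.DataFrame({
    "source": [83512, 83512, 83512],
    "battery": [98.0, None, None],
    "Temperature": [None, 23.0, None],
    "Distance": [None, None, 21.0],
})

# Each column drops its own NaNs and is re-indexed from 0, independently
# of the other columns; pandas then aligns the columns on the union index.
compacted = sample.apply(lambda x: pd.Series(x.dropna().values))
print(compacted)
```

Only the first row ends up fully populated; the remaining rows still hold NaN in the measurement columns.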
1 Answer

2
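You can build a custom grouper that checks the difference between consecutive values in the time column and starts a new group whenever the gap exceeds a threshold (10 minutes here), then use it with first to keep the first valid (non-null) value of each column per group.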
g = pd.to_datetime(df['time']).diff().gt(pd.Timedelta(10, 'min')).cumsum()
df.groupby(g).first()

      source  battery  Temperature                      time  Distance
time                                                                  
0      83512     98.0         23.0  2019-10-26T00:00:06.494Z      21.0
1      83512     98.0         23.0  2019-10-26T00:30:06.702Z      21.0
2      83512     98.0         23.0  2019-10-26T01:00:08.955Z      21.0
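
To try this without the API call, here is a self-contained sketch that rebuilds the sample table from the question and applies the same grouping. The names `group_id` and `result` are introduced here for illustration, and the rename and `reset_index(drop=True)` steps are optional additions, not part of the answer above.

```python
import pandas as pd

# Sample data copied from the table in the question.
df = pd.DataFrame({
    "source": [83512] * 9,
    "battery": [98.0, None, None] * 3,
    "Temperature": [None, 23.0, None] * 3,
    "time": [
        "2019-10-26T00:00:06.494Z", "2019-10-26T00:00:06.538Z", "2019-10-26T00:00:06.577Z",
        "2019-10-26T00:30:06.702Z", "2019-10-26T00:30:06.743Z", "2019-10-26T00:30:06.781Z",
        "2019-10-26T01:00:08.955Z", "2019-10-26T01:00:08.998Z", "2019-10-26T01:00:09.039Z",
    ],
    "Distance": [None, None, 21.0] * 3,
})

# A new group starts whenever the gap to the previous reading exceeds 10 minutes.
# The first diff is NaT, which compares as False, so the first row falls in group 0.
gaps = pd.to_datetime(df["time"]).diff()
group_id = gaps.gt(pd.Timedelta(10, "min")).cumsum().rename("group")

# first() returns the first non-null value of each column within each group.
result = df.groupby(group_id).first().reset_index(drop=True)
print(result)
```

Renaming the grouper avoids an index that shares its name with the time column. The approach assumes the rows are already sorted by time; if they are not, sorting with `df.sort_values("time")` first would be a reasonable precaution.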
