How to select all non-NaN columns and the last non-NaN column using pandas?

Question

7

Forgive me if the title is a bit confusing.

Suppose I have a test.h5 file. Below is the result of reading this file with pd.read_hdf('test.h5', 'testdata'):

     0     1     2     3     4     5    6
0   123   444   111   321   NaN   NaN  NaN
1   12    234   113   67    21    32   900
3   212   112   543   321   45    NaN  NaN

I want to select the last non-NaN column, i.e. the last non-NaN value in each row. My expected result looks like this:

0   321
1   900
2   45

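I also want to select all columns except the last non-NaN column. My expected result would look something like this. It could be a NumPy array, but I have not found any solution yet.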

      0     1     2     3     4     5    6
0    123   444   111   
1    12    234   113   67    21    32  
3    212   112   543   321

I searched online and found df.iloc[:, :-1] for reading all columns except the last one, and df.iloc[:, -1] for reading the last column.

The results I get with these two commands are as follows:

1. Reading all columns except the last one

       0     1     2     3     4     5    
0     123   444   111   321   NaN   NaN  
1     12    234   113   67    21    32   
3     212   112   543   321   45    NaN

2. Reading the last column

0   NaN
1   900
2   NaN

My question is: is there any command or query in pandas that handles these conditions?

Thanks for any help and suggestions.

- Fang

5 Answers

6

Part 2

Here is one vectorized approach using masking for the second task, selecting all columns except the last non-NaN column:

idx = df.notnull().cumsum(1).idxmax(1).values.astype(int)
df_out = df.mask(idx[:,None] <= np.arange(df.shape[1]))

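Here is a sample run on a modified, more generic version of the sample dataframe, in which the third row has two NaN islands and the second row has a NaN island near the start: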

In [181]: df
Out[181]: 
     0      1      2    3     4     5      6
0  123  444.0  111.0  321   NaN   NaN    NaN
1   12    NaN    NaN   67  21.0  32.0  900.0
3  212    NaN    NaN  321  45.0   NaN    NaN

In [182]: idx = df.notnull().cumsum(1).idxmax(1).values.astype(int)

In [183]: df.mask(idx[:,None] <= np.arange(df.shape[1]))
Out[183]: 
     0      1      2      3     4     5   6
0  123  444.0  111.0    NaN   NaN   NaN NaN
1   12    NaN    NaN   67.0  21.0  32.0 NaN
3  212    NaN    NaN  321.0   NaN   NaN NaN
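Why this works: along each row, the running count of non-null cells stops growing after the last non-NaN entry, and idxmax/argmax reports the first position at which the maximum is reached, which is that last entry. Below is a minimal positional sketch of the same idea, assuming the modified sample frame above. It swaps idxmax for argmax on a NumPy array so that it does not rely on the column labels being 0..n-1; the names running, last_pos and masked are illustrative only.

```python
import numpy as np
import pandas as pd

nan = np.nan
# Modified sample frame from the run above (assumed data)
df = pd.DataFrame(
    [[123, 444, 111, 321, nan, nan, nan],
     [12, nan, nan, 67, 21, 32, 900],
     [212, nan, nan, 321, 45, nan, nan]],
    index=[0, 1, 3],
)

# Running count of non-null cells along each row
running = df.notna().to_numpy().cumsum(axis=1)

# argmax returns the FIRST position of the row maximum,
# i.e. the position of the last non-NaN cell
last_pos = running.argmax(axis=1)

# Blank out the last non-NaN cell and everything to its right
masked = df.mask(np.arange(df.shape[1]) >= last_pos[:, None])
print(last_pos)
print(masked)
```

Note that a row consisting only of NaN gets position 0 here, so it would need separate handling if such rows can occur.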

Part 1

Coming back to the first task, simply use NumPy's advanced indexing:

In [192]: df.values[np.arange(len(idx)), idx]
Out[192]: array([ 321.,  900.,   45.])
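For readers unfamiliar with advanced indexing: passing two integer arrays of equal length pairs them up element by element, so one cell is picked per row. The small sketch below illustrates this on a plain array; the values of idx are written out by hand here as an assumption, matching the positions computed above.

```python
import numpy as np

nan = np.nan
arr = np.array([
    [123, 444, 111, 321, nan, nan, nan],
    [12, nan, nan, 67, 21, 32, 900],
    [212, nan, nan, 321, 45, nan, nan],
])
idx = np.array([3, 6, 4])  # assumed: position of the last non-NaN cell per row

# Paired indexing: picks cells (0, 3), (1, 6), (2, 4)
picked = arr[np.arange(len(idx)), idx]

# By contrast, a slice plus an index array takes every listed column
# for every row, giving a 3 x 3 block rather than one value per row
block = arr[:, idx]

print(picked)
print(block.shape)
```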

- Divakar

You can use notnull instead of negating isnull. - Zero

5

Option 1

df.stack().groupby(level=0).last()

0    321.0
1    900.0
3     45.0
dtype: float64

Option 2
Using apply with pd.Series.last_valid_index

# Thanks to Bharath shetty for the suggestion
df.apply(lambda x : x[x.last_valid_index()], 1)
# Old Answer
# df.apply(pd.Series.last_valid_index, 1).pipe(lambda x: df.lookup(x.index, x))

array([ 321.,  900.,   45.])
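One thing to watch with last_valid_index is that it returns None for a row containing only NaN, and indexing with None then fails. The following is a hedged sketch that guards against this case; the helper name last_valid and the extra all-NaN row (index 4) are assumptions added for illustration, not part of the original data.

```python
import numpy as np
import pandas as pd

nan = np.nan
df = pd.DataFrame(
    [[123, 444, 111, 321, nan, nan, nan],
     [12, 234, 113, 67, 21, 32, 900],
     [212, 112, 543, 321, 45, nan, nan],
     [nan, nan, nan, nan, nan, nan, nan]],  # hypothetical all-NaN row
    index=[0, 1, 3, 4],
)

def last_valid(row):
    """Return the last non-NaN value of a row, or NaN if there is none."""
    label = row.last_valid_index()  # None when the whole row is NaN
    return np.nan if label is None else row.loc[label]

result = df.apply(last_valid, axis=1)
print(result)
```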

Option 3
Getting creative with np.where and a dictionary comprehension

pd.Series({df.index[i]: df.iat[i, j] for i, j in zip(*np.where(df.notnull()))})

0    321.0
1    900.0
3     45.0
dtype: float64
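The comprehension relies on two facts: np.where reports matching positions in row-major order (left to right within each row), and a dictionary keeps only the last value assigned to a key. So for each row label the last non-NaN cell wins. Here is the same logic unrolled into a loop as a sketch, assuming the sample frame from the question; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

nan = np.nan
df = pd.DataFrame(
    [[123, 444, 111, 321, nan, nan, nan],
     [12, 234, 113, 67, 21, 32, 900],
     [212, 112, 543, 321, 45, nan, nan]],
    index=[0, 1, 3],
)

# Positions of all non-NaN cells, ordered row by row, left to right
rows, cols = np.where(df.notna().to_numpy())

last = {}
for i, j in zip(rows, cols):
    # Later (further right) cells overwrite earlier ones for the same row
    last[df.index[i]] = df.iat[i, j]

result = pd.Series(last)
print(result)
```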

Option 4
Using pd.DataFrame.ffill

df.ffill(axis=1).iloc[:, -1]

0    321.0
1    900.0
3     45.0
Name: 6, dtype: float64
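The reasoning behind the forward-fill approach: filling along each row carries the last seen value rightward into the trailing NaN cells, so the final column ends up holding each row's last non-NaN value. A small self-contained sketch, assuming the sample frame from the question, with axis passed by keyword because recent pandas versions no longer accept it positionally:

```python
import numpy as np
import pandas as pd

nan = np.nan
df = pd.DataFrame(
    [[123, 444, 111, 321, nan, nan, nan],
     [12, 234, 113, 67, 21, 32, 900],
     [212, 112, 543, 321, 45, nan, nan]],
    index=[0, 1, 3],
)

# Propagate values left to right within each row
filled = df.ffill(axis=1)

# The rightmost column now carries the last non-NaN value of every row
last = filled.iloc[:, -1]
print(last)
```

This builds a full filled copy of the frame, so for very wide data one of the index-based options may be cheaper.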

A trick for solving the last part

df.stack().groupby(level=0, group_keys=False).apply(lambda x: x.head(-1)).unstack()

       0      1      2      3     4     5
0  123.0  444.0  111.0    NaN   NaN   NaN
1   12.0  234.0  113.0   67.0  21.0  32.0
3  212.0  112.0  543.0  321.0   NaN   NaN
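This trick depends on stack dropping the NaN cells, so that the last element of each group is the last non-NaN value. Newer pandas versions are changing stack so that it may keep NaN, so the sketch below drops them explicitly. It also swaps the apply/head(-1) step for a reverse cumcount, which marks the last element of each row with 0. This is an alternative formulation under those assumptions, not the answer's exact method.

```python
import numpy as np
import pandas as pd

nan = np.nan
df = pd.DataFrame(
    [[123, 444, 111, 321, nan, nan, nan],
     [12, 234, 113, 67, 21, 32, 900],
     [212, 112, 543, 321, 45, nan, nan]],
    index=[0, 1, 3],
)

# Long format with one entry per non-NaN cell, indexed by (row, column)
s = df.stack().dropna()

# Count down within each row: the last non-NaN cell gets 0
from_end = s.groupby(level=0).cumcount(ascending=False)

# Keep everything except that last cell, then go back to wide format
all_but_last = s[from_end > 0].unstack()
print(all_but_last)
```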

- piRSquared

1

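For Option 2, just use df.apply(lambda x : x[x.last_valid_index()],1). - Bharath M Shetty

Why did you remove it? - piRSquared

Because I thought it would not answer the second part. - Bharath M Shetty

Yes, but it was still interesting. Anyway, I am happy to delete it if you change your mind. - piRSquared

No sir, I wanted to see the list of options from you. Let it stay there. - Bharath M Shetty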

4

Use notnull + iloc + idxmax to first get the column names of the last non-NaN values, and then lookup:

a = df.notnull().iloc[:,::-1].idxmax(1)
print (a)
0    3
1    6
3    4
dtype: object

print (pd.Series(df.lookup(df.index, a)))
0    321.0
1    900.0
2     45.0
dtype: float64

Then replace these values with NaN:

arr = df.values
arr[np.arange(len(df.index)),a] = np.nan
print (pd.DataFrame(arr, index=df.index, columns=df.columns))
       0      1      2      3     4     5   6
0  123.0  444.0  111.0    NaN   NaN   NaN NaN
1   12.0  234.0  113.0   67.0  21.0  32.0 NaN
3  212.0  112.0  543.0  321.0   NaN   NaN NaN
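DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0, so on current versions the lookup step needs a substitute. One possible sketch, assuming the sample frame from the question: translate the column labels to positions with Index.get_indexer and use NumPy indexing for both the lookup and the replacement. The names rows, cols and blanked are illustrative.

```python
import numpy as np
import pandas as pd

nan = np.nan
df = pd.DataFrame(
    [[123, 444, 111, 321, nan, nan, nan],
     [12, 234, 113, 67, 21, 32, 900],
     [212, 112, 543, 321, 45, nan, nan]],
    index=[0, 1, 3],
)

# Label of the last non-NaN column per row: reverse the columns,
# then take the first True
last_labels = df.notna().iloc[:, ::-1].idxmax(axis=1)

# Convert labels to integer positions so NumPy can index with them
rows = np.arange(len(df))
cols = df.columns.get_indexer(last_labels)

# Stand-in for df.lookup(df.index, last_labels)
values = pd.Series(df.to_numpy()[rows, cols], index=df.index)

# Work on a float copy so the original frame is left untouched
arr = df.to_numpy(dtype=float, copy=True)
arr[rows, cols] = np.nan
blanked = pd.DataFrame(arr, index=df.index, columns=df.columns)
print(values)
print(blanked)
```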

- jezrael

0

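For anyone looking for an answer to this particular problem, I ended up using the answer provided by Bharath Shetty. To make it easier to find later, I adapted that answer; here is my code: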

import glob
import numpy as np
import pandas as pd

# Assuming you have several CSV files whose rows/columns differ in length
# and you want to create an h5 file from those CSV files
data_one = [np.loadtxt(file) for file in glob.glob(yourpath + "folder_one/*.csv")]
data_two = [np.loadtxt(file) for file in glob.glob(yourpath + "folder_two/*.csv")] 

df1 = pd.DataFrame(data_one)
df2 = pd.DataFrame(data_two)

combine = pd.concat([df1, df2], ignore_index=True)  # DataFrame.append was removed in pandas 2.0
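# result_type='expand' keeps the result a DataFrame on pandas >= 0.23
combine_sort = combine.apply(lambda x : sorted(x, key=pd.notnull), axis=1, result_type='expand')
# Save the sorted frame, so that the last column holds each row's last non-NaN value
combine_sort.to_hdf('test.h5', key='testdata')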

Reading

dataframe = pd.read_hdf('test.h5', 'testdata')
dataset = dataframe.values

q1 = dataset[:, :-1] # all columns except the last one
q2 = dataset[:, -1] # the last column

- Fang

- Bharath M Shetty · Accepted Answer

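You can use sorted to satisfy your conditions, i.e.

# result_type='expand' keeps the result a DataFrame on pandas >= 0.23
ndf = df.apply(lambda x : sorted(x,key=pd.notnull), axis=1, result_type='expand')

This gives

     0      1      2      3      4      5      6
0   NaN    NaN    NaN  123.0  444.0  111.0  321.0
1  12.0  234.0  113.0   67.0   21.0   32.0  900.0
3   NaN    NaN  212.0  112.0  543.0  321.0   45.0

Now you can select the last column, i.e.

ndf.iloc[:,-1]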

0    321.0
1    900.0
3     45.0
Name: 6, dtype: float64

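And for all columns except the last non-NaN one, sort the remaining columns so that the NaN values move back to the end:

ndf.iloc[:,:-1].apply(lambda x : sorted(x,key=pd.isnull), axis=1, result_type='expand')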

      0      1      2      3     4     5
0  123.0  444.0  111.0    NaN   NaN   NaN
1   12.0  234.0  113.0   67.0  21.0  32.0
3  212.0  112.0  543.0  321.0   NaN   NaN
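The idea in this answer is to push the NaN values to one side of each row, using a stable sort so that the remaining values keep their order. For larger frames the same idea can be expressed in NumPy without a row-wise apply. The following is a sketch of that alternative, assuming the sample frame from the question and all-numeric data; it is a swapped-in technique (stable argsort on the NaN mask), not the answer's own code.

```python
import numpy as np
import pandas as pd

nan = np.nan
df = pd.DataFrame(
    [[123, 444, 111, 321, nan, nan, nan],
     [12, 234, 113, 67, 21, 32, 900],
     [212, 112, 543, 321, 45, nan, nan]],
    index=[0, 1, 3],
)
values = df.to_numpy(dtype=float)

# Stable sort on "is not NaN": NaN cells (False) move to the left,
# real values keep their relative order on the right
order = np.argsort(~np.isnan(values), axis=1, kind="stable")
right = np.take_along_axis(values, order, axis=1)

last = right[:, -1]   # last non-NaN value of each row
rest = right[:, :-1]  # everything else, NaN first

# Stable sort on "is NaN" moves the NaN cells back to the right
order_back = np.argsort(np.isnan(rest), axis=1, kind="stable")
rest_left = np.take_along_axis(rest, order_back, axis=1)

print(last)
print(rest_left)
```

Like the sorted-based version, this shifts values into different column positions, so it suits cases where only the order within a row matters.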

In the tables above, NaN denotes a missing value.