Splitting a pandas DataFrame with sklearn's KFold

Question

25

I have used the following code to get the indices of the training and test sets.

import pandas
from sklearn.model_selection import KFold

df = pandas.read_pickle(filepath + filename)
kf = KFold(n_splits=n_splits, shuffle=shuffle, random_state=randomState)

result = next(kf.split(df), None)

#train can be accessed with result[0]
#test can be accessed with result[1]

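I would like to know whether there is a faster way to separate the DataFrame into two DataFrames, one for each set, using the row indices I retrieved.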

- Mervyn Lee

3 Answers

0

If you want a simple one-liner, you can use a list comprehension.

train, test = [df.iloc[ind] for ind in next(kf.split(df))]

However, if you only want to split a DataFrame into two parts, train_test_split may be a simpler choice, because it is essentially a wrapper around next(ShuffleSplit().split(df)).
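To make the relationship concrete, here is a minimal sketch that builds the same two-way split both ways. The DataFrame and its column names are made up for illustration, and the comparison assumes that train_test_split delegates to ShuffleSplit as described above, so with the same test_size and random_state both routes should select the same rows.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import ShuffleSplit, train_test_split

# Illustrative data; the column names are made up for this sketch.
df = pd.DataFrame(np.arange(30).reshape(10, 3), columns=list("ABC"))

# Two-way split through the convenience function.
train_a, test_a = train_test_split(df, test_size=0.3, random_state=0)

# The same split spelled out with ShuffleSplit and positional indexing.
splitter = ShuffleSplit(n_splits=1, test_size=0.3, random_state=0)
train_idx, test_idx = next(splitter.split(df))
train_b, test_b = df.iloc[train_idx], df.iloc[test_idx]

# Compare the row labels chosen by each route.
same_rows = (
    train_a.index.equals(train_b.index)
    and test_a.index.equals(test_b.index)
)
print(same_rows)
```

The explicit form is mostly useful when you later want several random splits (n_splits greater than 1); for a single split the one-call version is easier to read.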

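If you want to recover all of the KFold splits (perhaps to pass them to another model), a loop can be useful. Here, in each iteration, one fold serves as the validation set.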

kf = KFold(n_splits=5, shuffle=True)

for i, (t_ind, v_ind) in enumerate(kf.split(df)):
    
    train = df.iloc[t_ind]     # train set
    valid = df.iloc[v_ind]     # validation set
    
    # my_model is a placeholder for your own fit/evaluate function
    result = my_model(train, valid)
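As a runnable variant of the loop, the sketch below replaces the placeholder with a small function that fits a LinearRegression on the training rows and scores it on the validation rows. The synthetic data, the column names x1, x2 and y, and the helper fit_and_score are all assumptions made for illustration; they are not part of the original answer.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

# Synthetic data; 'x1', 'x2' and 'y' are hypothetical column names.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(50, 2)), columns=["x1", "x2"])
df["y"] = 2.0 * df["x1"] - df["x2"] + rng.normal(scale=0.1, size=50)


def fit_and_score(train, valid):
    """Hypothetical stand-in for my_model: fit on train, return R^2 on valid."""
    model = LinearRegression().fit(train[["x1", "x2"]], train["y"])
    return model.score(valid[["x1", "x2"]], valid["y"])


kf = KFold(n_splits=5, shuffle=True, random_state=0)

# One score per fold; each fold is the validation set exactly once.
scores = [
    fit_and_score(df.iloc[t_ind], df.iloc[v_ind])
    for t_ind, v_ind in kf.split(df)
]
print(len(scores), np.mean(scores))
```

Collecting the per-fold results in a list keeps them available for averaging or inspection after the loop finishes.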

Another use case for looping over the split generator is to create a new column that records which fold each row belongs to.

import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

np.random.seed(100)
df = pd.DataFrame(np.random.randint(4,10, size=(7,3)), columns=list('ABC'))
kf = KFold(n_splits=4, shuffle=True, random_state=0)

for i, (_, v_ind) in enumerate(kf.split(df)):
    df.loc[df.index[v_ind], 'kfold'] = f"fold{i+1}"
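A possible variation, sketched under the same setup, stores integer fold numbers instead of strings: the fold ids are collected by position in a NumPy array and attached in a single assignment. The name fold_ids is introduced here for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

np.random.seed(100)
df = pd.DataFrame(np.random.randint(4, 10, size=(7, 3)), columns=list("ABC"))
kf = KFold(n_splits=4, shuffle=True, random_state=0)

# Collect fold numbers by position first, then attach them in one assignment.
fold_ids = np.zeros(len(df), dtype=int)
for i, (_, v_ind) in enumerate(kf.split(df)):
    fold_ids[v_ind] = i + 1
df["kfold"] = fold_ids

# Each row belongs to exactly one validation fold; 7 rows over 4 folds
# gives three folds of 2 rows and one fold of 1 row.
fold_sizes = df["kfold"].value_counts().sort_index()
print(fold_sizes.tolist())

# Rows for a given fold can then be selected with a boolean mask.
valid_1 = df[df["kfold"] == 1]
train_1 = df[df["kfold"] != 1]
```

With the fold stored as a column, the training and validation parts for any fold can be recovered later without re-running the splitter.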

- cottontail

0

My answer is not related to the question title, but if you just want training and test sets, you can use train_test_split from sklearn.model_selection.

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

np.random.seed(77)
df = pd.DataFrame(np.random.random((10,3)), columns=('one', 'two', 'three'))
print(df)
        one       two     three
0  0.919109  0.642196  0.753712
1  0.139315  0.087320  0.788002
2  0.326151  0.541068  0.240235
3  0.545423  0.400555  0.715192
4  0.836680  0.588481  0.296155
5  0.281018  0.705597  0.422596
6  0.057316  0.747027  0.452313
7  0.175775  0.049377  0.292475
8  0.066799  0.751156  0.063772
9  0.431908  0.364172  0.151972

df_train, df_test = train_test_split(df, test_size=0.3, random_state=77)
print(df_train)
print(df_test)
        one       two     three
6  0.057316  0.747027  0.452313
0  0.919109  0.642196  0.753712
5  0.281018  0.705597  0.422596
3  0.545423  0.400555  0.715192
8  0.066799  0.751156  0.063772
4  0.836680  0.588481  0.296155
7  0.175775  0.049377  0.292475
        one       two     three
2  0.326151  0.541068  0.240235
1  0.139315  0.087320  0.788002
9  0.431908  0.364172  0.151972
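When features and a target live in separate objects, train_test_split also accepts several arrays at once and splits them consistently. The sketch below treats the column three as the target purely for illustration; that choice is an assumption, not something stated in the answer.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

np.random.seed(77)
df = pd.DataFrame(np.random.random((10, 3)), columns=("one", "two", "three"))

# Assumption for illustration: 'three' is the target, the rest are features.
X = df[["one", "two"]]
y = df["three"]

# Passing both objects returns four pieces, split on the same rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=77
)

# The feature and target pieces stay aligned row for row.
print(X_train.index.equals(y_train.index), X_test.index.equals(y_test.index))
```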

- x3mEr

- jezrael · Accepted Answer

You need DataFrame.iloc to select rows by position:

Example:

import numpy as np
import pandas as pd

np.random.seed(100)
df = pd.DataFrame(np.random.random((10,5)), columns=list('ABCDE'))
df.index = df.index * 10
print (df)
           A         B         C         D         E
0   0.543405  0.278369  0.424518  0.844776  0.004719
10  0.121569  0.670749  0.825853  0.136707  0.575093
20  0.891322  0.209202  0.185328  0.108377  0.219697
30  0.978624  0.811683  0.171941  0.816225  0.274074
40  0.431704  0.940030  0.817649  0.336112  0.175410
50  0.372832  0.005689  0.252426  0.795663  0.015255
60  0.598843  0.603805  0.105148  0.381943  0.036476
70  0.890412  0.980921  0.059942  0.890546  0.576901
80  0.742480  0.630184  0.581842  0.020439  0.210027
90  0.544685  0.769115  0.250695  0.285896  0.852395

from sklearn.model_selection import KFold

#added some parameters
kf = KFold(n_splits = 5, shuffle = True, random_state = 2)
result = next(kf.split(df), None)
print (result)
(array([0, 2, 3, 5, 6, 7, 8, 9]), array([1, 4]))

train = df.iloc[result[0]]
test =  df.iloc[result[1]]

print (train)
           A         B         C         D         E
0   0.543405  0.278369  0.424518  0.844776  0.004719
20  0.891322  0.209202  0.185328  0.108377  0.219697
30  0.978624  0.811683  0.171941  0.816225  0.274074
50  0.372832  0.005689  0.252426  0.795663  0.015255
60  0.598843  0.603805  0.105148  0.381943  0.036476
70  0.890412  0.980921  0.059942  0.890546  0.576901
80  0.742480  0.630184  0.581842  0.020439  0.210027
90  0.544685  0.769115  0.250695  0.285896  0.852395

print (test)
           A         B         C         D         E
10  0.121569  0.670749  0.825853  0.136707  0.575093
40  0.431704  0.940030  0.817649  0.336112  0.175410
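The reason iloc is needed here is that KFold.split yields positions (0 to n-1), not index labels. The following sketch uses a small made-up DataFrame whose index is multiplied by 10, as in the example above, to show how positional and label-based lookup differ; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(20).reshape(10, 2), columns=list("AB"))
df.index = df.index * 10          # labels are 0, 10, ..., 90

test_pos = np.array([1, 4])       # positions, as KFold.split would yield

# Positional lookup picks the 2nd and 5th rows, whatever their labels are.
by_position = df.iloc[test_pos]
print(by_position.index.tolist())   # labels of the selected rows

# Label lookup searches the index for the labels 1 and 4, which do not exist.
try:
    df.loc[test_pos]
    loc_failed = False
except KeyError:
    loc_failed = True
print(loc_failed)
```

With a default RangeIndex the two lookups happen to agree, which is why the difference only shows up once the index has been filtered, reordered, or relabeled.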