Random Forest with imbalanced classes (positive is the minority class), low precision, and weird score distributions

Question


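I have a very imbalanced dataset (5,000 positives, 300,000 negatives). I am using sklearn's RandomForestClassifier to try to predict the probability of the positive class. I have several years of data, and one of the features I engineered is the class in the previous year, so I hold out the last year of the dataset for testing, in addition to a test set drawn from the years I train on.

I have tried the following (with results):

Upsampling with SMOTE and SMOTEENN (weird score distributions, see the first picture: the predicted probabilities for the positive and negative classes are the same, i.e. the model predicts a very low probability for most of the positive class)

Downsampling the data so that it is balanced (recall is about 0.80 on the test set, but only 0.07 on the out-of-year test set because of the sheer number of negatives in that unbalanced set, see the second picture)

Leaving it unbalanced (weird score distribution again, precision goes up to about 0.60, and recall drops to 0.05 and 0.10 on the test set and the out-of-year test set)

Trying XGBoost (slightly better recall on the out-of-year test set, 0.11)

What should I try next? I would like to optimize for F1, since false positives and false negatives are equally bad in my case. I would like to incorporate k-fold cross-validation, and I have read that the cross-validation split should be made before upsampling. a) How do I do this, and is it likely to help? b) How can I incorporate it into a pipeline similar to this: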

from imblearn.pipeline import make_pipeline, Pipeline

clf_rf = RandomForestClassifier(n_estimators=25, random_state=1)
smote_enn = SMOTEENN(smote = sm)
kf = StratifiedKFold(n_splits=5)

pipeline = make_pipeline(??)

pipeline.fit(X_train, ytrain)
ypred = pipeline.predict(Xtest)
ypredooy = pipeline.predict(Xtestooy)

[image: score_dist]

- Man

1 Answer


- Jordan Delbar · Accepted Answer

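Upsampling with SMOTE and SMOTEENN: I am far from an expert on these, but by upsampling your dataset you may be amplifying existing noise, which induces overfitting. That could explain why your algorithm cannot classify correctly and gives the results in the first graph.

I found a little more information here, and maybe a way to improve your results: https://sci2s.ugr.es/sites/default/files/ficherosPublicaciones/1773_ver14_ASOC_SMOTE_FRPS.pdf

When you downsample, you seem to run into the same overfitting problem, as far as I understand (at least for the target result of the previous year). It is hard to deduce the reason without a view of the data, though.
Your overfitting problem may come from the number of features you use, which could add unnecessary noise. You could try reducing the number of features and then increasing it gradually (using an RFE model). More information here: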

https://machinelearningmastery.com/feature-selection-in-python-with-scikit-learn/
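
As a rough illustration of the RFE idea above, here is a minimal sketch on synthetic data. The dataset from `make_classification`, the choice of `LogisticRegression` as the ranking estimator, and the target of 5 features are all assumptions for the example, not values from the question.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (assumption): 20 features, 5 informative, ~5% positives.
X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           n_redundant=2, weights=[0.95, 0.05], random_state=0)

# Recursively drop the weakest feature (one per step) until 5 remain.
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5, step=1)
selector.fit(X, y)

# Keep only the selected columns.
X_reduced = selector.transform(X)
print(selector.support_)  # boolean mask of the features that were kept
print(X_reduced.shape)
```

To grow the feature set gradually, you could rerun this with larger values of `n_features_to_select` and compare the cross-validated F1 at each size.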

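As for the models you used, you mention Random Forest and XGBoost but not whether you tried simpler models. You could try simpler models and focus on the data engineering side. If you have not tried it yet, maybe you could: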

Downsample your data
Normalize all your data with a StandardScaler

Test "brute force" tuning of simple models such as Naive Bayes and Logistic Regression
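
The downsampling step in the list above could look something like this. It is only a sketch: `downsample_majority` is a hypothetical helper, the toy arrays stand in for the real data, and it assumes the labels are 0 (negative, majority) and 1 (positive, minority).

```python
import numpy as np

def downsample_majority(X, y, random_state=0):
    """Randomly drop negatives (label 0) until both classes have equal counts."""
    rng = np.random.default_rng(random_state)
    pos_idx = np.flatnonzero(y == 1)
    neg_idx = np.flatnonzero(y == 0)
    # Sample as many negatives as there are positives, without replacement.
    keep_neg = rng.choice(neg_idx, size=len(pos_idx), replace=False)
    keep = np.sort(np.concatenate([pos_idx, keep_neg]))
    return X[keep], y[keep]

# Toy data (assumption) with a class ratio similar to the question, scaled down.
rng = np.random.default_rng(42)
y = np.array([1] * 50 + [0] * 3000)
X = rng.normal(size=(len(y), 4))

X_bal, y_bal = downsample_majority(X, y)
print(np.bincount(y_bal))  # [50 50]
```

Apply this to the training data only, and keep the test sets at their natural class ratio so the precision and recall you report reflect what you would see in practice.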

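from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Define the steps of the pipeline; liblinear supports both the l1 and l2 penalties
steps = [('scaler', StandardScaler()),
         ('log_reg', LogisticRegression(solver='liblinear'))]

pipeline = Pipeline(steps)

# Specify the hyperparameters; the 'log_reg__' prefix routes them to that pipeline step
parameters = {'log_reg__C': [1, 10, 100],
              'log_reg__penalty': ['l1', 'l2']}

# Create train and test sets (X and y are your features and labels)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33,
                                                    random_state=42)

# Instantiate a GridSearchCV object: cv
cv = GridSearchCV(pipeline, param_grid=parameters)

# Fit to the training set
cv.fit(X_train, y_train)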

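Anyway, for your example the pipeline could be (I built it with logistic regression, but you can swap in another ML algorithm and change the parameter grid accordingly):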

from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

from imblearn.combine import SMOTEENN
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

param_grid = {'C': [1, 10, 100]}

clf = LogisticRegression(solver='lbfgs')
sme = SMOTEENN(smote=SMOTE(k_neighbors=2), random_state=42)
# Inner grid search over C, selecting by F1
grid = GridSearchCV(estimator=clf, param_grid=param_grid, scoring="f1")

# imblearn's Pipeline applies the sampler during fit only,
# so the held-out part of each fold is never resampled
pipeline = Pipeline([('scale', StandardScaler()),
                     ('SMOTEENN', sme),
                     ('grid', grid)])

# random_state is only accepted together with shuffle=True
cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=42)
score = cross_val_score(pipeline, X, y, cv=cv)
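
To make explicit what the imblearn `Pipeline` does with the sampler, here is a hand-rolled sketch of the same idea using scikit-learn only: split first, resample just the training part of each fold, and score F1 on the untouched validation part. The synthetic data and the `oversample_minority` helper (plain random duplication, a much cruder stand-in for SMOTEENN) are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data (assumption): roughly 5% positives.
X, y = make_classification(n_samples=3000, n_features=10,
                           weights=[0.95, 0.05], random_state=0)

def oversample_minority(X, y, rng):
    """Duplicate random positives until the classes are balanced (hypothetical helper)."""
    pos_idx = np.flatnonzero(y == 1)
    neg_idx = np.flatnonzero(y == 0)
    extra = rng.choice(pos_idx, size=len(neg_idx) - len(pos_idx), replace=True)
    idx = np.concatenate([neg_idx, pos_idx, extra])
    return X[idx], y[idx]

rng = np.random.default_rng(42)
cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=42)
fold_f1 = []
for train_idx, val_idx in cv.split(X, y):
    # Fit the scaler on the training fold only, then resample the training fold only.
    scaler = StandardScaler().fit(X[train_idx])
    X_tr, y_tr = oversample_minority(scaler.transform(X[train_idx]), y[train_idx], rng)
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    # The validation fold keeps its natural class ratio.
    y_pred = model.predict(scaler.transform(X[val_idx]))
    fold_f1.append(f1_score(y[val_idx], y_pred))

print(np.round(fold_f1, 3), np.mean(fold_f1))
```

If you resample before splitting instead, copies (or synthetic neighbours) of the same positive samples end up on both sides of the split, and the cross-validated F1 comes out too optimistic.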

I hope this helps.

(Edit: I added scoring="f1" to GridSearchCV.)