Isolation Forest parameter tuning with GridSearchCV

Question

9

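I have multivariate time series data and want to detect anomalies with the Isolation Forest algorithm. I would like to get the best parameters from GridSearchCV; my GridSearchCV code is shown below.

The following snippet loads the input dataset.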

import pandas as pd

df = pd.read_csv("train.csv")
df.drop(['dataTimestamp', 'Anomaly'], inplace=True, axis=1)
X_train = df
y_train = df1[['Anomaly']]  # the Anomaly column holds the labelled data

Define the parameters for the Isolation Forest.

clf = IsolationForest(random_state=47, behaviour='new', score="accuracy")
param_grid = {'n_estimators': list(range(100, 800, 5)),
              'max_samples': list(range(100, 500, 5)),
              'contamination': [0.1, 0.2, 0.3, 0.4, 0.5],
              'max_features': [5, 10, 15],
              'bootstrap': [True, False],
              'n_jobs': [5, 10, 20, 30]}

f1sc = make_scorer(f1_score)
grid_dt_estimator = model_selection.GridSearchCV(clf, param_grid, scoring=f1sc,
                                                 refit=True, cv=10,
                                                 return_train_score=True)
grid_dt_estimator.fit(X_train, y_train)

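After calling fit, I get the following error:

ValueError: Target is multiclass but average='binary'. Please choose another average setting.

Can someone explain what is going on here? I tried average='weight', but it still did not work. Am I doing something wrong? Please let me know how to get the F-score.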

- Anantha

3 Answers

2

Update make_scorer with this code to get it working.

make_scorer(f1_score, average='micro')

- Gayathri Manohar
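Editor's sketch, not part of the answer above: one common way this error arises is that IsolationForest.predict returns -1 for outliers and 1 for inliers, so if the ground truth is coded 0/1, the scorer sees three distinct values and treats the target as multiclass. The toy labels below are hypothetical. Setting average='micro' makes the call run, but it then scores three "classes"; recoding the truth to -1/1 and passing pos_label=-1 keeps the problem binary.

```python
from sklearn.metrics import f1_score

# Hypothetical toy labels: ground truth coded 0/1, predictions coded -1/1
# (the convention used by IsolationForest.predict).
y_true = [0, 0, 0, 1, 1]
y_pred = [1, 1, -1, 1, -1]

try:
    f1_score(y_true, y_pred)  # default average='binary'
    binary_error = None
except ValueError as exc:
    binary_error = str(exc)  # the "Target is multiclass ..." message

# average='micro' runs, but it is computed over the values -1, 0 and 1.
micro_f1 = f1_score(y_true, y_pred, average="micro")

# Recode the truth so that 1 (anomaly) becomes -1 and 0 (normal) becomes 1.
y_true_recoded = [-1 if label == 1 else 1 for label in y_true]
anomaly_f1 = f1_score(y_true_recoded, y_pred, pos_label=-1)

print(binary_error)
print(micro_f1, anomaly_f1)
```

Whether this applies to the asker's data depends on how the Anomaly column is coded, which the question does not show.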

Can you help me with this? I tried your solution, but it did not work. My data has no labels. - taga

1

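Not all of the parameters you are tuning actually need to be tuned.
For example:
contamination is the proportion of outliers; you can determine the best value after fitting the model by adjusting a threshold on model.score_samples.

n_jobs is the number of CPU cores to use; it affects running time, not the fitted model.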

- Joey Gao
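Editor's sketch of the thresholding idea described above, assuming labelled data is available for evaluation. The synthetic data, the candidate fractions, and the variable names are all hypothetical. The model is fitted once, and candidate cut-offs are then compared on the output of score_samples instead of refitting for every contamination value.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
# Hypothetical data: 200 inliers around the origin, 10 shifted outliers.
X = np.vstack([rng.normal(0, 1, size=(200, 2)),
               rng.normal(6, 1, size=(10, 2))])
y = np.array([1] * 200 + [-1] * 10)  # -1 marks an anomaly

model = IsolationForest(n_estimators=100, random_state=47).fit(X)
scores = model.score_samples(X)  # lower score = more anomalous

# Try several candidate outlier fractions without refitting the model.
results = {}
for fraction in (0.01, 0.05, 0.10, 0.20):
    threshold = np.quantile(scores, fraction)
    y_pred = np.where(scores <= threshold, -1, 1)
    results[fraction] = f1_score(y, y_pred, pos_label=-1)

best_fraction = max(results, key=results.get)
print(results, best_fraction)
```

Without labels, F1 cannot be computed; in that case the threshold would have to be chosen by inspecting the score distribution or from domain knowledge.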


- Luca Massaron · Accepted Answer

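You get this error because you did not set the average parameter when converting f1_score into a scorer. As the documentation explains:

average : string, [None, 'binary' (default), 'micro', 'macro', 'samples', 'weighted'] This parameter is required for multiclass/multilabel targets. If None, the scores for each class are returned.

The consequence is that the scorer returns multiple scores, one for each class in your classification problem, instead of a single measure. The solution is to set the average parameter of f1_score to one of its possible values, depending on your needs. I have therefore reworked the example code you provided to offer a possible solution to your problem: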

from sklearn.ensemble import IsolationForest
from sklearn.metrics import make_scorer, f1_score
from sklearn import model_selection
from sklearn.datasets import make_classification

# Synthetic stand-in for the real data (20 features by default)
X_train, y_train = make_classification(n_samples=500, 
                                       n_classes=2)

# behaviour='new' is only accepted by older scikit-learn releases;
# drop the argument if your version rejects it
clf = IsolationForest(random_state=47, behaviour='new')

param_grid = {'n_estimators': list(range(100, 800, 5)), 
              'max_samples': list(range(100, 500, 5)), 
              'contamination': [0.1, 0.2, 0.3, 0.4, 0.5], 
              'max_features': [5,10,15], 
              'bootstrap': [True, False], 
              'n_jobs': [5, 10, 20, 30]}

# Pass the metric function itself; extra keyword arguments such as
# average are forwarded to f1_score when the scorer is called
f1sc = make_scorer(f1_score, average='micro')

grid_dt_estimator = model_selection.GridSearchCV(clf, 
                                                 param_grid,
                                                 scoring=f1sc, 
                                                 refit=True,
                                                 cv=10, 
                                                 return_train_score=True)
grid_dt_estimator.fit(X_train, y_train)
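Editor's sketch, offered as one possible variant rather than the answerer's method: if the labels are coded the way IsolationForest predicts (-1 for anomalies, 1 for normal points), a binary F1 on the anomaly class can be used directly via pos_label=-1. The synthetic data, the deliberately small grid, and the stratified, shuffled 3-fold split are all assumptions chosen so that the example finishes quickly and every fold contains some anomalies.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import make_scorer, f1_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold

rng = np.random.default_rng(47)
# Hypothetical data: 285 normal points plus 15 uniformly scattered outliers.
X = np.vstack([rng.normal(0, 1, size=(285, 5)),
               rng.uniform(-6, 6, size=(15, 5))])
y = np.array([1] * 285 + [-1] * 15)  # same coding as IsolationForest.predict

# Binary F1 computed on the anomaly class (-1).
scorer = make_scorer(f1_score, pos_label=-1)

# Small illustrative grid; n_jobs is left out because it does not
# change the fitted model.
param_grid = {"n_estimators": [50, 100],
              "max_samples": [64, 128],
              "contamination": [0.05, 0.1]}

# Stratify so that each test fold contains some anomalies.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

search = GridSearchCV(IsolationForest(random_state=47), param_grid,
                      scoring=scorer, cv=cv, refit=True)
search.fit(X, y)  # y is ignored by IsolationForest.fit, used only for scoring

best_params, best_f1 = search.best_params_, search.best_score_
print(best_params, best_f1)
```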

Even after changing the labels to -1 and 1, I still get the same error. Counter({-1: 250, 1: 250})