One-Class SVM algorithm is taking too long to run

Question

machine-learning scikit-learn svm anomaly-detection

The data below shows part of the dataset I am using to detect outliers.

    describe_file   data_numbers    index
0   gkivdotqvj      7309.0          0
1   hpwgzodlky      2731.0          1
2   dgaecubawx      0.0             2
3   NaN             0.0             3
4   lnpeyxsrrc      0.0             4

I used the One-Class SVM algorithm to detect the anomalies.

import numpy as np
from scipy import stats
from pyod.models.ocsvm import OCSVM
random_state = np.random.RandomState(42)     
outliers_fraction = 0.05
classifiers = {
        'One Classify SVM (SVM)':OCSVM(kernel='rbf', degree=3, gamma='auto', coef0=0.0, tol=0.001, nu=0.5, shrinking=True, cache_size=200, verbose=False, max_iter=-1, contamination=outliers_fraction)
}

X = data['data_numbers'].values.reshape(-1,1)   

for i, (clf_name, clf) in enumerate(classifiers.items()):
    clf.fit(X)
    # predict raw anomaly score
    scores_pred = clf.decision_function(X) * -1

    # prediction of a datapoint category outlier or inlier
    y_pred = clf.predict(X)
    n_inliers = len(y_pred) - np.count_nonzero(y_pred)
    n_outliers = np.count_nonzero(y_pred == 1)

    # explicit copy of the dataframe, so adding columns does not touch a view of `data`
    dfx = data[['index', 'data_numbers']].copy()
    dfx['outlier'] = y_pred.tolist()
    IX1 =  np.array(dfx['data_numbers'][dfx['outlier'] == 0]).reshape(-1,1)
    OX1 =  dfx['data_numbers'][dfx['outlier'] == 1].values.reshape(-1,1)         
    print('OUTLIERS : ',n_outliers,'INLIERS : ',n_inliers, clf_name)    
    # threshold value to consider a datapoint inlier or outlier
    threshold = stats.scoreatpercentile(scores_pred,100 * outliers_fraction) 

tOut = stats.scoreatpercentile(dfx[dfx['outlier'] == 1]['data_numbers'], np.abs(threshold))

y = dfx['outlier'].values.reshape(-1,1)
def severity_validation():
    tOUT10 = tOut+(tOut*0.10)    
    tOUT23 = tOut+(tOut*0.23)
    tOUT45 = tOut+(tOut*0.45)
    dfx['test_severity'] = "None"
    for i, row in dfx.iterrows():
        if row['outlier']==1:
            # write with .loc: chained indexing (dfx[col][i] = ...) may assign to a temporary copy
            if row['data_numbers'] <=tOUT10:
                dfx.loc[i, 'test_severity'] = "Low Severity"
            elif row['data_numbers'] <=tOUT23:
                dfx.loc[i, 'test_severity'] = "Medium Severity"
            elif row['data_numbers'] <=tOUT45:
                dfx.loc[i, 'test_severity'] = "High Severity"
            else:
                dfx.loc[i, 'test_severity'] = "Ultra High Severity"

severity_validation()

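from shutil import rmtree
from tempfile import mkdtemp

import matplotlib.pyplot as plt
import sklearn as skl
import sklearn.metrics
from joblib import Memory
from sklearn import preprocessing, svm
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline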
X_train, X_test, Y_train, Y_test = train_test_split(dfx[['index','data_numbers']], dfx.outlier, test_size=0.25, 
                                                    stratify=dfx.outlier, random_state=30)

#Instantiate Classifier
normer = preprocessing.Normalizer()
svm1 = svm.SVC(probability=True, class_weight={1: 10})

cached = mkdtemp()
memory = Memory(location=cached, verbose=3)  # `cachedir` was renamed to `location` in newer joblib releases
pipe_1 = Pipeline(steps=[('normalization', normer), ('svm', svm1)], memory=memory)

cv = skl.model_selection.KFold(n_splits=5, shuffle=True, random_state=42)

param_grid = [ {"svm__kernel": ["linear"], "svm__C": [0.5]}, {"svm__kernel": ["rbf"], "svm__C": [0.5], "svm__gamma": [5]} ]
grd = GridSearchCV(pipe_1, param_grid, scoring='roc_auc', cv=cv)

#Training (fit once, while the cache directory still exists, and reuse the fitted search)
grd.fit(X_train, Y_train)
y_pred = grd.predict(X_test)
Y_pred = grd.predict_proba(X_test)[:,1]
rmtree(cached)

#Evaluation
confmatrix = skl.metrics.confusion_matrix(Y_test, y_pred)
print(confmatrix)
def plot_roc(y_test, y_pred):
    fpr, tpr, thresholds = skl.metrics.roc_curve(y_test, y_pred, pos_label=1)
    roc_auc = skl.metrics.auc(fpr, tpr)
    plt.figure()
    lw = 2
    plt.plot(fpr, tpr, color='darkorange', lw=lw, label='ROC curve (area ={0:.2f})'.format(roc_auc))
    plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver operating characteristic example')
    plt.legend(loc="lower right")
    plt.show()
plot_roc(Y_test, Y_pred)

My dataset is very large, with millions of rows. As a result, I can only run the code on a few hundred thousand rows.

The code runs correctly, but it takes far too long. I would appreciate suggestions for optimizing it so that it runs faster.

- E199504

Take a look at EllipticEnvelope or IsolationForest; both are very fast algorithms for anomaly/outlier detection. - Sergey Bushmanov

@Sergey Bushmanov, I will try those two algorithms. As for this question, can you tell me how to improve the code so that it runs faster? - E199504

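I'm not familiar with pyod (does "od" stand for outlier detection?), but sklearn's SVM offers kernels other than rbf. I would start with the linear kernel, check whether it meets your requirements, and only then try more complex kernels. On the algorithm side, I would first try to understand what an outlier means in a one-dimensional distribution (it is one-dimensional, right?). If the data is normally distributed, computing σ and looking at values more than 2-3σ from the mean is enough. Even a single envelope would be sufficient. If it is not normally distributed, you need to work out what counts as an outlier for that type of distribution. - Sergey Bushmanov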
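
The 2-3σ check described in this comment can be sketched in a few lines of NumPy. This is only an illustration under the comment's own assumption that the values are roughly normally distributed; the function name `sigma_outliers` and the synthetic sample are hypothetical and stand in for `data['data_numbers']`.

```python
import numpy as np

def sigma_outliers(values, n_sigma=3.0):
    """Return a boolean mask of values more than n_sigma std devs from the mean.

    NaNs are ignored when estimating the mean and std, and are never flagged.
    """
    values = np.asarray(values, dtype=float)
    mean = np.nanmean(values)
    std = np.nanstd(values)
    if std == 0:
        # All values identical: nothing can be called an outlier.
        return np.zeros(values.shape, dtype=bool)
    with np.errstate(invalid="ignore"):
        return np.abs(values - mean) > n_sigma * std

# Synthetic stand-in for the real column: normal data plus two injected extremes.
rng = np.random.default_rng(42)
sample = np.concatenate([rng.normal(100.0, 10.0, size=100_000), [500.0, -300.0]])
mask = sigma_outliers(sample, n_sigma=3.0)
print("flagged:", int(mask.sum()), "of", mask.size)
```

Because this is a single vectorized pass over the array, it should stay cheap even for millions of rows; whether a fixed σ threshold is appropriate depends on the actual distribution of the data.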

@Sergey Bushmanov, if I go with the "linear" kernel, which lines do I need to change? - E199504

OCSVM(kernel='linear' - Sergey Bushmanov

1 Answer

- Jon Nordby · Accepted Answer

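SVM training time scales as O(n^2) or worse with the number of samples, so it is not suitable for datasets with millions of samples. Some example code exploring this can be found here.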
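
As a rough way to see this scaling on your own machine, the sketch below times a one-class SVM fit at a few doubling sample sizes. It is not the example code referred to above; it uses scikit-learn's `OneClassSVM` on synthetic one-dimensional data, and the helper name `time_ocsvm_fit` and the chosen sizes are assumptions made for illustration.

```python
import time

import numpy as np
from sklearn.svm import OneClassSVM

def time_ocsvm_fit(sizes, seed=0):
    """Fit an RBF one-class SVM on n synthetic 1-D points for each n; return {n: seconds}."""
    rng = np.random.default_rng(seed)
    timings = {}
    for n in sizes:
        X = rng.normal(size=(n, 1))
        model = OneClassSVM(kernel="rbf", gamma="scale", nu=0.5)
        start = time.perf_counter()
        model.fit(X)
        timings[n] = time.perf_counter() - start
    return timings

# Doubling n each step makes super-linear growth easy to spot in the printed times.
timings = time_ocsvm_fit([1_000, 2_000, 4_000])
for n, seconds in timings.items():
    print(f"n={n:>6}  fit time={seconds:.3f}s")
```

Exact numbers depend on hardware and on the data, so treat the output as a trend rather than a benchmark.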

I recommend trying IsolationForest, which is fast and performs well.
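
A minimal sketch of what that could look like with scikit-learn's `IsolationForest`, assuming a single numeric column as in the question. The synthetic `values` array is a hypothetical stand-in for `data['data_numbers']`, and `contamination=0.05` simply mirrors the question's `outliers_fraction`.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic stand-in for the real column: mostly moderate values plus three extremes.
rng = np.random.default_rng(0)
values = np.concatenate(
    [rng.normal(1000.0, 50.0, size=20_000), [50_000.0, 75_000.0, -20_000.0]]
)
X = values.reshape(-1, 1)

# Each tree is built on a small random subsample (max_samples), so training
# cost grows roughly linearly with the number of rows.
iso = IsolationForest(
    n_estimators=100, max_samples=256, contamination=0.05, random_state=42
)
labels = iso.fit_predict(X)          # scikit-learn convention: +1 inlier, -1 outlier
outlier = (labels == -1).astype(int) # pyod-style convention from the question: 1 = outlier
scores = -iso.score_samples(X)       # negated so that higher means more anomalous

print("OUTLIERS :", int(outlier.sum()), "INLIERS :", int((outlier == 0).sum()))
```

Note the label convention differs from pyod's (`1` for outliers, `0` for inliers), which is why the sketch converts the labels before any downstream code that filters on `outlier == 1`.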

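If you do want to use an SVM, subsample the dataset down to 10-100k samples. A linear kernel also trains much faster than RBF, but it still will not scale to a large number of samples.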
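
The subsampling idea could be sketched as follows: fit on a random subset, then score every row in chunks. This uses scikit-learn's `OneClassSVM` rather than pyod's `OCSVM` wrapper to keep the example self-contained; the synthetic data, the subsample size `n_train`, and the chunk count are assumptions, kept small here so the example runs quickly.

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Synthetic stand-in for the full dataset.
rng = np.random.default_rng(1)
values = np.concatenate([rng.normal(1000.0, 50.0, size=50_000), [50_000.0, -20_000.0]])
X = values.reshape(-1, 1)

# Train on a random subsample only; the answer suggests 10-100k rows for real data.
n_train = 5_000
train_idx = rng.choice(len(X), size=n_train, replace=False)
ocsvm = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05)
ocsvm.fit(X[train_idx])

# Prediction cost is linear in the number of rows, so score the full data in chunks
# to keep memory use bounded.
labels = np.concatenate([ocsvm.predict(chunk) for chunk in np.array_split(X, 10)])
outlier = (labels == -1).astype(int)  # 1 = outlier, matching the question's convention

print("OUTLIERS :", int(outlier.sum()), "INLIERS :", int((outlier == 0).sum()))
```

The trade-off is that the decision boundary only reflects the sampled rows, so it is worth checking that results are stable across a few different random subsamples.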