In imbalanced classification problems there are many ways to construct weights. One of the most common is to estimate the sample weights directly from the class counts in the training set. This can be computed easily with sklearn: the 'balanced' mode uses the values of y to adjust the weights automatically, inversely proportional to the class frequencies.
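As a quick standalone sanity check (not part of the pipeline below), 'balanced' mode assigns each sample the weight n_samples / (n_classes * count(class)):

```python
import numpy as np
from sklearn.utils.class_weight import compute_sample_weight

# Class 0 appears 3 times, class 1 once.
y = np.array([0, 0, 0, 1])

w = compute_sample_weight(class_weight='balanced', y=y)
# Each sample gets n_samples / (n_classes * class_count):
# class 0 samples: 4 / (2 * 3) = 0.667..., class 1 sample: 4 / (2 * 1) = 2.0
```

The rarer a class, the larger the weight of its samples, so each class contributes equally to the weighted loss in aggregate.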
In the example below we incorporate the compute_sample_weight function into the fitting of a DNNClassifier. For the label distribution I use the same one as in the question.
import numpy as np
import pandas as pd
import tensorflow as tf
from sklearn.utils.class_weight import compute_sample_weight
train_size = 1000
test_size = 200
columns = 30
y_train = np.random.choice([0,1,2,3], train_size, p=[0.15, 0.35, 0.28, 0.22])
x_train = pd.DataFrame(np.random.uniform(0,1, (train_size,columns)).astype('float32'))
x_train.columns = [str(i) for i in range(columns)]
weight = compute_sample_weight(class_weight='balanced', y=y_train)
x_train['weight'] = weight.astype('float32')
y_test = np.random.choice([0,1,2,3], test_size, p=[0.15, 0.35, 0.28, 0.22])
x_test = pd.DataFrame(np.random.uniform(0,1, (test_size,columns)).astype('float32'))
x_test.columns = [str(i) for i in range(columns)]
x_test['weight'] = np.ones(len(y_test)).astype('float32')
def train_input_fn():
    dataset = tf.data.Dataset.from_tensor_slices((dict(x_train), y_train))
    dataset = dataset.shuffle(1000).repeat().batch(10)
    return dataset

def eval_input_fn():
    dataset = tf.data.Dataset.from_tensor_slices((dict(x_test), y_test))
    return dataset.shuffle(1000).repeat().batch(10)
classifier = tf.estimator.DNNClassifier(
    feature_columns=[tf.feature_column.numeric_column(str(i), shape=[1]) for i in range(columns)],
    weight_column=tf.feature_column.numeric_column('weight'),
    hidden_units=[10],
    n_classes=4,
)
classifier.train(input_fn=train_input_fn, steps=100)
eval_results = classifier.evaluate(input_fn=eval_input_fn, steps=1)
Since our weights are constructed from the target, in the test data we must set them to 1: there the labels are, in principle, unknown.
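For intuition, here is a standalone sketch (the numbers are hypothetical and this is not the Estimator's exact internal computation) of how a weight column acts on the objective: each example's loss is multiplied by its sample weight, so under 'balanced' weights the rare classes pull the average as strongly as the frequent ones. One common normalization is the weighted mean:

```python
import numpy as np

# Hypothetical per-example losses for a batch of three samples.
losses = np.array([0.2, 1.5, 0.9])
# Hypothetical 'balanced' weights: the middle sample belongs to a rare class.
weights = np.array([0.67, 2.0, 0.67])

# Each loss is scaled by its sample weight before averaging.
weighted_loss = np.sum(weights * losses) / np.sum(weights)
# Here the rare, high-loss sample dominates: weighted_loss > losses.mean()
```

Setting all test weights to 1 reduces the weighted mean back to the ordinary mean, which is why the evaluation metrics remain unweighted.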