Keras BERT: high accuracy, validation accuracy, F1, and AUC, but poor predictions

Question

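I trained a text classifier with Google's BERT using tf.keras.

My dataset contains 50,000 rows, evenly distributed across 5 labels. It is a subset of a larger dataset, but I picked these particular labels because they are completely different from one another, to try to avoid confusion during training.

I created the data splits as follows: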

from sklearn.model_selection import train_test_split

train, test = train_test_split(df, test_size=0.30, shuffle=True, stratify=df['label'], random_state=10)
train, val = train_test_split(train, test_size=0.1, shuffle=True, stratify=train['label'], random_state=10)
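
To make the resulting split sizes concrete, here is a small stdlib-only sketch of the arithmetic. It assumes the held-out portion is rounded up when the fraction does not divide evenly; the helper name `split_sizes` is made up for illustration and is not part of sklearn.

```python
import math

def split_sizes(n_rows, held_out_percent):
    """Return (kept, held_out) row counts, rounding the held-out part up (assumption)."""
    held_out = math.ceil(n_rows * held_out_percent / 100)
    return n_rows - held_out, held_out

total_rows = 50_000   # dataset size stated in the question
n_labels = 5          # labels are evenly distributed

train_val, test = split_sizes(total_rows, 30)   # first split: 30% held out for testing
train, val = split_sizes(train_val, 10)         # second split: 10% of the rest for validation

print("train:", train, "val:", val, "test:", test)
print("test rows per label:", test // n_labels)
```

With stratification on an evenly balanced dataset, each label should therefore contribute the same number of rows to the test set.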

The model is built as follows:

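import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Activation, BatchNormalization, Dense, Input
from tensorflow.keras.models import Model
from transformers import TFBertModel

def compile():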
    mirrored_strategy = tf.distribute.MirroredStrategy()
    with mirrored_strategy.scope():
        learn_rate = 4e-5
        bert = 'bert-base-uncased'
        model = TFBertModel.from_pretrained(bert, trainable=False)

        input_ids_layer = Input(shape=(512,), dtype=np.int32)
        input_mask_layer = Input(shape=(512,), dtype=np.int32)

        bert_layer = model([input_ids_layer, input_mask_layer])[0]

        X = tf.keras.layers.GlobalMaxPool1D()(bert_layer)

        output = Dense(5)(X)
        output = BatchNormalization(trainable=False)(output)
        output = Activation('softmax')(output)

        model_ = Model(inputs=[input_ids_layer, input_mask_layer], outputs=output)

        optimizer = tf.keras.optimizers.Adam(4e-5)
        loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
        metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')

        model_.compile(optimizer=optimizer, loss=loss, metrics=[metric])
        return model_

This gives the following results:

loss: 1.2433
accuracy: 0.8024
val_loss: 1.2148
val_accuracy: 0.8300
f1_score: 0.8283
precision: 0.8300
recall: 0.8286
auc: 0.9676
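
One thing worth checking against these numbers: the model ends in a softmax layer, while the loss is created with `from_logits=True`. Assuming the loss then applies its own softmax to the model output (treating probabilities as if they were logits), the loss cannot get close to zero even for a perfectly confident, correct prediction. The sketch below works this out in plain Python for 5 classes; it illustrates the arithmetic only and does not reproduce the Keras internals.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probabilities, true_index):
    """Negative log-likelihood of the true class, written as log(1/p)."""
    return math.log(1.0 / probabilities[true_index])

# A perfectly confident, correct softmax output for class 0 out of 5.
model_output = [1.0, 0.0, 0.0, 0.0, 0.0]

# Loss if the output is used directly as probabilities.
loss_as_probs = cross_entropy(model_output, 0)

# Loss if the same output goes through softmax a second time
# (assumed behaviour when from_logits=True is combined with a softmax layer).
loss_double_softmax = cross_entropy(softmax(model_output), 0)

print(f"treated as probabilities: {loss_as_probs:.4f}")
print(f"softmax applied twice:    {loss_double_softmax:.4f}")
```

Under that assumption the smallest reachable loss is ln(1 + 4/e), roughly 0.905, so a loss above 1.2 next to 80% accuracy is plausible and the loss value itself says little here. This is a separate observation from the fix described in the answer.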

When I run the test data through the model (after model.load_weights()) and convert the one-hot-encoded labels back to their original labels...

test_sample = [test_dataset[0],test_dataset[1], test_dataset[2]]
predictions = tf.argmax(model.predict(test_sample[:2]), axis =1)
preds_inv = le.inverse_transform(predictions)
true_inv = le.inverse_transform(test_sample[2])

...the values in the confusion matrix are all over the place:

confusion_matrix(true_inv, preds_inv)

array([[ 967,  202,    7,  685, 1139],
       [ 474,  785,   27,  717,  997],
       [ 768,  372,   46, 1024,  790],
       [ 463,  426,   27, 1272,  812],
       [ 387,  224,   11,  643, 1735]])
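
To put numbers on that matrix, the sketch below recomputes overall accuracy, per-label recall, and how often each label is predicted. It uses plain Python on the values shown above and assumes rows are true labels and columns are predicted labels, which is the sklearn convention for `confusion_matrix(y_true, y_pred)`.

```python
matrix = [
    [967, 202, 7, 685, 1139],
    [474, 785, 27, 717, 997],
    [768, 372, 46, 1024, 790],
    [463, 426, 27, 1272, 812],
    [387, 224, 11, 643, 1735],
]
n = len(matrix)

total = sum(sum(row) for row in matrix)
correct = sum(matrix[i][i] for i in range(n))          # diagonal = correct predictions
accuracy = correct / total

# Recall per label: correct predictions divided by the number of true samples (row sum).
recall = [matrix[i][i] / sum(matrix[i]) for i in range(n)]

# How many times each label was predicted (column sums).
predicted_counts = [sum(row[j] for row in matrix) for j in range(n)]

print(f"test samples: {total}, accuracy: {accuracy:.4f}")
print("recall per label:", [round(r, 4) for r in recall])
print("predictions per label:", predicted_counts)
```

Test accuracy works out to about 0.32, far below the 0.83 validation accuracy reported during training.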

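Interestingly, the third label is almost never predicted.

Note that I set trainable=False on the BatchNormalization layer, but during training it is set to True.

The input data consists of two arrays: a numeric vector representation of each text string (embeddings) and a padding mask that identifies which of the 512 elements in each string are padding values.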
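
As an illustration of those two arrays, the sketch below pads a short sequence of token IDs to a fixed length and builds the matching mask (1 for real tokens, 0 for padding). The non-special token IDs are placeholders, `build_inputs` is a made-up helper, and a length of 8 stands in for 512 to keep the output readable; in practice the tokenizer that ships with the model would produce both arrays.

```python
PAD_ID = 0    # [PAD] in the bert-base-uncased vocabulary
CLS_ID = 101  # [CLS]
SEP_ID = 102  # [SEP]

def build_inputs(token_ids, max_length):
    """Wrap token IDs in [CLS]/[SEP], then truncate or pad to max_length."""
    body = token_ids[: max_length - 2]        # leave room for the two special tokens
    input_ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(input_ids)     # 1 marks a real token
    padding = max_length - len(input_ids)
    input_ids += [PAD_ID] * padding
    attention_mask += [0] * padding           # 0 marks a padding position
    return input_ids, attention_mask

placeholder_tokens = [7592, 2088, 999]  # placeholder IDs, not taken from the question's data
ids, mask = build_inputs(placeholder_tokens, max_length=8)
print(ids)
print(mask)
```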

When training a deep pretrained model (BERT) on a balanced dataset, what could cause reasonable accuracy scores but terrible predictions?

- ML_Engine

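Did you one-hot encode the labels used to train the model? - desertnaut

Yes. Running inverse_transform works correctly and converts them back to the original labels (for both the true and the predicted labels). - ML_Engine

Interesting problem. This may be a long shot, but how did you create your test set? If it follows a different distribution, the model you trained could behave the way you describe. - Gianluca Micchi

Thanks @GianlucaMicchi. I used sklearn's train_test_split function. The test set is 30% of the total data, stratified so that the proportions of all labels are preserved, with a random state applied. (I have updated my question to add this information.) - ML_Engine

Thanks @igrinis, that's possible and I'll give it a try. But how would I handle this later, when the model needs to be used in production? - ML_Engine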

1 Answer

- ML_Engine · Accepted Answer

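In my case, I solved this by using word clouds to examine the content of the two labels that were causing the confusion. The example below shows the code I used for one of those labels: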

from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Jupyter/IPython magic; drop this line when running as a plain script
%matplotlib inline

# Select the rows for one of the labels being confused
df1 = df[df['label'] == 48000000]

# Join all texts for that label into one string and render the word cloud
text = " ".join(review for review in df1.text)
wordcloud = WordCloud().generate(text)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
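
A text-only variant of the same inspection can help when comparing two labels side by side. The sketch below counts the most frequent words per label and lists the words both labels share; the sample texts, the label names, and the helper `top_words` are invented for illustration and do not come from the dataset discussed here.

```python
from collections import Counter

def top_words(texts, n=10):
    """Return the n most common lowercase words across a list of texts."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return [word for word, _ in counts.most_common(n)]

# Invented sample texts for two hypothetical labels.
samples = {
    "label_a": ["Service contract for road maintenance", "Road repair service"],
    "label_b": ["Service contract for school catering", "Catering service supplier"],
}

top = {label: set(top_words(texts)) for label, texts in samples.items()}
shared = top["label_a"] & top["label_b"]   # words that rank highly for both labels
print(sorted(shared))
```

Words that rank highly for both labels are candidates for a custom stop-word list, since they are unlikely to help tell the labels apart.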

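Now, my understanding is that BERT should be able to work out which words matter for a particular label (using something like a TF-IDF algorithm? I'm not sure). However, when I removed stop words with NLTK, added words that I considered specific to my dataset to the list, such as 'system' and 'service', and retrained the model, accuracy improved significantly.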

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')

# Domain-specific words that appear across labels and add little signal
new_stopwords = ['service', 'contract', 'solution', 'county', 'supplier',
                 'district', 'council', 'borough', 'management',
                 'provider', 'provision',
                 'project', 'contractor']

# Build the stop-word set once instead of on every call
stop_words = set(stopwords.words('english'))
stop_words.update(new_stopwords)

def preprocess_text(sentence):
    # Convert to lowercase, then drop stop words
    words = sentence.lower().split(" ")
    return ' '.join(w for w in words if w not in stop_words)

df['text'] = df['text'].apply(preprocess_text)
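
To check a filter like this without downloading the NLTK corpus, here is a stdlib-only sketch in which a tiny hand-written stop list stands in for `stopwords.words('english')`. The list contents and the sample sentence are made up. It also shows a pitfall worth guarding against when maintaining such lists: Python joins adjacent string literals, so a missing comma silently merges two entries into one.

```python
BASE_STOP_WORDS = {"the", "for", "of", "a", "and"}   # stand-in for NLTK's English list
CUSTOM_STOP_WORDS = {"service", "contract", "provision", "project"}
STOP_WORDS = BASE_STOP_WORDS | CUSTOM_STOP_WORDS

def remove_stop_words(sentence):
    """Lowercase the sentence and drop every word found in STOP_WORDS."""
    return " ".join(w for w in sentence.lower().split() if w not in STOP_WORDS)

cleaned = remove_stop_words("Provision of a cleaning service for the council")
print(cleaned)

# Pitfall: no comma between the last two literals, so they merge into one string.
merged = ["provider", "provision" "project"]
print(merged)
```

A quick `len()` check on a custom stop-word list is a cheap way to catch a merged entry before retraining.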