Keras always predicts the same output

Question

python machine-learning keras prediction


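Keras predicts the same class for every input I give it. There are currently four classes: news, weather, sports, and economy.

The training set consists of many different texts, each labeled with the class that matches its topic. Far more texts are labeled news or sports than weather or economy.

News: 12112 texts
Weather: 1685 texts
Sports: 13669 texts
Economy: 1282 texts

I expected the model to be biased toward sports and news, but instead it is biased entirely toward weather: every input is classified as weather with at least 80% confidence.

To add to my confusion: during training, the classifier reaches accuracy scores of 95% to 100% (really!). I guess I am doing something really stupid here, but I don't know what.

Below is the code that calls the classifier. It runs on Python 3 on a Windows PC.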

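# Imports assumed by this snippet (not shown in the original post)
import json
import numpy
from keras.models import model_from_json
from keras.preprocessing.text import Tokenizer, one_hot

with open('model.json') as json_data:
    model_JSON = json.load(json_data)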

model_JSON = json.dumps(model_JSON) 
model = model_from_json(model_JSON)

model.load_weights('weights.h5')

text = str(text.decode())   
encoded = one_hot(text, max_words, split=" ")

tokenizer = Tokenizer(num_words=max_words)
matrix = tokenizer.sequences_to_matrix([encoded], mode='binary')

result = model.predict(matrix)

legende = ["News", "Wetter", "Sport", "Wirtschaft"]
print(str(legende))
print(str(result))

cat = numpy.argmax(result)  
return str(legende[cat]).encode()

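Here is how I train the classifier. The part that fetches the data from the database is omitted. Training runs on a Linux VM. I have already tried changing the loss and activation functions, but that had no effect. I am also currently trying more epochs, but so far that has not helped either.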

max_words = 10000
batch_size=32
epochs=15

rows = cursor.fetchall()

X = []
Y = []

# Read in the rows
for row in rows:
    X.append(row[5])
    Y.append(row[1])

num_classes = len(set(Y))
Y = one_hot("$".join(Y), num_classes, split="$")


for i in range(len(X)):
    X[i] = one_hot(str(X[i]), max_words, split=" ")

split = round(len(X) * 0.2)     

x_test = np.asarray(X[0:int(split)])
y_test = np.asarray(Y[0:int(split)])

x_train = np.asarray(X[int(split):len(X)])
y_train = np.asarray(Y[int(split):len(X)])

print('x_test shape', x_test.shape)
print('y_test shape', y_test.shape)

print(num_classes, 'classes')

# Vectorize
tokenizer = Tokenizer(num_words=max_words)
x_train = tokenizer.sequences_to_matrix(x_train, mode='binary')
x_test = tokenizer.sequences_to_matrix(x_test, mode='binary')

# Convert class vector to binary class matrix
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)

# Create the model
model = Sequential()

model.add(Dense(512, input_shape=(max_words,)))
model.add(Dropout(0.5))
model.add(Dense(num_classes))
model.add(Activation('softmax'))

model.compile(loss='categorical_crossentropy',
    optimizer='adam',
    metrics=['accuracy'])


history = model.fit(x_train, y_train,
    batch_size=batch_size,
    epochs=epochs,
    verbose=1,
    validation_split=0.1
    )

score = model.evaluate(x_test, y_test,
    batch_size=batch_size, 
    verbose=1
    )

print('Test score', score[0])
print('Test accuracy', score[1])

#write model to json
print("writing model to json")
model_json = model.to_json()
with open("model.json", 'w') as json_file:
    json_file.write(model_json)

#save weights as hdf5
print("saving weights to hdf5")
model.save_weights("weights.h5")

- Junge

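It is quite possible that your model really is predicting sports for everything. Are you sure you are interpreting the order of the four classes correctly? You may have swapped something between y_train and legende. - Daniel Möller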

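That's possible. What confuses me, though, is that the classifier's accuracy is close to 100%. In any case, I'll start by properly normalizing the data. It's worth a try. - Junge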

It might be interesting to count how many elements of y_train belong to each class right before calling fit. - Daniel Möller
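A minimal sketch of that check, using only the standard library. The list `encoded_labels` is made-up data standing in for the integer labels just before they are passed to `to_categorical` and `fit`; the helper name `class_counts` is hypothetical.

```python
from collections import Counter


def class_counts(labels):
    """Return how many samples carry each label value."""
    return Counter(labels)


# Hypothetical encoded labels (an assumption, not data from the post).
encoded_labels = [1, 1, 2, 1, 1, 2, 1]

counts = class_counts(encoded_labels)
for label, count in sorted(counts.items()):
    print(label, count)
```

If the number of distinct values is smaller than the number of classes you expect, several classes have been merged before training even starts.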

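@Junge, I'm not very experienced with Tokenizer, but maybe you are predicting on a batch, so result contains the predictions for the first batch (default size 32)? What is the shape of the result variable? - oak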


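OK, something interesting happened. I used one_hot on the Y variable. Since one_hot is not collision-free, collisions occurred. It encoded everything except "weather" as 1, and "weather" as 2. Now the model reliably predicts that everything is 2. I have fixed that, and now it almost always predicts "news", but that is a problem I can probably get under control. - Junge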

1 Answer


- Junge · Accepted Answer

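Thanks to a hint from @Daniel Möller, I found the problem. His suggestion was to look at the number of instances of each class in the training set.

In my case, I found that hashing the classes with one_hot is not a good idea, because it sometimes encodes several classes with the same number. For me, one_hot encoded almost everything as 1, so Keras simply learned to predict 1.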
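The collision can be illustrated without Keras. The sketch below rests on an assumption: as far as I understand, one_hot hashes each word into the range 1 to n-1, so with n equal to the number of classes there are fewer buckets than classes and at least two classes must share an index. `hashed_index` is a hypothetical stand-in that imitates this with md5, not the library's own function, and `build_label_index` is one possible replacement that gives every class its own index.

```python
from hashlib import md5


def hashed_index(word, n):
    """Imitate a hashing-based encoder: map a word into the range 1..n-1."""
    digest = int(md5(word.encode("utf-8")).hexdigest(), 16)
    return digest % (n - 1) + 1


def build_label_index(labels):
    """Give each distinct label its own index 0..k-1, in sorted order."""
    return {label: i for i, label in enumerate(sorted(set(labels)))}


# Class names taken from the question; the sample order is made up.
labels = ["News", "Wetter", "Sport", "Wirtschaft", "Sport", "News"]
num_classes = len(set(labels))

# Hashing: only num_classes - 1 buckets exist, so a collision is unavoidable.
hashed = {label: hashed_index(label, num_classes) for label in set(labels)}
print(len(set(hashed.values())), "distinct hashed indices for", num_classes, "classes")

# Explicit mapping: collision-free by construction.
label_index = build_label_index(labels)
y = [label_index[label] for label in labels]
print(label_index)
print(y)
```

With a mapping like this, the list used to decode predictions (legende in the question) should be built from the same dictionary, for example `sorted(label_index, key=label_index.get)`, so that the class order at prediction time matches the order used in training.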