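I have an imbalanced multi-class dataset and want to use the class_weight parameter of fit_generator to weight each class according to the number of images it contains. I am loading the dataset from a directory with ImageDataGenerator.flow_from_directory.
Is it possible to infer the class_weight argument directly from the ImageDataGenerator object?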
I just figured out a way to do this.
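from collections import Counter
from keras.preprocessing.image import ImageDataGenerator
train_datagen = ImageDataGenerator()
train_generator = train_datagen.flow_from_directory(...)
# Count how many images belong to each class index
counter = Counter(train_generator.classes)
max_val = float(max(counter.values()))
# The largest class gets weight 1.0; smaller classes get proportionally larger weights
class_weights = {class_id: max_val / num_images for class_id, num_images in counter.items()}
model.fit_generator(...,
                    class_weight=class_weights)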
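train_generator.classes is a list holding the class index of every image. Counter(train_generator.classes) builds a counter with the number of images in each class.
Note that these weights may not be good for convergence, but you can use them as a starting point for other frequency-based weighting schemes.
This answer was inspired by: https://github.com/fchollet/keras/issues/1875#issuecomment-273752868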
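To make the weighting above concrete, here is a minimal sketch that runs without Keras. The label list is a hypothetical stand-in for train_generator.classes, and max_ratio_class_weights is an illustrative helper name, not part of any library.

```python
from collections import Counter

def max_ratio_class_weights(labels):
    """Weight each class by (largest class count) / (its own count)."""
    counter = Counter(labels)
    max_val = float(max(counter.values()))
    return {class_id: max_val / n for class_id, n in counter.items()}

# Hypothetical labels: 6 images of class 0, 3 of class 1, 1 of class 2
labels = [0] * 6 + [1] * 3 + [2] * 1
weights = max_ratio_class_weights(labels)
print(weights)  # {0: 1.0, 1: 2.0, 2: 6.0}
```

The most frequent class always ends up with weight 1.0, and a class that is k times rarer gets weight k.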
train_generator.classes equals [1, 1, 0]. - Fábio Perez
Alternatively, you can simply do:
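from sklearn.utils import class_weight
import numpy as np
# classes and y are keyword-only in recent scikit-learn releases
class_weights = class_weight.compute_class_weight(
    'balanced',
    classes=np.unique(train_generator.classes),
    y=train_generator.classes)
# class_weights is a NumPy array ordered like np.unique(train_generator.classes)
You can then set (as per the comment above): model.fit_generator(..., class_weight=class_weights)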
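sklearn.utils.class_weight gave higher accuracy, but I am not sure why. The two approaches do not produce the same class weights.
from sklearn.utils import class_weight
import numpy as np
# Pair each class index with its 'balanced' weight so Keras receives a dict
class_weights = dict(zip(
    np.unique(traingen.classes),
    class_weight.compute_class_weight(
        class_weight='balanced',
        classes=np.unique(traingen.classes),
        y=traingen.classes)))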
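On the remark above that the Counter-based weights and the 'balanced' weights differ: the sketch below compares the two on a hypothetical label list, using plain Python so it runs without scikit-learn. The helper names are illustrative, and balanced_weights reimplements the documented formula n_samples / (n_classes * count) rather than calling the library.

```python
from collections import Counter

def max_ratio_weights(labels):
    # Largest class count divided by each class count
    counts = Counter(labels)
    largest = float(max(counts.values()))
    return {c: largest / n for c, n in counts.items()}

def balanced_weights(labels):
    # Formula documented for 'balanced': n_samples / (n_classes * count)
    counts = Counter(labels)
    n_samples = float(len(labels))
    n_classes = len(counts)
    return {c: n_samples / (n_classes * n) for c, n in counts.items()}

# Hypothetical labels: 6 images of class 0, 3 of class 1, 1 of class 2
labels = [0] * 6 + [1] * 3 + [2] * 1
a = max_ratio_weights(labels)
b = balanced_weights(labels)

# Ratio between the two schemes, per class
ratios = {c: a[c] / b[c] for c in a}
print(ratios)  # every ratio is about 1.8 (= 6 * 3 / 10)
```

In this sketch the two schemes differ only by a constant factor, so the relative emphasis between classes is the same. The overall scale of the loss does change, which may affect the effective step size with some optimizers; whether that explains the accuracy difference reported above is an assumption, not something the thread confirms.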
from sklearn.utils import class_weight
import numpy as np
class_weights = class_weight.compute_class_weight(
    'balanced',
    classes=np.unique(train_generator.classes),
    y=train_generator.classes)
# Keras expects a dict mapping class index to weight
train_class_weights = dict(enumerate(class_weights))
model.fit_generator(..., class_weight=train_class_weights)
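As suggested here, a good way to assign class weights is:
(1 / class_count) * (total_count / 2)
from collections import Counter
counter = Counter(train_generator.classes)
total = float(sum(counter.values()))
# Note: the divisor 2.0 matches the 'balanced' formula only for a two-class problem
class_weight = {class_id: (1 / num_images) * total / 2.0 for class_id, num_images in counter.items()}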
April 2023 version. I ended up using this:
import numpy as np
from sklearn.utils.class_weight import compute_class_weight
unique_classes = np.unique(ds_train.classes)
# "If ‘balanced’, class weights will be given by n_samples / (n_classes * np.bincount(y))."
class_weights = compute_class_weight("balanced", classes=unique_classes, y=ds_train.classes)
class_weight = {class_id: weight for class_id, weight in zip(unique_classes, class_weights)}
model.fit(..., class_weight=class_weight)
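The formula quoted in the comment above can also be checked by hand with NumPy alone. This is a sketch under the assumption that the labels are non-negative integer class indices, as produced by flow_from_directory; balanced_class_weight_dict is a hypothetical helper name and the label array is made up for illustration.

```python
import numpy as np

def balanced_class_weight_dict(labels):
    """n_samples / (n_classes * np.bincount(y)), returned as a dict for Keras."""
    labels = np.asarray(labels)
    counts = np.bincount(labels)
    # Only keep class indices that actually occur in the labels
    present = np.nonzero(counts)[0]
    weights = labels.size / (present.size * counts[present])
    return {int(c): float(w) for c, w in zip(present, weights)}

# Hypothetical labels: 6 images of class 0, 3 of class 1, 1 of class 2
labels = np.array([0] * 6 + [1] * 3 + [2] * 1)
class_weight = balanced_class_weight_dict(labels)

# With these weights, the weighted sample count equals the original sample count
weighted_total = sum(class_weight[int(c)] for c in labels)
print(class_weight, weighted_total)
```

One property worth noting about this weighting: the weights sum over all samples to n_samples, so the average per-sample weight stays at 1.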