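I'm trying to use TensorFlow's tf.contrib.training.stratified_sample to balance classes. I put together a quick example to test it, drawing samples in a balanced way from two imbalanced classes and checking the result, but I get an error.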
import numpy as np
import tensorflow as tf
from tensorflow.python.framework import ops
from tensorflow.python.framework import dtypes
batch_size = 10
data = ['a']*9990+['b']*10
labels = [1]*9990+[0]*10
data_tensor = ops.convert_to_tensor(data, dtype=dtypes.string)
label_tensor = ops.convert_to_tensor(labels)
target_probs = [0.5,0.5]
data_batch, label_batch = tf.contrib.training.stratified_sample(
    data_tensor, label_tensor, target_probs, batch_size,
    queue_capacity=2*batch_size)
with tf.Session() as sess:
    d, l = sess.run([data_batch, label_batch])
    print('percentage "a" = %.3f' % (np.sum(l)/len(l)))
The error I get is:
Traceback (most recent call last):
File "/home/jason/code/scrap.py", line 56, in <module>
test_stratified_sample()
File "/home/jason/code/scrap.py", line 47, in test_stratified_sample
queue_capacity=2*batch_size)
File "/usr/local/lib/python3.4/dist-packages/tensorflow/contrib/training/python/training/sampling_ops.py", line 191, in stratified_sample
with ops.name_scope(name, 'stratified_sample', tensors + [labels]):
File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/ops/math_ops.py", line 829, in binary_op_wrapper
y = ops.convert_to_tensor(y, dtype=x.dtype.base_dtype, name="y")
File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/framework/ops.py", line 676, in convert_to_tensor
as_ref=False)
File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/framework/ops.py", line 741, in internal_convert_to_tensor
ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/framework/constant_op.py", line 113, in _constant_tensor_conversion_function
return constant(v, dtype=dtype, name=name)
File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/framework/constant_op.py", line 102, in constant
tensor_util.make_tensor_proto(value, dtype=dtype, shape=shape, verify_shape=verify_shape))
File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/framework/tensor_util.py", line 374, in make_tensor_proto
_AssertCompatible(values, dtype)
File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/framework/tensor_util.py", line 302, in _AssertCompatible
(dtype.name, repr(mismatch), type(mismatch).__name__))
TypeError: Expected string, got list containing Tensors of type '_Message' instead.
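The error message doesn't tell me what I did wrong. I also tried passing in the raw data and labels (without converting them to tensors), and tried using tf.train.slice_input_producer to create an initial queue of the data and label tensors. Has anyone gotten stratified_sample to work? I couldn't find any examples.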
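If I change data = ['a']*9990+['b']*10 labels = [1]*9990+[0]*10 to data = ['a']*9000+['b']*1000 labels = [1]*9000+[0]*1000 in your code, it breaks and only produces examples of class 1 ("a"). Your code works as posted, but I can't figure out why this change (which obviously makes it more realistic, since the batch size is much smaller than the number of examples in either class) causes it to fail. The classes are also more balanced, yet the result is completely unbalanced. - Jason

stratified_sample doesn't shuffle, it just accepts/rejects. So if the input isn't in random order, the output won't be either. I've added a shuffle step to the example, with a high min_after_dequeue to make sure the data is shuffled before sampling. This was a problem even with the higher imbalance; it was just hidden because so many examples from the majority class were being discarded. - Allen Lavoie

Replace shuffled_data, shuffled_labels = tf.train.shuffle_batch(...) with shuffled_data, shuffled_labels = tf.train.slice_input_producer([data_tensor, label_tensor], shuffle=True, capacity=3*batch_size) and set enqueue_many to False. This is faster (about 9 s vs. 120 s) because it rejects individual examples rather than whole batches. - Jason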
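
The accept/reject behaviour Allen Lavoie describes can be illustrated without TensorFlow. The following is a simplified pure-Python model, not the library's actual implementation: the function names, the fixed init_probs of [0.1, 0.9], and the sample count of 500 are all assumptions chosen to match the 9000/1000 example above. It only shows that rejection sampling preserves the order of its input, so an unshuffled stream yields one-sided output.

```python
import itertools
import random


def rejection_sample(stream, init_probs, target_probs, rng):
    """Yield labels from `stream`, keeping each with a class-dependent probability."""
    # A class is kept in proportion to target/init, scaled so that the most
    # under-represented class is always accepted.
    ratios = [t / p for t, p in zip(target_probs, init_probs)]
    accept = [r / max(ratios) for r in ratios]
    for label in stream:
        if rng.random() < accept[label]:
            yield label


def fraction_of_class_1(labels, n_samples, seed=0):
    """Fraction of label 1 among the first `n_samples` accepted examples."""
    rng = random.Random(seed)
    # Assumed class frequencies: 10% label 0 ("b"), 90% label 1 ("a").
    sampler = rejection_sample(labels, [0.1, 0.9], [0.5, 0.5], rng)
    kept = list(itertools.islice(sampler, n_samples))
    return sum(kept) / len(kept)


ordered = [1] * 9000 + [0] * 1000   # every "a" first, then every "b"
shuffled = ordered[:]
random.Random(1).shuffle(shuffled)  # same examples in random order

ordered_fraction = fraction_of_class_1(ordered, 500)
shuffled_fraction = fraction_of_class_1(shuffled, 500)
print("ordered input:  fraction of class 1 = %.3f" % ordered_fraction)
print("shuffled input: fraction of class 1 = %.3f" % shuffled_fraction)
```

With the ordered input, every accepted example comes from class 1, because the sampler fills its quota long before it reaches the first "b". With the shuffled input, the fraction should land near the 0.5 target.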