TensorFlow stratified sampling error

Question

python python-3.x machine-learning tensorflow

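I'm trying to use TensorFlow's tf.contrib.training.stratified_sample to balance classes. I put together a quick example to test it, drawing balanced samples from two imbalanced classes and verifying the result, but I get an error.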

import tensorflow as tf
from tensorflow.python.framework import ops
from tensorflow.python.framework import dtypes

batch_size = 10
data = ['a']*9990+['b']*10
labels = [1]*9990+[0]*10
data_tensor = ops.convert_to_tensor(data, dtype=dtypes.string)
label_tensor = ops.convert_to_tensor(labels)
target_probs = [0.5,0.5]
data_batch, label_batch = tf.contrib.training.stratified_sample(
    data_tensor, label_tensor, target_probs, batch_size,
    queue_capacity=2*batch_size)

with tf.Session() as sess:
    d,l = sess.run(data_batch,label_batch)
print('percentage "a" = %.3f' % (np.sum(l)/len(l)))

The error I get is:

Traceback (most recent call last):
  File "/home/jason/code/scrap.py", line 56, in <module>
    test_stratified_sample()
  File "/home/jason/code/scrap.py", line 47, in test_stratified_sample
    queue_capacity=2*batch_size)
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/contrib/training/python/training/sampling_ops.py", line 191, in stratified_sample
    with ops.name_scope(name, 'stratified_sample', tensors + [labels]):
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/ops/math_ops.py", line 829, in binary_op_wrapper
    y = ops.convert_to_tensor(y, dtype=x.dtype.base_dtype, name="y")
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/framework/ops.py", line 676, in convert_to_tensor
    as_ref=False)
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/framework/ops.py", line 741, in internal_convert_to_tensor
    ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/framework/constant_op.py", line 113, in _constant_tensor_conversion_function
    return constant(v, dtype=dtype, name=name)
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/framework/constant_op.py", line 102, in constant
    tensor_util.make_tensor_proto(value, dtype=dtype, shape=shape, verify_shape=verify_shape))
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/framework/tensor_util.py", line 374, in make_tensor_proto
    _AssertCompatible(values, dtype)
  File "/usr/local/lib/python3.4/dist-packages/tensorflow/python/framework/tensor_util.py", line 302, in _AssertCompatible
    (dtype.name, repr(mismatch), type(mismatch).__name__))
TypeError: Expected string, got list containing Tensors of type '_Message' instead.

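The error message doesn't explain what I'm doing wrong. I have also tried passing in the raw data and labels (without converting them to tensors), and using tf.train.slice_input_producer to create an initial queue of the data and label tensors.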

Has anyone gotten stratified_sample to work? I couldn't find any examples.

- Jason

1 Answer

- Allen Lavoie · Accepted Answer

I've modified the code into something that works for me. Summary of the changes:

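Use enqueue_many=True to enqueue a batch of examples with different labels. Otherwise it expects a single scalar label Tensor (which can be stochastic when evaluated by the queue runners).
The first argument should be a list of Tensors. It should have a better error message (I think this is the issue you ran into). Please do make a pull request or open an issue on Github for a better error message.
Start the queue runners. Otherwise code that uses queues will deadlock. Alternatively, use an Estimator or MonitoredSession so you don't need to worry about this.
(Edit based on comments) stratified_sample does not shuffle the data, it only accepts/rejects! So if your data is not randomized, consider putting it in random order before sampling, via slice_input_producer (enqueue_many=False) or shuffle_batch (enqueue_many=True).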
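
To make the accept/reject point concrete, here is a minimal pure-Python sketch of rejection sampling for class balancing. It is not the TensorFlow implementation: the helper names acceptance_probs and rejection_sample are hypothetical, and the rule of scaling target/data ratios so the largest becomes 1 is an assumption chosen for illustration. It shows why unshuffled input gives one-sided batches even when the acceptance probabilities are right.

```python
import random

def acceptance_probs(data_probs, target_probs):
    # Ratio of desired to observed frequency per class, scaled so the
    # most under-represented class is always accepted (probability 1).
    ratios = [t / d for t, d in zip(target_probs, data_probs)]
    top = max(ratios)
    return [r / top for r in ratios]

def rejection_sample(examples, labels, accept, batch_size, rng):
    # Walk the stream in its given order; keep each example with the
    # acceptance probability of its class. No reordering happens here.
    batch = []
    for x, y in zip(examples, labels):
        if rng.random() < accept[y]:
            batch.append((x, y))
            if len(batch) == batch_size:
                break
    return batch

rng = random.Random(0)
data = ['a'] * 9000 + ['b'] * 1000
labels = [1] * 9000 + [0] * 1000

# Class 0 ('b') makes up 10% of the data, class 1 ('a') 90%.
accept = acceptance_probs([0.1, 0.9], [0.5, 0.5])

# Unshuffled stream: the first 100 accepted examples are all class 1.
unshuffled_batch = rejection_sample(data, labels, accept, 100, rng)

# Shuffled stream: both classes can show up in the batch.
pairs = list(zip(data, labels))
rng.shuffle(pairs)
shuffled_data, shuffled_labels = zip(*pairs)
batch = rejection_sample(shuffled_data, shuffled_labels, accept, 100, rng)
```

With these inputs the acceptance probabilities are 1 for class 0 and 1/9 for class 1, so roughly equal numbers of each class survive once the stream is in random order.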

Modified code (improved based on Jason's comments):

import numpy
import tensorflow as tf
from tensorflow.python.framework import ops
from tensorflow.python.framework import dtypes

with tf.Graph().as_default():
  batch_size = 100
  # Imbalanced toy data: 90% 'a' (label 1), 10% 'b' (label 0).
  data = ['a']*9000+['b']*1000
  labels = [1]*9000+[0]*1000
  data_tensor = ops.convert_to_tensor(data, dtype=dtypes.string)
  label_tensor = ops.convert_to_tensor(labels, dtype=dtypes.int32)
  # Shuffle first: stratified_sample only accepts/rejects, it does not reorder.
  shuffled_data, shuffled_labels = tf.train.slice_input_producer(
      [data_tensor, label_tensor], shuffle=True, capacity=3*batch_size)
  target_probs = numpy.array([0.5,0.5])
  # The first argument must be a list of Tensors, hence [shuffled_data].
  data_batch, label_batch = tf.contrib.training.stratified_sample(
      [shuffled_data], shuffled_labels, target_probs, batch_size,
      queue_capacity=2*batch_size)

  with tf.Session() as session:
    tf.local_variables_initializer().run()
    tf.global_variables_initializer().run()
    # Start the queue runners, otherwise session.run below would deadlock.
    coordinator = tf.train.Coordinator()
    tf.train.start_queue_runners(session, coord=coordinator)
    num_iter = 10
    sum_ones = 0.
    for _ in range(num_iter):
      d, l = session.run([data_batch, label_batch])
      count_ones = l.sum()
      sum_ones += float(count_ones)
      print('percentage "a" = %.3f' % (float(count_ones) / len(l)))
    print('Overall: {}'.format(sum_ones / (num_iter * batch_size)))
    coordinator.request_stop()
    coordinator.join()

Output:

percentage "a" = 0.480
percentage "a" = 0.440
percentage "a" = 0.580
percentage "a" = 0.570
percentage "a" = 0.580
percentage "a" = 0.520
percentage "a" = 0.480
percentage "a" = 0.460
percentage "a" = 0.390
percentage "a" = 0.530
Overall: 0.503
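
As a rough sanity check on the numbers above, the per-batch fractions can be compared with the spread expected from chance alone. This sketch assumes, purely for illustration, that each of the 100 slots in a batch is an independent draw with probability 0.5 of being "a"; that assumption is not taken from the library.

```python
import math

# Per-batch fractions copied from the output above.
fractions = [0.480, 0.440, 0.580, 0.570, 0.580,
             0.520, 0.480, 0.460, 0.390, 0.530]
batch_size = 100
target = 0.5

# Mean over the ten batches; matches the "Overall" line.
overall = sum(fractions) / len(fractions)

# Standard deviation of a batch fraction under the independence assumption.
std_per_batch = math.sqrt(target * (1 - target) / batch_size)

# Largest deviation of any single batch from the target.
max_dev = max(abs(f - target) for f in fractions)

print('overall = %.3f, std per batch = %.3f, max deviation = %.3f'
      % (overall, std_per_batch, max_dev))
```

Under that assumption, a per-batch standard deviation of 0.05 means swings of several points around 0.5 are expected, and the mean over ten batches should sit much closer to the target, which is what the output shows.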