TensorFlow reading a CSV file - what is the best approach?


I have been experimenting with different ways to read a CSV file with 97K rows and 500 features per row (roughly 100 MB).

My first approach was to read all the data into memory using numpy:

import numpy
from numpy import genfromtxt

raw_data = genfromtxt(filename, dtype=numpy.int32, delimiter=',')

This command took far too long to run, so I needed a better way to read the file.
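As an aside, if loading everything into memory is acceptable, pandas' C-based CSV parser is usually much faster than numpy.genfromtxt for a file of this size. A minimal sketch, assuming the file has no header row:

import numpy as np
import pandas as pd

# pandas.read_csv uses a fast C parser; .values returns a plain numpy array
raw_data = pd.read_csv(filename, header=None, dtype=np.int32).values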

My second approach was to follow this guide: https://www.tensorflow.org/programmers_guide/reading_data

The first thing I noticed is that every epoch takes much longer to run. Since I am using stochastic gradient descent, this can be explained by every batch having to be read from the file.

Is there a way to optimize this second approach?

My code (second approach):

reader = tf.TextLineReader()
filename_queue = tf.train.string_input_producer([filename])
_, csv_row = reader.read(filename_queue) # read one line
data = tf.decode_csv(csv_row, record_defaults=rDefaults) # use defaults for this line (in case of missing data)

labels = data[0]
features = data[labelsSize:labelsSize+featuresSize]

# minimum number elements in the queue after a dequeue, used to ensure 
# that the samples are sufficiently mixed
# I think 10 times the BATCH_SIZE is sufficient
min_after_dequeue = 10 * batch_size

# the maximum number of elements in the queue
capacity = 20 * batch_size

# shuffle the data to generate BATCH_SIZE sample pairs
features_batch, labels_batch = tf.train.shuffle_batch(
    [features, labels],
    batch_size=batch_size,
    num_threads=10,
    capacity=capacity,
    min_after_dequeue=min_after_dequeue)

* * * *

coordinator = tf.train.Coordinator()
threads = tf.train.start_queue_runners(sess=sess, coord=coordinator)

try:
    # And then after everything is built, start the training loop.
    for step in xrange(max_steps):
        global_step = step + offset_step
        start_time = time.time()

        # Run one step of the model.  The return values are the activations
        # from the `train_op` (which is discarded) and the `loss` Op.  To
        # inspect the values of your Ops or variables, you may include them
        # in the list passed to sess.run() and the value tensors will be
        # returned in the tuple from the call.
        _, __, loss_value, summary_str = sess.run(
            [eval_op_train, train_op, loss_op, summary_op])

except tf.errors.OutOfRangeError:
    print('Done training -- epoch limit reached')
finally:
    coordinator.request_stop()

# Wait for threads to finish.
coordinator.join(threads)
sess.close()

You could try using pickle after reading and formatting the raw data from the CSV file. It might be faster (but I can't guarantee it). - Ashoka Lella
Take a look at this: https://dev59.com/q1oU5IYBdhLWcg3wzZE1 - loretoparisi
Could you reduce min_after_dequeue to 100 and see how the performance changes? - Harsha Pokkalla
1 Answer

One solution is to convert the data to TensorFlow's binary format using TFRecords. See TensorFlow Data Input (Part 1): Placeholders, Protobufs, and Queues, and this snippet for converting a CSV file into TFRecords:
import pandas
import tensorflow as tf

csv = pandas.read_csv("your.csv").values
with tf.python_io.TFRecordWriter("csv.tfrecords") as writer:
    for row in csv:
        features, label = row[:-1], row[-1]
        example = tf.train.Example()
        example.features.feature["features"].float_list.value.extend(features)
        example.features.feature["label"].int64_list.value.append(label)
        writer.write(example.SerializeToString())
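To actually consume the file, here is a sketch of reading those TFRecords back through the same queue-based pipeline used in the question; the 500-feature width and the int64 label are assumptions matching the writer above:

import tensorflow as tf

filename_queue = tf.train.string_input_producer(["csv.tfrecords"])
reader = tf.TFRecordReader()
_, serialized_example = reader.read(filename_queue)

# parse one serialized Example back into dense tensors
parsed = tf.parse_single_example(
    serialized_example,
    features={
        "features": tf.FixedLenFeature([500], tf.float32),  # assumed width
        "label": tf.FixedLenFeature([], tf.int64),
    })

features_batch, labels_batch = tf.train.shuffle_batch(
    [parsed["features"], parsed["label"]],
    batch_size=batch_size,
    num_threads=10,
    capacity=20 * batch_size,
    min_after_dequeue=10 * batch_size)

Reading pre-parsed binary Examples this way avoids re-parsing CSV text on every epoch, which is typically where much of the per-batch overhead comes from.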

When streaming very large files from the local file system (or, in a real-world setup, from remote storage such as AWS S3, HDFS, etc.), the Gensim smart_open Python library is very helpful:
import smart_open

# stream lines from an S3 object
for line in smart_open.smart_open('s3://mybucket/mykey.txt'):
    print line
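For example, a minimal sketch (the S3 path is a placeholder) that streams the question's CSV one row at a time instead of reading the whole ~100 MB file into memory:

import smart_open

# placeholder S3 path; each iteration yields one raw CSV line
for line in smart_open.smart_open('s3://mybucket/data.csv'):
    row = line.decode('utf-8').rstrip().split(',')
    # ... convert the fields and write them out, e.g. via the TFRecord
    # writer shown above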
