When to use the TensorFlow Dataset API versus pandas or NumPy?

Question

csv tensorflow preprocessor tensorflow-datasets

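I have seen a number of guides on using LSTMs for time series in TensorFlow, but I am still unsure about the current best practices for reading and processing data, in particular when one is supposed to use the tf.data.Dataset API.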

In my situation I have a file data.csv with my features, and I would like to do the following two tasks:

1. Compute targets - the target at time t is the percent change of some column at some horizon, i.e.,
   ```
   labels[i] = features[i + h, -1] / features[i, -1] - 1
   ```
   I would like h to be a parameter here, so I can experiment with different horizons.
2. Get rolling windows - for training purposes, I need to roll my features into windows of length window:
   ```
   train_features[i] = features[i: i + window]
   ```

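I am perfectly comfortable constructing these objects with pandas or NumPy, so I am not asking how to achieve this in general. My question is specifically what such a pipeline ought to look like in TensorFlow.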

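Edit: I would also like to know whether the two tasks I listed are suited to the Dataset API, or whether I am better off using other libraries to deal with them.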

- ira

Use the tf.data.Dataset API whenever you can, since TensorFlow can better optimize the read and write operations so that they do not bottleneck training time. - Evan Weissburg

1 Answer

- Maxim · Accepted Answer

First of all, note that you can use the Dataset API together with pandas or NumPy arrays, as described in the tutorial:

If all of your input data fit in memory, the simplest way to create a Dataset from them is to convert them to tf.Tensor objects and use Dataset.from_tensor_slices().
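As a minimal sketch of this in-memory route (assuming TensorFlow 2.x with eager execution, which is newer than the session-based snippet quoted further down), the two tasks from the question can be done in NumPy and the result handed to `Dataset.from_tensor_slices()`. The function name `make_dataset` and the choice to pair each window with the label at its last row are assumptions for illustration; the question does not specify how windows and labels are aligned.

```python
import numpy as np
import tensorflow as tf


def make_dataset(features, h, window, batch_size=32):
    """Build (window, label) pairs in NumPy and wrap them in a tf.data.Dataset.

    Assumption: the window features[i : i + window] is paired with the
    percent change of the last column, h steps after the window's last row.
    """
    features = np.asarray(features, dtype=np.float32)
    n_samples = len(features) - window - h + 1
    if n_samples <= 0:
        raise ValueError("not enough rows for this window and horizon")

    # train_features[i] = features[i : i + window]
    windows = np.stack([features[i:i + window] for i in range(n_samples)])

    # labels[t] = features[t + h, -1] / features[t, -1] - 1,
    # evaluated at t = last row of each window.
    last_col = features[:, -1]
    ends = np.arange(n_samples) + window - 1
    labels = last_col[ends + h] / last_col[ends] - 1.0

    dataset = tf.data.Dataset.from_tensor_slices((windows, labels))
    return dataset.batch(batch_size)
```

Because `h` and `window` are plain arguments, trying a different horizon only means calling the function again. For a 10-row, 2-column array with `h=2` and `window=3`, this produces 6 samples, each of shape (3, 2).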

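A more interesting question is whether you should organize the data pipeline with session feed_dict or via Dataset methods. As already stated in the comments, the Dataset API is more efficient, because the data flows directly to the device, bypassing the client. From the performance guide: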

While feeding data using a feed_dict offers a high level of flexibility, in most instances using feed_dict does not scale optimally. However, in instances where only a single GPU is being used the difference can be negligible. Using the Dataset API is still strongly recommended. Try to avoid the following:
```
# feed_dict often results in suboptimal performance when using large inputs
sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})
```

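However, as they state themselves, the difference may be negligible, and the GPU can still be fully utilized with ordinary feed_dict input. When training speed is not critical, there is no difference, so use whichever pipeline you feel comfortable with. When speed matters and you have a large training set, the Dataset API seems the better choice, especially if you plan to do distributed computation.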

The Dataset API works nicely with text data such as CSV files; see this section of the Dataset tutorial.
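For the case where the CSV should be streamed rather than loaded into memory, the following is a rough sketch of how both tasks might be expressed purely with Dataset methods. It assumes TensorFlow 2.x, a CSV with one header row and only numeric columns, and the same window/label alignment as a window paired with the label at its last row; the helper name `csv_windows` is hypothetical.

```python
import tensorflow as tf


def csv_windows(path, n_cols, window, h):
    """Stream a numeric CSV into (window, label) pairs using tf.data only.

    Assumptions: the file has a single header row, all n_cols columns are
    numeric, and the label is the percent change of the last column h rows
    after the final row of each window.
    """
    lines = tf.data.TextLineDataset(path).skip(1)  # drop the header row

    # One float default per column; each line becomes a vector of n_cols floats.
    defaults = [[0.0]] * n_cols
    rows = lines.map(
        lambda line: tf.stack(tf.io.decode_csv(line, record_defaults=defaults))
    )

    # Slide over window + h consecutive rows, one row at a time.
    span = window + h
    chunks = rows.window(span, shift=1, drop_remainder=True)
    chunks = chunks.flat_map(lambda w: w.batch(span, drop_remainder=True))

    def split(chunk):
        x = chunk[:window]              # features[i : i + window]
        last_col = chunk[:, -1]
        y = last_col[span - 1] / last_col[window - 1] - 1.0
        return x, y

    return chunks.map(split)
```

The result can then be chained with `.shuffle(...)`, `.batch(...)`, and `.prefetch(...)` as usual. Whether this is preferable to preparing the arrays in pandas or NumPy first depends mainly on whether the data fits in memory.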