How do I import a pre-downloaded MNIST dataset from a specific directory or folder?

12

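I have downloaded the MNIST dataset from LeCun's website. I want to write Python code that extracts the gzip files and reads the dataset directly from a directory, so that I no longer need to download it or access the MNIST website.

Desired process: access folder/directory --> extract gzip --> read dataset (one-hot encoded)

How can I do this? Almost all tutorials download and read the dataset by accessing the LeCun or TensorFlow website. Thanks!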


2
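You should first extract the gzip files to your local machine, then read the images into Python with scipy.misc.imread or OpenCV. - yuji
Have you tried anything? - Vivek Kumar
Yes, I tried removing the line "from tensorflow.examples.tutorials.mnist import input_data", but it still downloads the dataset from the website. I'm still wondering why it accesses and downloads the dataset even when only the line "mnist = input_data.read_data_sets('mnist_data/', one_hot=True)" is left in. - Joshua
3 Answers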

10
If you have already extracted the MNIST data, you can load it directly at a low level with NumPy:
import numpy as np

def loadMNIST( prefix, folder ):
    # IDX headers store big-endian 32-bit integers
    intType = np.dtype( 'int32' ).newbyteorder( '>' )
    nMetaDataBytes = 4 * intType.itemsize

    data = np.fromfile( folder + "/" + prefix + '-images-idx3-ubyte', dtype = 'ubyte' )
    # image header: magic number, image count, rows, columns
    magicBytes, nImages, height, width = np.frombuffer( data[:nMetaDataBytes].tobytes(), intType )
    data = data[nMetaDataBytes:].astype( dtype = 'float32' ).reshape( [ nImages, height, width ] )

    # label header is two integers (magic number, label count); skip it
    labels = np.fromfile( folder + "/" + prefix + '-labels-idx1-ubyte',
                          dtype = 'ubyte' )[2 * intType.itemsize:]

    return data, labels

trainingImages, trainingLabels = loadMNIST( "train", "../datasets/mnist/" )
testImages, testLabels = loadMNIST( "t10k", "../datasets/mnist/" )
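The loader above expects files that are already decompressed. If the four files are still in their downloaded `.gz` form, a minimal sketch along the following lines reads them in place with the standard `gzip` module, which matches the "access folder --> extract gzip --> read dataset" process from the question. The function names `load_idx_gz` and `load_mnist_gz` are hypothetical, and the sketch assumes the files keep their original names and use the unsigned-byte IDX layout.

```python
import gzip
import os
import struct

import numpy as np


def load_idx_gz(path):
    """Read one gzip-compressed IDX file into a NumPy array."""
    with gzip.open(path, "rb") as f:
        # Header: two zero bytes, a dtype code, then the number of dimensions.
        _zeros, dtype_code, ndim = struct.unpack(">HBB", f.read(4))
        if dtype_code != 0x08:  # 0x08 means unsigned byte, which MNIST uses
            raise ValueError("unexpected IDX dtype code: %#x" % dtype_code)
        # Each dimension size is a big-endian 32-bit unsigned integer.
        shape = struct.unpack(">" + "I" * ndim, f.read(4 * ndim))
        data = np.frombuffer(f.read(), dtype=np.uint8)
    return data.reshape(shape)


def load_mnist_gz(folder, prefix):
    """Load images and labels for prefix 'train' or 't10k' from folder."""
    images = load_idx_gz(os.path.join(folder, prefix + "-images-idx3-ubyte.gz"))
    labels = load_idx_gz(os.path.join(folder, prefix + "-labels-idx1-ubyte.gz"))
    return images.astype("float32"), labels
```

With the directory layout used above, this would be called as `load_mnist_gz("../datasets/mnist", "train")` and `load_mnist_gz("../datasets/mnist", "t10k")`.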

To convert to one-hot encoding:

def toHotEncoding( classification ):
    # emulates the functionality of tf.keras.utils.to_categorical( y )
    hotEncoding = np.zeros( [ len( classification ), 
                              np.max( classification ) + 1 ] )
    hotEncoding[ np.arange( len( hotEncoding ) ), classification ] = 1
    return hotEncoding

trainingLabels = toHotEncoding( trainingLabels )
testLabels = toHotEncoding( testLabels )
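Because the function above sizes the matrix from the largest label it sees, a subset of labels that happens to lack the digit 9 would produce fewer than 10 columns. A possible variant, sketched here under the assumption that the number of classes is known in advance (10 for MNIST), fixes the width explicitly. The name `to_one_hot` and its `num_classes` parameter are hypothetical.

```python
import numpy as np


def to_one_hot(labels, num_classes=10):
    """Return a (len(labels), num_classes) float32 one-hot matrix."""
    labels = np.asarray(labels, dtype=np.int64)
    if labels.size and (labels.min() < 0 or labels.max() >= num_classes):
        raise ValueError("label outside range [0, num_classes)")
    # Row i of the identity matrix is the one-hot vector for class i.
    return np.eye(num_classes, dtype=np.float32)[labels]
```

Indexing the identity matrix with the label array selects one row per label, so the result always has `num_classes` columns regardless of which digits are present.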

9
This TensorFlow call (available in TensorFlow 1.x)
from tensorflow.examples.tutorials.mnist import input_data
input_data.read_data_sets('my/directory')

won't download anything if the corresponding files are already in that directory.
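If the call still downloads, one plausible cause is that the files in the directory are not stored under the names the loader looks for, for example because they were already decompressed or renamed. The hypothetical helper below is a small sketch for checking this, assuming the loader looks for the four standard `.gz` archive names from the MNIST distribution page.

```python
import os

# Archive names as published on the original MNIST distribution page.
MNIST_FILES = (
    "train-images-idx3-ubyte.gz",
    "train-labels-idx1-ubyte.gz",
    "t10k-images-idx3-ubyte.gz",
    "t10k-labels-idx1-ubyte.gz",
)


def missing_mnist_files(directory):
    """Return the expected MNIST archive names not found in directory."""
    return [name for name in MNIST_FILES
            if not os.path.isfile(os.path.join(directory, name))]
```

An empty list from `missing_mnist_files('my/directory')` means all four archives are present under the expected names.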

But if you want to do the extraction yourself, you can proceed as follows:

from tensorflow.contrib.learn.python.learn.datasets.mnist import extract_images, extract_labels

with open('my/directory/train-images-idx3-ubyte.gz', 'rb') as f:
  train_images = extract_images(f)
with open('my/directory/train-labels-idx1-ubyte.gz', 'rb') as f:
  train_labels = extract_labels(f)

with open('my/directory/t10k-images-idx3-ubyte.gz', 'rb') as f:
  test_images = extract_images(f)
with open('my/directory/t10k-labels-idx1-ubyte.gz', 'rb') as f:
  test_labels = extract_labels(f)

If you have time, please take a look at these questions: [https://stackoverflow.com/questions/64085547/mnist-datasets-from-google-drive-folder-showing-datasets-not-found] and [https://stackoverflow.com/questions/64080130/how-to-load-training-data-including-label-data-ubyte-format-of-images-from-loc]. - mostafiz67

4
I will show how to load it from scratch, for better understanding, and how to display a digit image with matplotlib.pyplot.
import pickle
import gzip
import numpy as np
import matplotlib.pyplot as plt

def load_data():
    path = '../../data/mnist.pkl.gz'
    # assuming the file was pickled under Python 2, Python 3 needs encoding='latin1'
    with gzip.open(path, 'rb') as f:
        training_data, validation_data, test_data = pickle.load(f, encoding='latin1')

    X_train, y_train = training_data[0], training_data[1]
    print(X_train.shape, y_train.shape)
    # (50000, 784) (50000,)

    # get the first image and its label
    img1_arr, img1_label = X_train[0], y_train[0]
    print(img1_arr.shape, img1_label)
    # (784,) 5

    # reshape the first image (1-D vector) into a 2-D image
    img1_2d = np.reshape(img1_arr, (28, 28))
    # show it
    plt.subplot(111)
    plt.imshow(img1_2d, cmap=plt.get_cmap('gray'))
    plt.show()

You can use the following example function to vectorize a label into a 10-dimensional unit vector:

def vectorized_result(label):
    e = np.zeros((10, 1))
    e[label] = 1.0
    return e

Vectorizing the label from above:
print(vectorized_result(img1_label))
# output as below:
[[ 0.]
 [ 0.]
 [ 0.]
 [ 0.]
 [ 0.]
 [ 1.]
 [ 0.]
 [ 0.]
 [ 0.]
 [ 0.]]

If you want to convert it into CNN input, you can reshape it as follows:
import gzip
import pickle
import numpy as np

def load_data_v2():
    path = '../../data/mnist.pkl.gz'
    # assuming the file was pickled under Python 2, Python 3 needs encoding='latin1'
    with gzip.open(path, 'rb') as f:
        training_data, validation_data, test_data = pickle.load(f, encoding='latin1')

    X_train, y_train = training_data[0], training_data[1]
    print(X_train.shape, y_train.shape)
    # (50000, 784) (50000,)

    # reshape every flat image to 28x28 and one-hot encode every label
    X_train = np.array([np.reshape(item, (28, 28)) for item in X_train])
    y_train = np.array([vectorized_result(item) for item in y_train])

    print(X_train.shape, y_train.shape)
    # (50000, 28, 28) (50000, 10, 1)
    return X_train, y_train
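Many CNN layers expect an explicit channel axis on the images and a 2-D label matrix, rather than the (N, 28, 28) and (N, 10, 1) shapes produced above. The sketch below shows one way to get there with a single `reshape` per array instead of a Python loop. The name `to_cnn_arrays` is hypothetical, and the channels-last layout (N, 28, 28, 1) is an assumption that depends on the framework being used.

```python
import numpy as np


def to_cnn_arrays(X_flat, y_column_vectors):
    """Reshape images to (N, 28, 28, 1) and one-hot labels to (N, 10)."""
    # Works for images given as (N, 784) or (N, 28, 28).
    X = np.asarray(X_flat, dtype=np.float32).reshape(-1, 28, 28, 1)
    # Drops the trailing axis from labels given as (N, 10, 1).
    y = np.asarray(y_column_vectors, dtype=np.float32).reshape(-1, 10)
    return X, y
```

Since `reshape` only changes how the same values are indexed, pixel values and label positions are preserved.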
