How do I unpack a pkl file?

Question

python pickle deep-learning mnist

136

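I have a pkl file from the MNIST dataset, which contains images of handwritten digits.

I'd like to look at each of those digit images, so I need to unpack the pkl file, but I can't work out how.

Is there a way to unpack/unzip a pkl file?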

- ytrewq

4 Answers

11

Handy one-liner

pkl() (
  python -c 'import pickle,sys;d=pickle.load(open(sys.argv[1],"rb"));print(d)' "$1"
)
pkl my.pkl

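This prints the __str__ of the pickled object.

The general problem of visualizing an object is of course undefined, so if __str__ is not enough you will need a custom script. @dataclass + pprint may be of interest: Is there a built-in function to print all the current properties and values of an object?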
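
As an illustration of such a custom script, here is a minimal sketch that summarizes the structure of an unpickled object instead of printing every value. The helper name `describe` and the sample object are made up for this example; they are not part of pickle or any library.

```python
import pickle

def describe(obj, indent=0, max_items=5):
    """Return lines summarizing the type and size of a (possibly nested) object."""
    pad = "  " * indent
    lines = []
    shape = getattr(obj, "shape", None)  # NumPy arrays and similar expose .shape
    if shape is not None:
        dtype = getattr(obj, "dtype", "?")
        lines.append(f"{pad}{type(obj).__name__} shape={tuple(shape)} dtype={dtype}")
    elif isinstance(obj, dict):
        lines.append(f"{pad}dict with {len(obj)} keys")
        for key in list(obj)[:max_items]:
            lines.append(f"{pad}  key {key!r}:")
            lines.extend(describe(obj[key], indent + 2, max_items))
    elif isinstance(obj, (list, tuple)):
        lines.append(f"{pad}{type(obj).__name__} of length {len(obj)}")
        for item in obj[:max_items]:  # only the first few items, to keep output short
            lines.extend(describe(item, indent + 1, max_items))
    else:
        # Truncate the repr so one huge value cannot flood the output.
        lines.append(f"{pad}{type(obj).__name__}: {obj!r:.60}")
    return lines

# Hypothetical sample object, round-tripped through pickle in memory.
sample = pickle.loads(pickle.dumps({"images": [[0, 1], [2, 3]], "label": 7}))
summary = describe(sample)
print("\n".join(summary))
```

For a real file, replace the in-memory sample with `pickle.load(f)` on a file opened in `'rb'` mode. Arrays are reported by shape and dtype rather than by content.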

Mass direct extraction of the MNIST idx3-ubyte.gz files to PNG

You can also easily download the official dataset files from http://yann.lecun.com/exdb/mnist/ and expand them to PNG using the script from https://github.com/myleott/mnist_png.
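
To give an idea of what expanding to PNG involves, here is a minimal sketch of an 8-bit grayscale PNG encoder using only the standard library. It is not the mnist_png script, whose interface is not reproduced here; the names `png_chunk` and `gray_png_bytes` are made up for this example.

```python
import struct
import zlib

def png_chunk(tag, payload):
    """Build one PNG chunk: length, tag, payload, CRC over tag + payload."""
    crc = zlib.crc32(tag + payload) & 0xFFFFFFFF
    return struct.pack(">I", len(payload)) + tag + payload + struct.pack(">I", crc)

def gray_png_bytes(pixels, width, height):
    """Encode width*height bytes (row-major, 0-255) as an 8-bit grayscale PNG."""
    if len(pixels) != width * height:
        raise ValueError("pixel count does not match width * height")
    # Each scanline is prefixed with filter type 0 (no filtering).
    raw = b"".join(
        b"\x00" + bytes(pixels[row * width:(row + 1) * width])
        for row in range(height)
    )
    # IHDR: width, height, bit depth 8, color type 0 (grayscale), then
    # compression, filter and interlace methods, all 0.
    header = struct.pack(">IIBBBBB", width, height, 8, 0, 0, 0, 0)
    return (
        b"\x89PNG\r\n\x1a\n"
        + png_chunk(b"IHDR", header)
        + png_chunk(b"IDAT", zlib.compress(raw))
        + png_chunk(b"IEND", b"")
    )

# Tiny 2x2 checkerboard as a stand-in for one digit.
png = gray_png_bytes(bytes([0, 255, 255, 0]), 2, 2)
```

For an MNIST digit you would pass the 784 pixel bytes of one image with `width=28` and `height=28`, then write the returned bytes to a file opened in `'wb'` mode.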

Related: How to put my dataset in a .pkl file in the exact format and data structure used in "mnist.pkl.gz"?

- Ciro Santilli OurBigBook.com

2

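You need to use the pickle module (and the gzip module as well if the file is compressed).

Note: both are already part of the Python standard library, so there is nothing new to install.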
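
A minimal sketch of how the two modules can be combined, assuming you do not know in advance whether the file is compressed. The helper name `load_pickle` is made up for this example; it detects gzip by the two magic bytes that start every gzip stream. The `encoding` argument is passed straight to `pickle.load`; files pickled under Python 2 may need `encoding='latin1'` when loaded in Python 3.

```python
import gzip
import os
import pickle
import tempfile

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream

def load_pickle(path, encoding="ASCII"):
    """Unpickle `path`, transparently handling gzip-compressed files."""
    with open(path, "rb") as f:
        is_gzip = f.read(2) == GZIP_MAGIC
    opener = gzip.open if is_gzip else open
    with opener(path, "rb") as f:
        # Only unpickle files you trust: pickle can execute arbitrary code.
        return pickle.load(f, encoding=encoding)

# Round-trip check with a throwaway object, stored plain and gzip-compressed.
sample = {"digits": [5, 0, 4]}
with tempfile.TemporaryDirectory() as tmp:
    plain = os.path.join(tmp, "sample.pkl")
    packed = os.path.join(tmp, "sample.pkl.gz")
    with open(plain, "wb") as f:
        pickle.dump(sample, f)
    with gzip.open(packed, "wb") as f:
        pickle.dump(sample, f)
    from_plain = load_pickle(plain)
    from_packed = load_pickle(packed)
```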

- crabman84

2

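If you want to work with the original MNIST files, here is how you can deserialize them.

If you haven't downloaded the files yet, run the following in a terminal: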

wget http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
wget http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
wget http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz
wget http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz

Then save the following as deserialize.py and run it.

import numpy as np
import gzip

IMG_DIM = 28

def decode_image_file(fname):
    n_bytes_per_img = IMG_DIM*IMG_DIM

    with gzip.open(fname, 'rb') as f:
        bytes_ = f.read()
        # Skip the 16-byte IDX header (magic number, image count, rows, columns).
        data = bytes_[16:]

        if len(data) % n_bytes_per_img != 0:
            raise Exception('Something wrong with the file')

        # One row per image; count rows from the pixel data, not the whole file.
        result = np.frombuffer(data, dtype=np.uint8).reshape(
            len(data)//n_bytes_per_img, n_bytes_per_img)

    return result

def decode_label_file(fname):
    with gzip.open(fname, 'rb') as f:
        bytes_ = f.read()
        # Skip the 8-byte IDX header (magic number, label count).
        data = bytes_[8:]

        result = np.frombuffer(data, dtype=np.uint8)

    return result

train_images = decode_image_file('train-images-idx3-ubyte.gz')
train_labels = decode_label_file('train-labels-idx1-ubyte.gz')

test_images = decode_image_file('t10k-images-idx3-ubyte.gz')
test_labels = decode_label_file('t10k-labels-idx1-ubyte.gz')

The script doesn't normalize the pixel values like the pickled file does. To do that, all you have to do is:

train_images = train_images/255
test_images = test_images/255
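
The fixed 16-byte and 8-byte offsets used in deserialize.py come from the IDX header: two zero bytes, one byte for the data type (0x08 for unsigned bytes), one byte for the number of dimensions, then one 4-byte big-endian integer per dimension. As a minimal sketch, the header can be parsed instead of hard-coded. The function name `parse_idx_header` and the synthetic buffers below are made up for this example.

```python
import struct

def parse_idx_header(buf):
    """Return (dims, data_offset) for an IDX buffer holding unsigned bytes."""
    zero, dtype_code, ndim = struct.unpack(">HBB", buf[:4])
    if zero != 0 or dtype_code != 0x08:
        raise ValueError("not an IDX file of unsigned bytes")
    # One big-endian 32-bit size per dimension follows the 4-byte magic number.
    dims = struct.unpack(">" + "I" * ndim, buf[4:4 + 4 * ndim])
    return dims, 4 + 4 * ndim

# Synthetic buffers shaped like tiny image and label files (2 items each).
image_buf = struct.pack(">HBBIII", 0, 0x08, 3, 2, 28, 28) + bytes(2 * 28 * 28)
label_buf = struct.pack(">HBBI", 0, 0x08, 1, 2) + bytes(2)

image_dims, image_offset = parse_idx_header(image_buf)
label_dims, label_offset = parse_idx_header(label_buf)
```

With a real file, `buf` would be the bytes returned by `f.read()` inside `gzip.open(fname, 'rb')`, and the pixel or label data starts at the returned offset.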

- osolmaz

- Peque · Accepted Answer

Generally

Your pkl file is, in fact, a serialized pickle file, which means it has been dumped using Python's pickle module.

To un-pickle the data you can:

import pickle


with open('serialized.pkl', 'rb') as f:
    data = pickle.load(f)

MNIST dataset

Note that gzip is only needed if the file is compressed:

import gzip
import pickle


with gzip.open('mnist.pkl.gz', 'rb') as f:
    train_set, valid_set, test_set = pickle.load(f)

Each set can be further divided (for example, the training set):

train_x, train_y = train_set

Those would be the inputs (digits) and outputs (labels) of your sets.

If you want to display the digits:

import matplotlib.cm as cm
import matplotlib.pyplot as plt


plt.imshow(train_x[0].reshape((28, 28)), cmap=cm.Greys_r)
plt.show()

[image: mnist_digit]
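
Since the question asks about viewing every digit, here is a minimal sketch that arranges many flattened images into one grid array, so a single imshow call can display them together. The helper name `tile_digits` is made up for this example, and the synthetic array stands in for a slice of `train_x`.

```python
import numpy as np

def tile_digits(images, cols=8, side=28):
    """Arrange flattened side*side images into one 2-D grid array."""
    images = np.asarray(images)
    rows = -(-len(images) // cols)  # ceiling division
    grid = np.zeros((rows * side, cols * side), dtype=images.dtype)
    for i, img in enumerate(images):
        r, c = divmod(i, cols)
        grid[r * side:(r + 1) * side, c * side:(c + 1) * side] = img.reshape(side, side)
    return grid

# Synthetic stand-in for train_x[:10]: image i is filled with the value i.
fake = np.repeat(np.arange(10, dtype=np.float32)[:, None], 28 * 28, axis=1)
grid = tile_digits(fake, cols=4)
```

With the real data, something like `plt.imshow(tile_digits(train_x[:64]), cmap=cm.Greys_r)` followed by `plt.show()` would display 64 digits at once. Unused cells in the last row stay at zero.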

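The other alternative would be to look at the original data:

http://yann.lecun.com/exdb/mnist/

But that will be harder, as you will need to write a program to read the binary data in those files. So I recommend using Python and loading the data with pickle. As you have seen, it is very easy. ;-)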