PyTorch: How does pin_memory work in DataLoader?

Question

I want to understand how pin_memory works in DataLoader.

According to the documentation:

pin_memory (bool, optional) – If True, the data loader will copy tensors into CUDA pinned memory before returning them.

Here is a self-contained code example:

import torchvision
import torch

print('torch.cuda.is_available()', torch.cuda.is_available())
train_dataset = torchvision.datasets.CIFAR10(root='cifar10_pytorch', download=True, transform=torchvision.transforms.ToTensor())
train_dataloader = torch.utils.data.DataLoader(train_dataset, batch_size=64, pin_memory=True)
x, y = next(iter(train_dataloader))
print('x.device', x.device)
print('y.device', y.device)

It produces the following output:

torch.cuda.is_available() True
x.device cpu
y.device cpu

But I expected something like this, because I set the flag pin_memory=True in the DataLoader:

torch.cuda.is_available() True
x.device cuda:0
y.device cuda:0

I also ran some benchmarks:

import torchvision
import torch
import time
import numpy as np

pin_memory=True
train_dataset = torchvision.datasets.CIFAR10(root='cifar10_pytorch', download=True, transform=torchvision.transforms.ToTensor())
train_dataloader = torch.utils.data.DataLoader(train_dataset, batch_size=64, pin_memory=pin_memory)
print('pin_memory:', pin_memory)
times = []
n_runs = 10
for i in range(n_runs):
    st = time.time()
    for bx, by in train_dataloader:
        bx, by = bx.cuda(), by.cuda()
    times.append(time.time() - st)
print('average time:', np.mean(times))

I got the following results:

pin_memory: False
average time: 6.5701503753662

pin_memory: True
average time: 7.0254474401474

pin_memory=True only makes things slower. Can someone explain this behavior?

- Ivan Belonogov

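I have edited my answer to address your benchmark. Next time please leave a comment, since I noticed only by chance that your question had been updated. - Jatentaki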

1 Answer

- Jatentaki · Accepted Answer

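The documentation is perhaps overly terse, given that the terms it uses are fairly specialized. In CUDA terminology, pinned memory is not GPU memory but non-paged (page-locked) CPU memory. The original answer links to a fuller account of the benefits and rationale, but the gist is that this flag lets the x.cuda() operation (which you still have to execute as usual) avoid one implicit CPU-to-CPU copy, which makes it somewhat faster. In addition, with pinned-memory tensors you can use x.cuda(non_blocking=True) to perform the copy asynchronously with respect to the host. This can improve performance in certain scenarios, namely if your code is structured as: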

1. x.cuda(non_blocking=True)
2. Perform some CPU operations.
3. Perform GPU operations using x.

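Since the copy initiated in 1. is asynchronous, it does not block 2. from proceeding while the copy is underway, so the two can happen in parallel (which is the gain). Since step 3. requires x to already be on the GPU, it cannot run until 1. is complete. Therefore only 1. and 2. can overlap, and 3. will always take place afterwards. The most time non_blocking=True can save is thus the duration of 2. Without non_blocking=True, the CPU would sit idle waiting for the transfer to complete before proceeding with 2.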
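The three steps above can be sketched roughly as follows. This is only an illustration under a few assumptions: the function name overlapped_step is made up, the explicit pin_memory() call stands in for what a DataLoader with pin_memory=True would already have done, a short sleep stands in for the CPU work, and a multiplication stands in for the GPU work. It falls back to plain CPU execution when no GPU is present.

```python
import time
import torch

def overlapped_step(x_cpu: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper: copy / CPU work / GPU work on one batch."""
    if not torch.cuda.is_available():
        # No GPU: nothing to overlap, so just do the work on the CPU.
        time.sleep(0.01)
        return x_cpu * 2

    # Pin the source tensor (a DataLoader with pin_memory=True returns
    # batches that are already pinned, so this call would be unnecessary).
    x_pinned = x_cpu.pin_memory()

    # 1. Queue the host-to-device copy without blocking the host.
    x_gpu = x_pinned.cuda(non_blocking=True)

    # 2. CPU work that does not need x on the GPU (stand-in: a short sleep).
    time.sleep(0.01)

    # 3. GPU work on x. It is queued on the same CUDA stream as the copy,
    #    so it only runs once the copy has finished.
    return x_gpu * 2

result = overlapped_step(torch.ones(4, 3))
print(result.device, result.sum().item())
```

No explicit synchronization is needed between steps 1 and 3 here, because both are issued on the current stream and therefore execute in order.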

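Note: perhaps step 2. could also include GPU operations, as long as they do not require x. I am not sure whether that is true, so please don't quote me on it.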
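Coming back to the original question of why x.device prints cpu: a small check along these lines may make it concrete. It is a sketch, not the asker's setup. The tiny TensorDataset is a stand-in for CIFAR10 so that nothing has to be downloaded, and use_pin and pinned are illustrative names. Pinning is only requested when a GPU is available.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Tiny in-memory dataset (hypothetical stand-in for CIFAR10).
dataset = TensorDataset(torch.randn(8, 3), torch.arange(8))

# Only request pinning when a GPU is present.
use_pin = torch.cuda.is_available()
loader = DataLoader(dataset, batch_size=4, pin_memory=use_pin)

x, y = next(iter(loader))

# Pinning never moves data to the GPU, so the batch stays on the CPU.
print('x.device   ', x.device)

# is_pinned() reports whether the tensor lives in page-locked host memory.
pinned = x.is_pinned() if use_pin else False
print('x.is_pinned', pinned)

if use_pin:
    # The explicit transfer is still your job.
    x_gpu = x.cuda(non_blocking=True)
    print('x_gpu.device', x_gpu.device)
```

On a machine with a GPU, the expectation is that x.device is still cpu while is_pinned() reports True, which is the distinction the documentation sentence glosses over.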

Edit: I think you are missing the point with your benchmark. It has three issues.

A benchmark closer to the way pin_memory is intended to be used would be:

import torchvision, torch, time
import numpy as np

pin_memory = True
batch_size = 1024 # bigger memory transfers to make their cost more noticeable
n_workers = 6 # parallel workers to free up the main thread and reduce data decoding overhead
train_dataset = torchvision.datasets.CIFAR10(
    root='cifar10_pytorch',
    download=True,
    transform=torchvision.transforms.ToTensor()
)
train_dataloader = torch.utils.data.DataLoader(
    train_dataset,
    batch_size=batch_size,
    pin_memory=pin_memory,
    num_workers=n_workers
)
print('pin_memory:', pin_memory)
times = []
n_runs = 10

def work():
    # emulates the CPU work done
    time.sleep(0.1)

for i in range(n_runs):
    st = time.time()
    for bx, by in train_dataloader:
        bx, by = bx.cuda(non_blocking=pin_memory), by.cuda(non_blocking=pin_memory)
        work()
    times.append(time.time() - st)
print('average time:', np.mean(times))

On my machine this takes 5.48 seconds on average with memory pinning and 5.72 seconds without.
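If you want to look at the transfer cost in isolation, something like the following sketch might help. It is not part of the benchmark above and its numbers are not comparable to the ones quoted. The helper name time_h2d_copy and the timings dict are made up for illustration, and the tensor shape simply mirrors one CIFAR10 batch of 1024 images. Because a non_blocking copy returns before the transfer has finished, the helper calls torch.cuda.synchronize() before stopping the clock.

```python
import time
import torch

def time_h2d_copy(src: torch.Tensor, non_blocking: bool, n_iters: int = 20) -> float:
    """Hypothetical helper: average seconds per host-to-device copy of src."""
    torch.cuda.synchronize()          # start from an idle GPU
    start = time.perf_counter()
    for _ in range(n_iters):
        src.cuda(non_blocking=non_blocking)
    torch.cuda.synchronize()          # wait until every queued copy has finished
    return (time.perf_counter() - start) / n_iters

timings = {}
if torch.cuda.is_available():
    pageable = torch.randn(1024, 3, 32, 32)   # ordinary (pageable) host memory
    pinned = pageable.pin_memory()            # page-locked copy of the same data
    timings['pageable'] = time_h2d_copy(pageable, non_blocking=False)
    timings['pinned'] = time_h2d_copy(pinned, non_blocking=True)
    for name, seconds in timings.items():
        print(f'{name}: {seconds * 1e3:.3f} ms per copy')
else:
    print('CUDA not available; nothing to measure')
```

How large the difference is will depend on the hardware, so treat any result as specific to the machine it was measured on.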