Forward pass becomes 10000x slower after a number of iterations

Question

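I implemented a simple deconvolutional network similar to the one in the official PyTorch DCGAN tutorial. A tensor of zeros is passed to the network repeatedly. After a number of iterations, each pass becomes dramatically slower. I would like to know what causes this and how to fix it.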

Code:

import torch
import torch.nn as nn
import time

# JUST TO MEASURE TIME
class Timer:
    def __init__(self, msg):
        self.msg = msg

    def __enter__(self):
        self.start = time.process_time()
        return self

    def __exit__(self, *args):
        self.end = time.process_time()
        self.interval = self.end - self.start

        print('{}: {:.5f}'.format(self.msg, self.interval))

device = torch.device("cuda")

ngf, nc, nz, batchSize = 64, 1, 6, 1<<16
class Generator(nn.Module):
    def __init__(self):
        super(Generator, self).__init__()
        self.main = nn.Sequential(
            # input is Z, going into a convolution
            nn.ConvTranspose2d( nz, ngf * 4, 4, 1, 0, bias=False),
            nn.BatchNorm2d(ngf * 4),
            nn.ReLU(True),
            # state size. (ngf*4) x 4 x 4
            nn.ConvTranspose2d(ngf * 4, ngf * 2, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ngf * 2),
            nn.ReLU(True),
            # state size. (ngf*2) x 8 x 8
            nn.ConvTranspose2d( ngf * 2, ngf, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ngf),
            nn.ReLU(True),
            # state size. (ngf) x 16 x 16
            nn.ConvTranspose2d( ngf, nc, 4, 2, 1, bias=False),
            nn.Tanh()
            # state size. (nc) x 32 x 32
        )

    def forward(self, input):
        return self.main(input)

# Create the generator
netG = Generator().to(device)

def weights_init(m):
    classname = m.__class__.__name__
    if classname.find('Conv') != -1:
        nn.init.normal_(m.weight.data, 0.0, 0.02)
    elif classname.find('BatchNorm') != -1:
        nn.init.normal_(m.weight.data, 1.0, 0.02)
        nn.init.constant_(m.bias.data, 0)

netG.apply(weights_init)

# torch.backends.cudnn.benchmark=True

while True:
    with Timer('Time elapsed'):
        with torch.no_grad():
            netG(torch.zeros([batchSize, nz, 1, 1], device=device))

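Results:

Time elapsed: 0.02309
Time elapsed: 0.00072
Time elapsed: 0.00208
Time elapsed: 0.00128
Time elapsed: 0.00119
Time elapsed: 0.00153
Time elapsed: 0.00176
Time elapsed: 0.00170
Time elapsed: 0.00185
Time elapsed: 0.00188
Time elapsed: 0.00191
Time elapsed: 0.00190
Time elapsed: 0.00171
Time elapsed: 0.00176
Time elapsed: 0.00167
Time elapsed: 0.00120
Time elapsed: 0.00168
Time elapsed: 0.00169
Time elapsed: 0.00166
Time elapsed: 0.00167
Time elapsed: 0.00171
Time elapsed: 0.00168
Time elapsed: 0.00168
Time elapsed: 0.00168
Time elapsed: 0.00169
Time elapsed: 0.00177
Time elapsed: 0.00173
Time elapsed: 0.00176
Time elapsed: 0.00173
Time elapsed: 0.00171
Time elapsed: 0.00168
Time elapsed: 0.00173
Time elapsed: 0.00168
Time elapsed: 0.00178
Time elapsed: 0.00169
Time elapsed: 0.00171
Time elapsed: 0.00168
Time elapsed: 0.00169
Time elapsed: 0.00169
Time elapsed: 0.00173
Time elapsed: 0.00154
Time elapsed: 0.00170
Time elapsed: 0.00167
Time elapsed: 0.00224
Time elapsed: 0.00117
Time elapsed: 0.00175
Time elapsed: 0.00168
Time elapsed: 0.00173
Time elapsed: 0.00169
Time elapsed: 12.62760
Time elapsed: 12.71425
Time elapsed: 12.71379
Time elapsed: 12.71846
Time elapsed: 12.71909
Time elapsed: 12.71898
Time elapsed: 12.72288
Time elapsed: 12.72157
Time elapsed: 12.72226
Time elapsed: 12.72456
Time elapsed: 12.72350
Time elapsed: 12.72480
Time elapsed: 12.72644
Time elapsed: 12.72337
Time elapsed: 12.72424
Time elapsed: 12.72538
Time elapsed: 12.72533
Time elapsed: 12.72510
Time elapsed: 12.72507
Time elapsed: 12.72806
Time elapsed: 12.72865
Time elapsed: 12.72764
Time elapsed: 12.72431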

My GPU: Titan RTX
PyTorch version: 1.4
Python version: 3.7

- Arash Vahabpour

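Is this Python 2.7? Your code may be holding on to variables that would normally be freed in the background, so eventually very little memory is left. - Kenan

This is Python 3. If your guess is right, how should I fix it? - Arash Vahabpour

Odd, I haven't seen that super call in Python 3. I don't know PyTorch well, but this might help. - Kenan

I copied the module from the PyTorch tutorial, and super works without any problem in Python 3. - Arash Vahabpour

Sure, it isn't a problem, it's just rare in Python 3+. Did the link help? - Kenan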

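Just FYI, I can reproduce this on a TITAN RTX, although I'm not entirely sure why it happens. I think it is caused by PyTorch's asynchronous behavior, because if you capture the output with y = netG(... and then call torch.cuda.synchronize(), every iteration takes the same time (about 12 seconds). If you add torch.cuda.synchronize() without assigning the output of netG(..., it still doesn't wait, but I think that is because no variable is waiting to be updated. - jodag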
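The synchronization jodag describes can be sketched as a small timing helper. This is a minimal sketch, not code from the thread: `timed_forward` and the tiny stand-in model are hypothetical names introduced here, the model is far smaller than the generator in the question, and the CPU fallback exists only so the snippet runs on a machine without CUDA. It uses `time.perf_counter()` (wall-clock time) and calls `torch.cuda.synchronize()` before reading the clock, so the measurement covers the GPU work rather than just the kernel launches.

```python
import time

import torch
import torch.nn as nn

# Assumption: fall back to the CPU so the sketch still runs without CUDA.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


def timed_forward(model, x):
    """Return (output, wall-clock seconds) for one forward pass."""
    if device.type == "cuda":
        torch.cuda.synchronize()  # drain work queued by earlier iterations
    start = time.perf_counter()   # wall-clock time, not CPU time
    with torch.no_grad():
        y = model(x)
    if device.type == "cuda":
        torch.cuda.synchronize()  # wait until the kernels launched above finish
    return y, time.perf_counter() - start


# Hypothetical stand-in model, much smaller than the generator in the question.
model = nn.Sequential(
    nn.ConvTranspose2d(6, 8, 4, 1, 0, bias=False),
    nn.BatchNorm2d(8),
    nn.ReLU(True),
).to(device)

x = torch.zeros([16, 6, 1, 1], device=device)
for _ in range(3):
    y, seconds = timed_forward(model, x)
    print('Time elapsed: {:.5f}'.format(seconds))
```

With the synchronization in place, every iteration should report roughly the same duration, instead of a run of very small values followed by a jump.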

1 Answer

- Nopileos · Answer 1

I tried the same code on my Titan RTX and got exactly the same behavior.

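All GPU calls are asynchronous (as jodag pointed out in the comments) and are synchronized only when needed, that is, when a dependency exists. To test this, I changed the code slightly so that the network's output is actually used, which creates a dependency: the output is now needed before the next iteration can start.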

while True:
    with Timer('Time elapsed'):
        with torch.no_grad():
            output = netG(torch.zeros([batchSize, nz, 1, 1], device=device))
            print(output.mean())

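Now it always takes 12.8 seconds. So jodag was absolutely right. This has to do with the asynchronous calls to the GPU and how PyTorch handles everything internally.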
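The pattern in the question's timings (many tiny values, then a steady 12.7 seconds) is what one would expect if launches are queued up to some limit and the caller blocks once the queue is full. The thread does not confirm that this is what PyTorch does internally, so the following is only a pure-Python analogy under that assumption: a worker thread plays the role of the GPU, and `JOB_SECONDS` and `QUEUE_DEPTH` are hypothetical values chosen for illustration.

```python
import queue
import threading
import time

JOB_SECONDS = 0.05   # hypothetical duration of one "forward pass"
QUEUE_DEPTH = 4      # hypothetical limit on pending launches

pending = queue.Queue(maxsize=QUEUE_DEPTH)


def worker():
    # Plays the role of the GPU: runs queued jobs one at a time, in order.
    while True:
        job = pending.get()
        if job is None:          # sentinel: no more work
            break
        time.sleep(JOB_SECONDS)


gpu = threading.Thread(target=worker)
gpu.start()

launch_times = []
for _ in range(12):
    start = time.perf_counter()
    pending.put("forward pass")  # returns at once unless the queue is full
    launch_times.append(time.perf_counter() - start)

pending.put(None)
gpu.join()

for t in launch_times:
    print('Time elapsed: {:.5f}'.format(t))
```

The first few "launches" return almost immediately because they only enqueue work. Once the queue is full, each launch has to wait for a slot, so its measured time approaches the real duration of one job.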