TensorFlow crashes with CUDNN_STATUS_ALLOC_FAILED

Question


python python-3.x tensorflow neural-network

6

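I have been searching the web for hours without success, so I decided to ask here.

I am trying to build a self-driving car by following Sentdex's tutorial, but when I run the model I get a series of fatal errors. I have searched the internet extensively for a solution, and many people seem to have the same problem. However, none of the solutions I found (including this Stack post) work for me.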

Here is my software:

TensorFlow: 1.5, GPU version
CUDA: 9.0, with patches
cuDNN: 7
Windows 10 Pro
Python 3.6

Hardware:

Nvidia 1070 Ti, with the latest drivers
Intel i5 7600K

Here is the crash log:

2018-02-04 16:29:33.606903: E C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\stream_executor\cuda\cuda_blas.cc:444] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2018-02-04 16:29:33.608872: E C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\stream_executor\cuda\cuda_blas.cc:444] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2018-02-04 16:29:33.609308: E C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\stream_executor\cuda\cuda_blas.cc:444] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2018-02-04 16:29:35.145249: E C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\stream_executor\cuda\cuda_dnn.cc:385] could not create cudnn handle: CUDNN_STATUS_ALLOC_FAILED
2018-02-04 16:29:35.145563: E C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\stream_executor\cuda\cuda_dnn.cc:352] could not destroy cudnn handle: CUDNN_STATUS_BAD_PARAM
2018-02-04 16:29:35.149896: F C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\kernels\conv_ops.cc:717] Check failed: stream->parent()->GetConvolveAlgorithms( conv_parameters.ShouldIncludeWinogradNonfusedAlgo<T>(), &algorithms)

Here is my code:

    import tensorflow as tf
    import numpy as np
    import cv2
    import time
    from PIL import ImageGrab
    from getkeys import key_check
    from alexnet import alexnet
    import os
    from sendKeys import PressKey, ReleaseKey, W,A,S,D,Sp

    import random

    WIDTH = 80
    HEIGHT = 60
    LR = 1e-3
    EPOCHS = 10
    MODEL_NAME = 'DiRT-AI-Driver-{}-{}-{}-epochs.model'.format(LR, 'alexnetv2', EPOCHS)

    def straight():
        PressKey(W)
        ReleaseKey(A)
        ReleaseKey(S)
        ReleaseKey(D)
        ReleaseKey(Sp)
    def left():
        PressKey(A)
        ReleaseKey(W)
        ReleaseKey(S)
        ReleaseKey(D)
        ReleaseKey(Sp)
    def right():
        PressKey(D)
        ReleaseKey(A)
        ReleaseKey(S)
        ReleaseKey(W)
        ReleaseKey(Sp)
    def brake():
        PressKey(S)
        ReleaseKey(A)
        ReleaseKey(W)
        ReleaseKey(D)
        ReleaseKey(Sp)
    def handbrake():
        PressKey(Sp)
        ReleaseKey(A)
        ReleaseKey(S)
        ReleaseKey(D)
        ReleaseKey(W)

    model = alexnet(WIDTH, HEIGHT, LR)
    model.load(MODEL_NAME)


    def main():
        last_time = time.time()
        # Count down so there is time to switch to the game window.
        for i in list(range(4))[::-1]:
            print(i+1)
            time.sleep(1)

        paused = False
        while True:
            if not paused:
                # Grab the game window, convert it to grayscale and shrink it
                # to the input size the network expects.
                screen = np.array(ImageGrab.grab(bbox=(0,40,1024,768)))
                screen = cv2.cvtColor(screen,cv2.COLOR_BGR2GRAY)
                screen = cv2.resize(screen,(80,60))
                print('Loop took {} seconds'.format(time.time()-last_time))
                last_time = time.time()
                print('took time')
                prediction = model.predict([screen.reshape(WIDTH,HEIGHT,1)])[0]
                print('predicted')
                moves = list(np.around(prediction))
                print('got moves')
                print(moves,prediction)

                # Map the one-hot prediction to a key press.
                if moves == [1,0,0,0,0]:
                    straight()
                elif moves == [0,1,0,0,0]:
                    left()
                elif moves == [0,0,1,0,0]:
                    brake()
                elif moves == [0,0,0,1,0]:
                    right()
                elif moves == [0,0,0,0,1]:
                    handbrake()

            keys = key_check()

            # 'T' toggles pause; release every key when pausing.
            if 'T' in keys:
                if paused:
                    paused = False
                    time.sleep(1)
                else:
                    paused = True
                    ReleaseKey(W)
                    ReleaseKey(A)
                    ReleaseKey(S)
                    ReleaseKey(D)
                    ReleaseKey(Sp)
                    time.sleep(1)


    main()

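I found that the line that crashes Python and produces the first three errors is: prediction = model.predict([screen.reshape(WIDTH,HEIGHT,1)])[0]. When running the code, the CPU spikes to 100%, which suggests something is seriously wrong. GPU usage is around 40-50%. I tried TensorFlow 1.2 and 1.3, as well as CUDA 8, to no avail. When installing CUDA, I did not install the bundled drivers, because they are too old for my GPU. I also tried different cuDNN versions, but that did not help either.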

- Gnoske

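"When running the code, the CPU spikes to 100%, which suggests something is seriously wrong." High CPU load is perfectly acceptable, even when using a GPU. - Eli Korvigo

I have only ever seen the CPU jump from idle to 100% in an infinite loop, but if you say it is normal in this case, then it should be fine and is probably not part of the problem. - Gnoske

6 Answers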

8

Maybe your GPU is running out of memory.

If you are using TensorFlow 1.x:

Option 1) Set allow_growth to true.

import tensorflow as tf    
config = tf.ConfigProto()
config.gpu_options.allow_growth=True
sess = tf.Session(config=config)

Option 2) Set the memory allocation fraction.

# change the memory fraction as you want

import tensorflow as tf
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.3)
sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))
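
To choose a value for per_process_gpu_memory_fraction, it can help to work backwards from a memory budget. The following is only a minimal sketch: the helper name memory_fraction is hypothetical, and the 8192 MiB total is an assumption for an 8 GB card, so substitute the figure for your own GPU.

```python
def memory_fraction(budget_mib, total_mib):
    """Return the share of GPU memory that budget_mib represents."""
    if total_mib <= 0:
        raise ValueError("total_mib must be positive")
    # Clamp to [0, 1] so the result is always a usable fraction.
    return max(0.0, min(1.0, budget_mib / total_mib))


# Assumed numbers: a 2 GiB budget on a card with 8 GiB of memory.
fraction = memory_fraction(2048, 8192)
print(fraction)
```

The result could then be passed as per_process_gpu_memory_fraction in the snippet above.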

If you are using TensorFlow 2.x:

Option 1) Set set_memory_growth to true.

# Currently the ‘memory growth’ option should be the same for all GPUs.
# You should set the ‘memory growth’ option before initializing GPUs.

import tensorflow as tf
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
  try:
    for gpu in gpus:
      tf.config.experimental.set_memory_growth(gpu, True)
  except RuntimeError as e:
    print(e)

Option 2) Set memory_limit to the value you want. Just change the gpus index and the memory_limit value in the code below.

import tensorflow as tf
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
  try:
    tf.config.experimental.set_virtual_device_configuration(gpus[0], [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=1024)])
  except RuntimeError as e:
    print(e)

- starriet

Since there are several options here, I just want to be specific: with TF 2.4, the set_memory_growth option worked for me. - tonyd24601

2

Try setting the following, which may solve the problem: os.environ['TF_FORCE_GPU_ALLOW_GROWTH'] = 'true'. My environment is as follows:

Cudnn 7.6.5

Tensorflow 2.4

Cuda Toolkit 10.1

RTX 2060
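
As a rough sketch of where that line would go: the variable needs to be in place before TensorFlow initializes the GPU, and the simplest way to guarantee that is to set it before TensorFlow is imported at all. The placement shown here is an assumption about a typical script layout, not something stated in the answer.

```python
import os

# Set the variable first; the safest place is before TensorFlow is imported,
# so it is already present when the GPU is initialized.
os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"

# import tensorflow as tf  # import TensorFlow only after the variable is set
print(os.environ["TF_FORCE_GPU_ALLOW_GROWTH"])
```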

- Proxytype

1

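I had the same problem and later found that it was because I was also using the GPU for other tasks; something can be using the GPU even if it does not show up in Task Manager (Windows). It may be something like video rendering, video encoding, playing a demanding game, or mining. If you think something is still using a lot of GPU, close it and the problem should go away.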
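
One way to check how much GPU memory is already in use is to ask nvidia-smi, which ships with the NVIDIA driver. The sketch below assumes nvidia-smi is on the PATH; the helper names parse_gpu_memory and query_gpu_memory are hypothetical and exist only for illustration.

```python
import shutil
import subprocess


def parse_gpu_memory(csv_text):
    """Parse 'used, total' lines (in MiB) into a list of (used, total) tuples."""
    rows = []
    for line in csv_text.strip().splitlines():
        try:
            used, total = (int(part.strip()) for part in line.split(","))
        except ValueError:
            continue  # skip lines that are not two integers, e.g. "[N/A]"
        rows.append((used, total))
    return rows


def query_gpu_memory():
    """Return per-GPU (used_mib, total_mib), or None if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return None
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=memory.used,memory.total",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True,
        )
    except OSError:
        return None
    if result.returncode != 0:
        return None
    return parse_gpu_memory(result.stdout)


print(query_gpu_memory())
```

If the used figure is already high before the script starts, another process is probably holding GPU memory.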

- chickensoup

1

Try adding the CUDA paths to your environment variables. The problem seems to be related to CUDA.

Set the CUDA paths in ~/.bashrc (edit it with nano):

# CUDA NVIDIA paths
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/local/cuda/lib64"
export CUDA_HOME=/usr/local/cuda

- David Jimenez

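I removed everything related to CUDA, went into %PATH% and cleared out all the CUDA variables. After reinstalling, it finally works! The problem was that I had tried so many times before that there were too many paths, and they were probably conflicting with each other. - Gnoske

Well, it looks like I spoke too soon! It now works maybe 20% of the time. On the other runs I get the same crash. It is better, but still not working as expected! - Gnoske

Sorry, I forgot to mention that after editing .bashrc you may need to run $ source ~/.bashrc. Make sure each environment variable is declared only once. - David Jimenez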

1

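I ran into the same problem on Win10, so how do I add a new environment variable on Win10? - ShuangSong

I am also on Windows 10, and I followed the equivalent steps on Windows. To change or add environment variables, just search for "environment variables" and it will come up. However, this did not solve my problem. I removed all the duplicates and the problem persists. - Gnoske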
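
Duplicate or conflicting CUDA entries in PATH can be hard to spot by eye. The following is a small sketch for listing them; duplicate_entries and cuda_entries are hypothetical helper names, and matching on the substring "cuda" is only an assumption about how the install directories are named.

```python
import os


def duplicate_entries(path_value, sep=os.pathsep):
    """Return PATH entries that occur more than once (case-insensitive)."""
    seen = set()
    duplicates = []
    for entry in path_value.split(sep):
        # Normalize: ignore surrounding spaces, trailing slashes and case.
        key = entry.strip().rstrip("\\/").lower()
        if not key:
            continue
        if key in seen and key not in duplicates:
            duplicates.append(key)
        seen.add(key)
    return duplicates


def cuda_entries(path_value, sep=os.pathsep):
    """Return every PATH entry that mentions CUDA."""
    return [entry for entry in path_value.split(sep) if "cuda" in entry.lower()]


path_value = os.environ.get("PATH", "")
print(duplicate_entries(path_value))
print(cuda_entries(path_value))
```

Several CUDA versions showing up in the second list would be consistent with the conflicting-paths situation described above.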


1

I once had almost the same problem. Reinstalling tensorflow-gpu fixed it.

conda uninstall tensorflow-gpu
conda install tensorflow-gpu

I think pip should work as well.
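
After reinstalling, it may be worth confirming that TensorFlow can actually see the GPU. This is a hedged sketch: visible_gpus is a hypothetical helper, and it relies on tf.config.list_physical_devices (or its experimental variant in older releases), which is not available in early 1.x versions such as the 1.5 used in the question. In that case the function simply returns None.

```python
def visible_gpus():
    """Return the names of GPUs TensorFlow reports, or None if that cannot be checked."""
    try:
        import tensorflow as tf
    except ImportError:
        return None  # TensorFlow is not installed or failed to load
    config = getattr(tf, "config", None)
    # Newer releases expose list_physical_devices directly; some older ones
    # only have it under tf.config.experimental.
    lister = getattr(config, "list_physical_devices", None)
    if lister is None:
        experimental = getattr(config, "experimental", None)
        lister = getattr(experimental, "list_physical_devices", None)
    if lister is None:
        return None
    return [device.name for device in lister("GPU")]


print(visible_gpus())
```

An empty list would suggest the GPU build or the CUDA libraries were not picked up.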

- Lucas Yokoy


- Axel Puig · Accepted Answer

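In my case, the problem was caused by another Python console that had already imported tensorflow and was still running. Closing it solved the problem. I am on Windows 10, and the main errors were:

failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED

could not create cudnn handle: CUDNN_STATUS_ALLOC_FAILED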