torch.manual_seed(seed) raises "RuntimeError: CUDA error: device-side assert triggered"

Question

4

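I got this error while using Google Colab. This is my code, and I can't find anything wrong with it. It worked a few hours ago but suddenly started failing, and I don't know why.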

import numpy as np  # needed for np.random.seed; presumably imported in an earlier cell
import torch
if torch.cuda.is_available():
    device = torch.device("cuda")
    print('There are %d GPU(s) available.' % torch.cuda.device_count())
    print('We will use the GPU:', torch.cuda.get_device_name(0))
else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")
seed=1
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)
torch.backends.cudnn.deterministic = True

The error is:

There are 1 GPU(s) available.
We will use the GPU: Tesla P100-PCIE-16GB
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-121-436d9d8bb120> in <module>()
      9 seed=1
     10 np.random.seed(seed)
---> 11 torch.manual_seed(seed)
     12 torch.cuda.manual_seed_all(seed)
     13 torch.backends.cudnn.deterministic = True

3 frames
/usr/local/lib/python3.7/dist-packages/torch/cuda/random.py in cb()
    109         for i in range(device_count()):
    110             default_generator = torch.cuda.default_generators[i]
--> 111             default_generator.manual_seed(seed)
    112 
    113     _lazy_call(cb, seed_all=True)

RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

Can anyone help me?

- Haorui He

2 Answers

1

Well, the accepted answer seems odd. The code shown has nothing to do with any tensor operations; it only initializes the random number generators.
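
One way to reconcile this observation with the traceback: as the error message itself says, CUDA kernel errors can be reported asynchronously, so an assert triggered by an earlier notebook cell can surface at a later, unrelated CUDA call such as the seeding line. Once a device-side assert has fired, later CUDA calls in the same process keep failing until the process (here, the Colab runtime) is restarted. Below is a minimal sketch of the debugging setup the message suggests. The helper name enable_blocking_launches is hypothetical, and the variable has to be set before the process first touches CUDA.

```python
import os


def enable_blocking_launches():
    """Ask CUDA to run kernels synchronously so errors surface at the failing call.

    Hypothetical helper: it must run before the first CUDA call in the process,
    for example in the first cell after restarting the runtime.
    """
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
    return os.environ["CUDA_LAUNCH_BLOCKING"]


enable_blocking_launches()
# Now import torch and rerun the cells one by one; the traceback should
# point at the operation that actually triggers the assert.
```

Synchronous launches are slower, so this setting is meant for debugging only.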

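In my experience, this error can appear when the environment has not been initialized properly, i.e. the cuDNN library might not be loaded or there is some other CUDA-related problem. In my case, the error was caused by a missing call to source the Conda script (source /net/software/v1/software/Miniconda3/4.9.2/etc/profile.d/conda.sh).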
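
A quick way to look for this kind of environment problem is to print what the process can actually see. The sketch below is only an illustration: cuda_env_report is a hypothetical helper, and the environment variables it inspects are an assumption about what is relevant on a given machine.

```python
import os
import shutil


def cuda_env_report():
    """Collect a few facts about the CUDA-related environment (hypothetical helper)."""
    report = {
        "CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES"),
        "LD_LIBRARY_PATH": os.environ.get("LD_LIBRARY_PATH"),
        "CONDA_PREFIX": os.environ.get("CONDA_PREFIX"),
        "nvidia_smi_on_path": shutil.which("nvidia-smi") is not None,
    }
    try:
        import torch

        report["torch_cuda_build"] = torch.version.cuda
        report["cuda_available"] = torch.cuda.is_available()
        report["cudnn_available"] = torch.backends.cudnn.is_available()
    except Exception as exc:  # torch is missing or fails to load its libraries
        report["torch_error"] = repr(exc)
    return report


for key, value in cuda_env_report().items():
    print(f"{key}: {value}")
```

If CONDA_PREFIX is empty or cudnn_available is False in an environment where they are expected, the setup script was probably not sourced.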

- Aleksander Pohl

- Carlo Longhi · Accepted Answer

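In my experience, this error can be caused by a mismatch between the labels in the target data and the number of classes in the model.

To fix it, you can try the following:

Make sure the labels in your target data start from 0. If your data has n classes, the target classes should be [0, 1, 2, ..., n-1].
Make sure the model you are using is configured to handle n classes.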
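
Both checks can be done on the CPU before any data reaches the GPU. The sketch below uses plain Python lists; check_labels and the sample labels are made up for illustration, and a label tensor could be passed in as targets.tolist().

```python
def check_labels(labels, num_classes):
    """Raise ValueError if any label falls outside [0, num_classes - 1]."""
    labels = [int(label) for label in labels]
    bad = sorted({label for label in labels if not 0 <= label < num_classes})
    if bad:
        raise ValueError(
            f"labels {bad} are outside the valid range [0, {num_classes - 1}]"
        )
    return True


# Labels 1..3 with a 3-class model: 3 is out of range, a typical off-by-one.
raw_labels = [1, 2, 3, 1]
shifted = [label - 1 for label in raw_labels]  # remap so labels start from 0
check_labels(shifted, num_classes=3)
```

Calling check_labels(raw_labels, num_classes=3) instead raises a ValueError that names the offending label, which is easier to read than a device-side assert.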