CUDA out of memory error when training with PyTorch

Question


python pytorch torch amazon-sagemaker torchvision

3

I am trying to train a PyTorch FLAIR model in AWS SageMaker. During training I get the following error:

RuntimeError: CUDA out of memory. Tried to allocate 84.00 MiB (GPU 0; 11.17 GiB total capacity; 9.29 GiB already allocated; 7.31 MiB free; 10.80 GiB reserved in total by PyTorch)

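I used the sagemaker.pytorch.estimator.PyTorch class for training.

I tried several instance types, from ml.m5 and g4dn to p3 (even one with 96 GB of memory). On ml.m5 I got a CPUmemoryIssue error, on g4dn a GPUMemoryIssue error, and on p3 mostly a GPUMemoryIssue error as well, because PyTorch used only one of the 12 GB GPUs (presumably out of 8 × 12 GB in total).

Even running it locally on a CPU-only machine did not work; the training could not complete and failed with the following error: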

RuntimeError: [enforce fail at ..\c10\core\CPUAllocator.cpp:72] data. DefaultCPUAllocator: not enough memory: you tried to allocate 67108864 bytes. Buy new RAM!

Model training script:

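    # Imports assumed for the Flair version in use (module paths may differ between releases)
    from torch.optim import Adam
    from flair.datasets import ClassificationCorpus
    from flair.embeddings import WordEmbeddings, FlairEmbeddings, DocumentLSTMEmbeddings
    from flair.models import TextClassifier
    from flair.trainers import ModelTrainer

    # data_folder is not shown in the post
    corpus = ClassificationCorpus(data_folder, test_file='../data/exports/val.csv', train_file='../data/exports/train.csv')
    print("finished loading corpus")

    # Stack GloVe word vectors with forward and backward Flair embeddings
    word_embeddings = [WordEmbeddings('glove'), FlairEmbeddings('news-forward-fast'), FlairEmbeddings('news-backward-fast')]

    # Document-level LSTM over the stacked word embeddings
    document_embeddings = DocumentLSTMEmbeddings(word_embeddings, hidden_size=512, reproject_words=True, reproject_words_dimension=256)

    classifier = TextClassifier(document_embeddings, label_dictionary=corpus.make_label_dictionary(), multi_label=False)

    trainer = ModelTrainer(classifier, corpus, optimizer=Adam)

    # embeddings_storage_mode="none" keeps computed embeddings out of memory between epochs
    trainer.train('../model_files', max_epochs=12, learning_rate=0.0001, train_with_dev=False, embeddings_storage_mode="none")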

P.S.: On a local GPU machine with a 4 GB GTX 1650 (DDR5), I trained the same architecture on a smaller dataset, and it was really fast.

- Desmond

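I think the key point is: "a smaller dataset". - Klaus D.

Yes, I thought so too, but the difference is only about 1,000 records, nothing more. - Desmond

No, the key point is "similar architecture". Small changes can have a huge impact. - Berriel

Sorry for misleading you; it is actually the same model architecture. The only difference is the dataset size, 4,000 records versus 5,000. My main point is that I believe the problem lies with SageMaker training; the same job would run locally on a decent GPU, but I don't have that kind of infrastructure. Could you help me resolve this so that the model can be trained in SageMaker? - Desmond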

2 Answers

1

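OK, after two straight days of debugging I finally found the root cause. As I understand it, Flair places no limit on sentence length; that is, it takes the length of the longest sentence as the maximum. That caused the problem, because in my case some documents contained 150,000 words, which is far too much for the embeddings to fit in memory; even a 16 GB GPU could not handle it.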
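To get a feel for why one very long document is enough to exhaust the GPU, here is a rough back-of-envelope sketch. It assumes the embeddings of a batch are held as one dense float32 tensor padded to the longest document; the helper name `padded_batch_bytes`, the batch size of 32, and the embedding width of 2,000 are all hypothetical numbers for illustration, not values taken from Flair.

```python
def padded_batch_bytes(batch_size, max_tokens, embedding_dim, bytes_per_value=4):
    """Bytes for one dense tensor of shape (batch_size, max_tokens, embedding_dim).

    Assumes every document in the batch is padded to max_tokens
    and each value is stored as float32 (4 bytes).
    """
    return batch_size * max_tokens * embedding_dim * bytes_per_value


if __name__ == "__main__":
    # Hypothetical setup: 32 documents per batch, 2,000-dimensional embeddings.
    for longest in (500, 10_000, 150_000):
        gib = padded_batch_bytes(32, longest, 2000) / 2**30
        print(f"longest document = {longest:>7} tokens -> {gib:.2f} GiB")
```

The estimate grows linearly with the longest document, so a single 150,000-word record dominates the cost no matter how short the other records are. Real usage would be higher still, since this ignores activations and gradients.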

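Solution: for content this long, you can take n words (10K in my case) from any part of the text (left, right, or middle) and truncate the rest, or, if there are only a few such records, simply leave them out of training.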
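A minimal sketch of that truncation step, assuming plain whitespace tokenization. The function names `truncate_tokens` and `drop_long_records` are made up for illustration and are not part of Flair; if the training file carries label prefixes, apply this to the text portion only.

```python
def truncate_tokens(text, max_tokens=10_000, keep="left"):
    """Keep at most max_tokens whitespace-separated words from text.

    keep selects which part survives: "left", "right", or "middle".
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    tokens = text.split()
    if len(tokens) <= max_tokens:
        return text  # already short enough, return unchanged
    if keep == "left":
        kept = tokens[:max_tokens]
    elif keep == "right":
        kept = tokens[-max_tokens:]
    elif keep == "middle":
        start = (len(tokens) - max_tokens) // 2
        kept = tokens[start:start + max_tokens]
    else:
        raise ValueError("keep must be 'left', 'right', or 'middle'")
    return " ".join(kept)


def drop_long_records(texts, max_tokens=10_000):
    """Alternative: discard records longer than max_tokens instead of truncating."""
    return [t for t in texts if len(t.split()) <= max_tokens]
```

Truncating keeps every record in the training set at the cost of losing part of the text; dropping is simpler but only makes sense when very few records exceed the limit.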

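I hope this helps you get your training running, as it did in my case.

P.S.: If you are following this thread and run into a similar issue, feel free to leave a comment so I can look into it and help you.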

- Desmond

Hey, I have a similar problem. Could I discuss it with you? - hidden layer

Sure, @hiddenlayer, you can email me at ranjandebnath.rd@gmail.com. - Desmond


- Ashwiniku918 · Accepted Answer

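This error occurs because your GPU has run out of memory. You can try the following:

Reduce the size of the training data
Reduce the size of the model, e.g. the number of hidden layers or the depth
Try reducing the batch size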
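For the batch-size suggestion, one common pattern is to retry with a smaller batch whenever the out-of-memory error shows up. The sketch below is only an illustration: `train_with_backoff` is a hypothetical helper, and `train_fn` stands for any callable that runs training with a given batch size (in Flair that would be the `mini_batch_size` argument of `trainer.train`).

```python
def train_with_backoff(train_fn, batch_size=32, min_batch_size=1):
    """Call train_fn(batch_size), halving the batch size after each OOM error.

    Returns (batch_size_that_worked, result_of_train_fn).
    Re-raises any error that is not an out-of-memory RuntimeError,
    and the OOM error itself once min_batch_size has also failed.
    """
    while True:
        try:
            return batch_size, train_fn(batch_size)
        except RuntimeError as err:
            if "out of memory" not in str(err) or batch_size <= min_batch_size:
                raise
            batch_size = max(min_batch_size, batch_size // 2)


if __name__ == "__main__":
    # Stand-in for a real training run: pretends anything above 8 does not fit.
    def fake_train(bs):
        if bs > 8:
            raise RuntimeError("CUDA out of memory.")
        return "ok"

    print(train_with_backoff(fake_train, batch_size=32))
```

With real PyTorch code it is usually worth releasing cached GPU memory between attempts. Note that a smaller batch does not help when a single record is too large on its own, which is the situation described in the other answer.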