RuntimeError: expected scalar type Half but found Float when fine-tuning opt-6.7B on an AWS P3 instance

4
I have a simple piece of code that takes an opt-6.7B model and fine-tunes it. When I run it on Google Colab (Tesla T4, 16GB) it works without any problem, but when I run the same code on an AWS p3.2xlarge instance (Tesla V100 GPU, 16GB) I get the error
RuntimeError: expected scalar type Half but found Float

To fit the fine-tuning on a single GPU I use LoRA and peft; both are installed the same way (pip install) in both environments. If I wrap the training in with torch.autocast("cuda"): the error goes away, but then the training loss becomes very strange: instead of decreasing gradually it jumps around over a wide range (0-5), and if I switch the model to GPT-J the loss is always 0, whereas on Colab the loss decreases steadily. So I am not sure whether using with torch.autocast("cuda"): is a good idea at all.
The transformers version is 4.28.0.dev0 in both cases. The torch version on Colab shows 1.13.1+cu116, while on the p3 it is just 1.13.1 (does that mean it was built without CUDA support? I doubt it, since torch.cuda.is_available() returns True there).
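
(A quick sanity check, not from the original post: the missing "+cu116" suffix on the p3 usually just means torch was installed from conda rather than a pip wheel, not that it is a CPU-only build. A minimal sketch to confirm what the install was actually built against:)

import torch

print(torch.__version__)          # e.g. 1.13.1+cu116 on Colab, 1.13.1 in the p3 conda env
print(torch.version.cuda)         # CUDA version the build was compiled against; None would mean a CPU-only build
print(torch.cuda.is_available())  # True on the p3, so the build does have CUDA support
print(torch.cuda.get_device_name(0))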
The only major difference I can see is that on Colab, bitsandbytes prints the following setup log:
===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching /usr/local/cuda/lib64...
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 7.5
CUDA SETUP: Detected CUDA version 118

For the p3 it is the following:

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
CUDA SETUP: CUDA runtime path found: /opt/conda/envs/pytorch/lib/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 7.0
CUDA SETUP: Detected CUDA version 117
CUDA SETUP: Loading binary /opt/conda/envs/pytorch/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda117_nocublaslt.so...

What am I missing? I am not posting the code here, but it really is a very basic setup that fine-tunes opt-6.7b on the alpaca dataset with LoRA and peft.

Why does it run on Colab but not on the p3? Any help is welcome :)

-------------------- EDIT

I am posting a minimal code example of what was actually tried.

import os
os.environ["CUDA_VISIBLE_DEVICES"]="0"
import torch
import torch.nn as nn
import bitsandbytes as bnb
from transformers import AutoTokenizer, AutoConfig, AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-6.7b", 
    load_in_8bit=True, 
    device_map='auto',
)

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-6.7b")
for param in model.parameters():
  param.requires_grad = False  # freeze the model - train adapters later
  if param.ndim == 1:
    # cast the small parameters (e.g. layernorm) to fp32 for stability
    param.data = param.data.to(torch.float32)

model.gradient_checkpointing_enable()  # reduce number of stored activations
model.enable_input_require_grads()

class CastOutputToFloat(nn.Sequential):
  def forward(self, x): return super().forward(x).to(torch.float32)
model.lm_head = CastOutputToFloat(model.lm_head)

def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

from peft import LoraConfig, get_peft_model 

config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, config)
print_trainable_parameters(model)

import transformers
from datasets import load_dataset

tokenizer.pad_token_id = 0
CUTOFF_LEN = 256

data = load_dataset("tatsu-lab/alpaca")

data = data.shuffle().map(
    lambda data_point: tokenizer(
        data_point['text'],
        truncation=True,
        max_length=CUTOFF_LEN,
        padding="max_length",
    ),
    batched=True
)
# data = load_dataset("Abirate/english_quotes")
# data = data.map(lambda samples: tokenizer(samples['quote']), batched=True)

trainer = transformers.Trainer(
    model=model, 
    train_dataset=data['train'],
    args=transformers.TrainingArguments(
        per_device_train_batch_size=4, 
        gradient_accumulation_steps=4,
        warmup_steps=100, 
        max_steps=400, 
        learning_rate=2e-5, 
        fp16=True,
        logging_steps=1, 
        output_dir='outputs'
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False)
)

model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
trainer.train()

And here is the full stack trace:

/tmp/ipykernel_24622/2601578793.py:2 in <module>                                                 │
│                                                                                                  │
│ [Errno 2] No such file or directory: '/tmp/ipykernel_24622/2601578793.py'                        │
│                                                                                                  │
│ /opt/conda/envs/pytorch/lib/python3.9/site-packages/transformers/trainer.py:1639 in train        │
│                                                                                                  │
│   1636 │   │   inner_training_loop = find_executable_batch_size(                                 │
│   1637 │   │   │   self._inner_training_loop, self._train_batch_size, args.auto_find_batch_size  │
│   1638 │   │   )                                                                                 │
│ ❱ 1639 │   │   return inner_training_loop(                                                       │
│   1640 │   │   │   args=args,                                                                    │
│   1641 │   │   │   resume_from_checkpoint=resume_from_checkpoint,                                │
│   1642 │   │   │   trial=trial,                                                                  │
│                                                                                                  │
│ /opt/conda/envs/pytorch/lib/python3.9/site-packages/transformers/trainer.py:1906 in              │
│ _inner_training_loop                                                                             │
│                                                                                                  │
│   1903 │   │   │   │   │   with model.no_sync():                                                 │
│   1904 │   │   │   │   │   │   tr_loss_step = self.training_step(model, inputs)                  │
│   1905 │   │   │   │   else:                                                                     │
│ ❱ 1906 │   │   │   │   │   tr_loss_step = self.training_step(model, inputs)                      │
│   1907 │   │   │   │                                                                             │
│   1908 │   │   │   │   if (                                                                      │
│   1909 │   │   │   │   │   args.logging_nan_inf_filter                                           │
│                                                                                                  │
│ /opt/conda/envs/pytorch/lib/python3.9/site-packages/transformers/trainer.py:2662 in              │
│ training_step                                                                                    │
│                                                                                                  │
│   2659 │   │   │   loss = loss / self.args.gradient_accumulation_steps                           │
│   2660 │   │                                                                                     │
│   2661 │   │   if self.do_grad_scaling:                                                          │
│ ❱ 2662 │   │   │   self.scaler.scale(loss).backward()                                            │
│   2663 │   │   elif self.use_apex:                                                               │
│   2664 │   │   │   with amp.scale_loss(loss, self.optimizer) as scaled_loss:                     │
│   2665 │   │   │   │   scaled_loss.backward()                                                    │
│                                                                                                  │
│ /opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/_tensor.py:488 in backward             │
│                                                                                                  │
│    485 │   │   │   │   create_graph=create_graph,                                                │
│    486 │   │   │   │   inputs=inputs,                                                            │
│    487 │   │   │   )                                                                             │
│ ❱  488 │   │   torch.autograd.backward(                                                          │
│    489 │   │   │   self, gradient, retain_graph, create_graph, inputs=inputs                     │
│    490 │   │   )                                                                                 │
│    491                                                                                           │
│                                                                                                  │
│ /opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/autograd/__init__.py:197 in backward   │
│                                                                                                  │
│   194 │   # The reason we repeat same the comment below is that                                  │
│   195 │   # some Python versions print out the first line of a multi-line function               │
│   196 │   # calls in the traceback and some print out the last line                              │
│ ❱ 197 │   Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the bac   │
│   198 │   │   tensors, grad_tensors_, retain_graph, create_graph, inputs,                        │
│   199 │   │   allow_unreachable=True, accumulate_grad=True)  # Calls into the C++ engine to ru   │
│   200                                                                                            │
│                                                                                                  │
│ /opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/autograd/function.py:267 in apply      │
│                                                                                                  │
│   264 │   │   │   │   │   │   │      "Function is not allowed. You should only implement one "   │
│   265 │   │   │   │   │   │   │      "of them.")                                                 │
│   266 │   │   user_fn = vjp_fn if vjp_fn is not Function.vjp else backward_fn                    │
│ ❱ 267 │   │   return user_fn(self, *args)                                                        │
│   268 │                                                                                          │
│   269 │   def apply_jvp(self, *args):                                                            │
│   270 │   │   # _forward_cls is defined by derived class                                         │
│                                                                                                  │
│ /opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/utils/checkpoint.py:157 in backward    │
│                                                                                                  │
│   154 │   │   │   raise RuntimeError(                                                            │
│   155 │   │   │   │   "none of output has requires_grad=True,"                                   │
│   156 │   │   │   │   " this checkpoint() is not necessary")                                     │
│ ❱ 157 │   │   torch.autograd.backward(outputs_with_grad, args_with_grad)                         │
│   158 │   │   grads = tuple(inp.grad if isinstance(inp, torch.Tensor) else None                  │
│   159 │   │   │   │   │     for inp in detached_inputs)                                          │
│   160                                                                                            │
│                                                                                                  │
│ /opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/autograd/__init__.py:197 in backward   │
│                                                                                                  │
│   194 │   # The reason we repeat same the comment below is that                                  │
│   195 │   # some Python versions print out the first line of a multi-line function               │
│   196 │   # calls in the traceback and some print out the last line                              │
│ ❱ 197 │   Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the bac   │
│   198 │   │   tensors, grad_tensors_, retain_graph, create_graph, inputs,                        │
│   199 │   │   allow_unreachable=True, accumulate_grad=True)  # Calls into the C++ engine to ru   │
│   200                                                                                            │
│                                                                                                  │
│ /opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/autograd/function.py:267 in apply      │
│                                                                                                  │
│   264 │   │   │   │   │   │   │      "Function is not allowed. You should only implement one "   │
│   265 │   │   │   │   │   │   │      "of them.")                                                 │
│   266 │   │   user_fn = vjp_fn if vjp_fn is not Function.vjp else backward_fn                    │
│ ❱ 267 │   │   return user_fn(self, *args)                                                        │
│   268 │                                                                                          │
│   269 │   def apply_jvp(self, *args):                                                            │
│   270 │   │   # _forward_cls is defined by derived class                                         │
│                                                                                                  │
│ /opt/conda/envs/pytorch/lib/python3.9/site-packages/bitsandbytes/autograd/_functions.py:456 in   │
│ backward                                                                                         │
│                                                                                                  │
│   453 │   │   │                                                                                  │
│   454 │   │   │   elif state.CB is not None:                                                     │
│   455 │   │   │   │   CB = state.CB.to(ctx.dtype_A, copy=True).mul_(state.SCB.unsqueeze(1).mul   │
│ ❱ 456 │   │   │   │   grad_A = torch.matmul(grad_output, CB).view(ctx.grad_shape).to(ctx.dtype   │
│   457 │   │   │   elif state.CxB is not None:                                                    │
│   458 │   │   │   │                                                                              │
│   459 │   │   │   │   if state.tile_indices is None:

(Apologies if this turns out to be a very basic question, but I don't have a solution at the moment :( )


It is hard to answer your question without seeing the code. The error message indicates that some of your code produces float32 tensors, while opt-6.7b is a model that uses float16. Can you provide the full error stack trace and a minimal reproducible example? - cronoik
@cronoik, thank you for the reply. I have posted a code example above. - SRC
For what it's worth, I believe a compute capability of at least 7.5 is needed for this code to work. I launched a T4 instance on GCP and installed everything from scratch (including CUDA): cudatoolkit via Anaconda, then bitsandbytes and all the other necessary components, and it worked again. For anyone interested, here is the list of compute capabilities - https://developer.nvidia.com/cuda-gpus - SRC
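
(A sketch based on the comment above and the setup logs in the question: the p3 log loads a "nocublaslt" bitsandbytes binary because the V100 reports compute capability 7.0, while the T4 reports 7.5. A hypothetical guard that checks this before turning on 8-bit loading:)

import torch

# Hypothetical check, assuming the 7.5 threshold mentioned in the comment above
major, minor = torch.cuda.get_device_capability(0)
print(f"compute capability: {major}.{minor}")   # (7, 0) on V100, (7, 5) on T4
use_int8 = (major, minor) >= (7, 5)
print(f"load_in_8bit recommended: {use_int8}")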
3 Answers

3

I ran into the same error. After some googling I finally fixed it by wrapping my training call in torch.autocast("cuda"), like this:

with torch.autocast("cuda"):
    trainer.train()

1
I am seeing ValueError: Attempting to unscale FP16 gradients. Any ideas? - Ariel Lubonja

1
It is probably a mixed-precision issue on the V100 GPU. You can try disabling fp16:
trainer = transformers.Trainer(
    model=model, 
    train_dataset=data['train'],
    args=transformers.TrainingArguments(
        per_device_train_batch_size=4, 
        gradient_accumulation_steps=4,
        warmup_steps=100, 
        max_steps=400, 
        learning_rate=2e-5, 
        fp16=False, # disable mixed precision
        logging_steps=1, 
        output_dir='outputs'
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False)
)

As noted in the comments below, this increases GPU memory usage because mixed precision is disabled.


This significantly increases the amount of GPU memory you need. - Ariel Lubonja
A higher GPU memory requirement is still better than not being able to run the training at all. - Barata Magnus
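
(A minimal sketch, not from the answer above, of one way to offset the extra memory once fp16 is disabled: shrink the per-device batch and raise gradient accumulation so the effective batch size stays at 4 * 4 = 16.)

args = transformers.TrainingArguments(
    per_device_train_batch_size=1,   # smaller micro-batch to lower peak memory
    gradient_accumulation_steps=16,  # 1 * 16 keeps the effective batch size at 16
    warmup_steps=100,
    max_steps=400,
    learning_rate=2e-5,
    fp16=False,                      # mixed precision stays off, as in the answer above
    logging_steps=1,
    output_dir='outputs',
)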

0
I have the same error as you. When I add with torch.autocast("cuda"): trainer.train(), the loss is 0. I suspect that bitsandbytes cannot support the V100 with load_in_8bit=True and fp16=True.
