CUDA Out of Memory Error in Stage 2 Fine-Tuning with A100 80GB
Hi,
I’m currently fine-tuning trace-uni on a downstream task and have successfully completed stage 1 (tuning the MLP adapters). However, in stage 2, where the backbone is unfrozen and tuned, I hit a CUDA out-of-memory error even on a single A100 80GB GPU, which is the largest GPU I have access to.
Below is my training script:
#!/bin/bash
###############################################################################
# Stage 2: Training LLM Backbone
###############################################################################
# Environment Variables
WORLD_SIZE=1        # single node
NPROC_PER_NODE=1    # single A100 80GB
MASTER_ADDR="127.0.0.1"
MASTER_PORT=16666
RANK=0
# Training Arguments
GLOBAL_BATCH_SIZE=1 # number of videos per optimizer step across all devices
GRADIENT_ACCUMULATION_STEPS=1
LOCAL_BATCH_SIZE=1
echo "Local Batch Size: ${LOCAL_BATCH_SIZE}"
# CUDA and Logging Arguments
# Set max_split_size_mb to help with fragmentation and avoid OOM errors.
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
export WANDB_PROJECT=trace_greatest_hits
export NCCL_P2P_LEVEL=NVL
export HCCL_BUFFSIZE=1024
RUN_NAME=greatest_hits_stage2
DATA_DIR=datasets
OUTP_DIR=/path/to/trace_finetune/stage2
# Optionally, if you have control of the train script, you can call the
# following between stages to release cached-but-unused blocks back to the driver
# (it does not free memory that is actually allocated by live tensors):
# import torch; torch.cuda.empty_cache()
# Note: Remove any trailing spaces after the backslashes below.
ASCEND_LAUNCH_BLOCKING=1 torchrun --nnodes "${WORLD_SIZE}" \
--nproc_per_node "${NPROC_PER_NODE}" \
--master_addr="${MASTER_ADDR}" \
--master_port="${MASTER_PORT}" \
--node_rank "${RANK}" \
/path/to/trace_module/train_mt.py \
--version v1_mistral \
--vision_tower model/clip-vit-large-patch14-336 \
--mm_projector_type spatial_slot \
--freeze_mm_mlp_adapter True \
--tune_mm_mlp_adapter False \
--tune_mm_embed_head False \
--tune_lm_embed_head False \
--model_name_or_path /path/to/trace_finetune/stage1/model \
--data_path /path/to/final_annotations.json \
--data_folder /path/to/videos \
--mm_vision_select_layer -2 \
--mm_use_im_start_end False \
--mm_use_im_patch_token False \
--downsample_num 1 \
--image_aspect_ratio pad \
--freeze_backbone False \
--num_frames 128 \
--bf16 True \
--fp16 False \
--output_dir "${OUTP_DIR}/model" \
--num_train_epochs 5 \
--per_device_train_batch_size "${LOCAL_BATCH_SIZE}" \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps "${GRADIENT_ACCUMULATION_STEPS}" \
--evaluation_strategy "no" \
--save_strategy "epoch" \
--save_steps 5000 \
--save_total_limit 99 \
--learning_rate 5e-6 \
--weight_decay 0.0 \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--model_max_length 4096 \
--gradient_checkpointing True \
--dataloader_num_workers 2 \
--run_name "${RUN_NAME}" \
--lazy_preprocess True \
--sample_scheme "rand" \
2> "${OUTP_DIR}/stage2.err"
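For context on the activation pressure, here is my rough count of vision tokens per sample with this config. I am assuming CLIP ViT-L/14 at 336px resolution (24×24 = 576 patches per frame) and counting tokens before the spatial_slot projector pools them, so the true post-projection count may be lower:

```python
# Rough vision-token count per video (assumptions: ViT-L/14, 336px input,
# tokens counted before any pooling by the spatial_slot projector).
frames = 128                          # --num_frames 128
patches_per_frame = (336 // 14) ** 2  # 24 x 24 = 576 patches per frame
tokens = frames * patches_per_frame
print(tokens)  # 73728 vision tokens per sample before projection
```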
Despite configuring the environment with PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128, I still get the following error during execution:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 32.00 MiB (GPU 0; 79.14 GiB total capacity; 77.72 GiB already allocated; 21.62 MiB free; 78.46 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
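My back-of-envelope estimate suggests the model states alone may not fit. This assumes the v1_mistral backbone is roughly 7B parameters and that the trainer keeps bf16 weights and gradients plus fp32 AdamW states (master weights, momentum, variance), i.e. about 16 bytes per parameter; both assumptions are mine, not something I have verified against the trace code:

```python
# Back-of-envelope estimate of model-state memory for full fine-tuning.
# Assumptions (mine): ~7B-parameter backbone, bf16 weights + bf16 grads,
# fp32 AdamW states (master copy + momentum + variance).
params = 7e9
bytes_per_param = 2 + 2 + 4 + 4 + 4  # weights, grads, master, m, v
total_gib = params * bytes_per_param / 2**30
print(f"~{total_gib:.0f} GiB before activations")  # ~104 GiB before activations
```

If this estimate is roughly right, the states alone exceed 80 GiB before any activations, which would explain why max_split_size_mb does not help.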
Could you please advise on how to resolve this? Are there specific modifications or additional environment settings that would help avoid OOM errors in stage 2, given that the backbone is unfrozen in this configuration?
Thanks for your help and support!