CUDA Out of Memory Error in Stage 2 Fine-Tuning with A100 80GB

#3
by HassanRaza17 - opened

Hi,

I’m currently working on fine-tuning trace-uni on a downstream task and have successfully completed stage 1 (tuning the MLP adapters). However, during stage 2, where I try to tune the backbone, I’m encountering a CUDA out of memory error even though I’m running on a single A100 80GB GPU, the largest-memory GPU available to me.
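For context, here is my rough memory accounting. This is only a back-of-envelope sketch under assumptions I'm not sure of: I'm guessing the backbone is roughly Mistral-7B-sized (~7.2B parameters, since the script passes --version v1_mistral), and that gradients and AdamW optimizer states sit alongside the bf16 weights — the exact layout depends on how the trainer handles mixed precision:

```python
# Back-of-envelope memory budget for full fine-tuning of a ~7B backbone.
# Assumptions (illustrative, not measured): bf16 weights and gradients,
# AdamW with two fp32 states (exp_avg, exp_avg_sq) per parameter.
params = 7.2e9  # approximate parameter count of a Mistral-7B-class backbone

weights_gib = params * 2 / 2**30  # bf16: 2 bytes per parameter
grads_gib   = params * 2 / 2**30  # bf16 gradients: 2 bytes per parameter
optim_gib   = params * 8 / 2**30  # two fp32 AdamW states: 4 + 4 bytes

total_gib = weights_gib + grads_gib + optim_gib
print(f"weights {weights_gib:.1f} GiB + grads {grads_gib:.1f} GiB "
      f"+ optimizer {optim_gib:.1f} GiB = {total_gib:.1f} GiB")
# ~80.5 GiB before any activations, KV cache, or vision features
```

If this accounting is even roughly right, the static training state alone already exceeds the 80 GB card, which would explain why the OOM happens regardless of batch size 1.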

Below is my training script:

#!/bin/bash 

############################################################################### 
# Stage 2: Training LLM Backbone
############################################################################### 

# Environment Variables 
WORLD_SIZE=1            # number of nodes
NPROC_PER_NODE=1        # processes (GPUs) per node; single A100

MASTER_ADDR="127.0.0.1" 
MASTER_PORT=16666 
RANK=0 

# Training Arguments 
GLOBAL_BATCH_SIZE=1  # videos per optimizer step across all GPUs 
GRADIENT_ACCUMULATION_STEPS=1 
LOCAL_BATCH_SIZE=1 
echo "Local Batch Size: ${LOCAL_BATCH_SIZE}" 

# CUDA and Logging Arguments 
# Set max_split_size_mb to help with fragmentation and avoid OOM errors. 
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128 
export WANDB_PROJECT=trace_greatest_hits 
export NCCL_P2P_LEVEL=NVL 
export HCCL_BUFFSIZE=1024 

RUN_NAME=greatest_hits_stage2 
DATA_DIR=datasets 
OUTP_DIR=/path/to/trace_finetune/stage2 

# Optionally, if you have control of the train script, add the following line 
# at the very start of your Python code to clear cached memory: 
#    import torch; torch.cuda.empty_cache() 

# Note: Remove any trailing spaces after the backslashes below. 
ASCEND_LAUNCH_BLOCKING=1 torchrun --nnodes "${WORLD_SIZE}" \ 
    --nproc_per_node "${NPROC_PER_NODE}" \ 
    --master_addr="${MASTER_ADDR}" \ 
    --master_port="${MASTER_PORT}" \ 
    --node_rank "${RANK}" \ 
    /path/to/trace_module/train_mt.py \ 
    --version v1_mistral \ 
    --vision_tower model/clip-vit-large-patch14-336 \ 
    --mm_projector_type spatial_slot \ 
    --freeze_mm_mlp_adapter True \ 
    --tune_mm_mlp_adapter False \ 
    --tune_mm_embed_head False \ 
    --tune_lm_embed_head False \ 
    --model_name_or_path /path/to/trace_finetune/stage1/model \ 
    --data_path /path/to/final_annotations.json \ 
    --data_folder /path/to/videos \ 
    --mm_vision_select_layer -2 \ 
    --mm_use_im_start_end False \ 
    --mm_use_im_patch_token False \ 
    --downsample_num 1 \ 
    --image_aspect_ratio pad \ 
    --freeze_backbone False \ 
    --num_frames 128 \ 
    --bf16 True \ 
    --fp16 False \ 
    --output_dir "${OUTP_DIR}/model" \ 
    --num_train_epochs 5 \ 
    --per_device_train_batch_size "${LOCAL_BATCH_SIZE}" \ 
    --per_device_eval_batch_size 1 \ 
    --gradient_accumulation_steps "${GRADIENT_ACCUMULATION_STEPS}" \ 
    --evaluation_strategy "no" \ 
    --save_strategy "epoch" \ 
    --save_steps 5000 \ 
    --save_total_limit 99 \ 
    --learning_rate 5e-6 \ 
    --weight_decay 0.0 \ 
    --warmup_ratio 0.03 \ 
    --lr_scheduler_type "cosine" \ 
    --logging_steps 1 \ 
    --model_max_length 4096 \ 
    --gradient_checkpointing True \ 
    --dataloader_num_workers 2 \ 
    --run_name "${RUN_NAME}" \ 
    --lazy_preprocess True \ 
    --sample_scheme "rand" \ 
    2> "${OUTP_DIR}/stage2.err"
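For completeness, I'm aware the caching allocator supports other options besides max_split_size_mb (hedged: availability depends on the installed PyTorch version; expandable_segments needs a fairly recent build), e.g.:

```shell
# Alternative allocator settings I could try (PyTorch-version dependent):
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
# Options can also be combined with commas:
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128,garbage_collection_threshold:0.8
```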

Despite configuring the environment with PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128, I still get the following error during execution:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 32.00 MiB (GPU 0; 79.14 GiB total capacity; 77.72 GiB already allocated; 21.62 MiB free; 78.46 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
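Reading the numbers in that message, the gap between reserved and allocated memory seems quite small, which makes me suspect this is genuine memory exhaustion rather than fragmentation (so tuning max_split_size_mb may not be able to help much):

```python
# Interpret the allocator figures from the OOM message above.
total_gib     = 79.14  # GPU capacity reported by PyTorch
allocated_gib = 77.72  # memory held by live tensors
reserved_gib  = 78.46  # memory held by the caching allocator

# At best, fragmentation tuning can reclaim the reserved-but-unallocated slack.
fragmentation_gib = reserved_gib - allocated_gib
headroom_gib      = total_gib - reserved_gib  # memory PyTorch never reserved

print(f"fragmentation slack: {fragmentation_gib:.2f} GiB")
print(f"unreserved headroom: {headroom_gib:.2f} GiB")
# Under 1 GiB of slack either way, versus a 32 MiB failed allocation near
# the very top of the card -- the run is essentially out of memory.
```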

Could you please provide guidance on how to resolve this issue? Are there specific modifications or additional environment settings that might help avoid these OOM errors during stage 2 fine-tuning, especially when tuning the backbone with the current configuration?

Thanks for your help and support!
