# Multi-GPU Training Fix

## Problem

Training works with 1 GPU but fails with 4 GPUs with the error:

```
torch.AcceleratorError: CUDA error: device-side assert triggered
```

## Root Cause

The failure is caused by **data format incompatibility** with Distributed Data Parallel (DDP) training:

1. The **text-based image format** (`<image>` as a string placeholder) doesn't work reliably with multiple GPUs
2. **Data worker race conditions** occur when preprocessing data across multiple processes
3. Vision-language models need consistent tokenization across all GPUs

## Solutions Applied

### 1. Use Structured Data Format ✅

**Changed:**

```bash
DATASET="data/training/sft_frames_dataset_1500_sft.json"  # ❌ Text format
```

**To:**

```bash
DATASET="data/training/sft_frames_dataset_1500.json"  # ✅ Structured format
```

**Why it works:**

- The structured format uses the native ms-swift/transformers multimodal format
- Images are represented as `{"type": "image", "image": "path"}` objects
- Preprocessing is consistent across all GPU workers

**Format comparison:**

❌ **Bad (text format):**

```json
{
  "conversation": [
    {
      "from": "human",
      "value": "<image>\nPrompt text"
    }
  ]
}
```

✅ **Good (structured format):**

```json
{
  "conversation": [
    {
      "role": "user",
      "content": [
        {"type": "image", "image": "data/frames/img.jpg"},
        {"type": "text", "text": "Prompt text"}
      ]
    }
  ]
}
```

### 2. Reduced Data Workers ✅

**Changed:**

```bash
--dataset_num_proc 4 \
--dataloader_num_workers 4
```

**To:**

```bash
--dataset_num_proc 1 \
--dataloader_num_workers 1
```

**Why it helps:**

- Eliminates race conditions in data preprocessing
- Ensures consistent tokenization across GPUs
- Slight performance hit, but much more stable

### 3. Added CUDA Error Debugging ✅

**Added:**

```bash
CUDA_LAUNCH_BLOCKING=1
```

**Why it helps:**

- Makes CUDA operations synchronous, which produces clearer error messages
- Shows exactly which operation caused the error
- Essential for debugging multi-GPU issues

## Usage

### Run with 4 GPUs:

```bash
GPU="0,1,2,3" bash scripts/train_reward_model.sh
```

### Run with a specific dataset:

```bash
DATASET="data/training/sft_frames_dataset_1500.json" GPU="0,1,2,3" bash scripts/train_reward_model.sh
```

### If issues persist:

```bash
# Further reduce batch size
BATCH_SIZE=1 GRAD_ACCUM=4 GPU="0,1,2,3" bash scripts/train_reward_model.sh

# Or start with 2 GPUs to test
GPU="0,1" bash scripts/train_reward_model.sh
```

## Data Pipeline Fix

If you regenerate the data, make sure to use the structured format:

**DON'T convert to the text format:**

```python
# ❌ SKIP this step in convert_to_sft.py
converted.append({
    "conversation": [
        {"from": "human", "value": f"<image>\n{text}"},
        {"from": "assistant", "value": response}
    ]
})
```

**DO use the structured format:**

```python
# ✅ USE this format
sample = {
    "conversation": [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": str(frame_path)},
                {"type": "text", "text": "Your prompt"}
            ]
        },
        {
            "role": "assistant",
            "content": json.dumps(response_dict)
        }
    ]
}
```

## Verification

Check whether your dataset has the correct format (pretty-print the whole file first, then view the top, so the JSON parser never sees a truncated document):

```bash
python3 -m json.tool data/training/sft_frames_dataset_1500.json | head -50
```

Look for:

- ✅ `"role": "user"` and `"content": [...]` (structured format)
- ❌ `"from": "human"` and `"value": "<image>\n..."` (text format)
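Beyond eyeballing the output, the same check can be scripted. The sketch below is a hypothetical helper (the function name and structure are not part of the repo); it assumes the dataset is a JSON array of samples, each with a `"conversation"` list as shown in the format comparison above, and returns the indices of any samples still in the legacy `"from"`/`"value"` text format:

```python
import json


def find_text_format_samples(path):
    """Return indices of samples that still use the legacy text format.

    Minimal sketch: assumes a JSON array of samples, each containing a
    "conversation" list, as in the format comparison above.
    """
    with open(path) as f:
        samples = json.load(f)

    bad = []
    for i, sample in enumerate(samples):
        for turn in sample.get("conversation", []):
            # Text-format turns use "from"/"value" keys; structured turns use
            # "role"/"content", with user content given as a list of parts.
            if "from" in turn or (
                turn.get("role") == "user"
                and not isinstance(turn.get("content"), list)
            ):
                bad.append(i)
                break
    return bad
```

For example, `find_text_format_samples("data/training/sft_frames_dataset_1500.json")` should return an empty list if the conversion was done correctly.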