# Hardware Selection Guide

Choosing the right hardware (flavor) is critical for cost-effective training.

## Available Hardware

### CPU

- `cpu-basic` - Basic CPU, testing only
- `cpu-upgrade` - Enhanced CPU

**Use cases:** Dataset validation, preprocessing, testing scripts

**Not recommended for training:** Too slow for any meaningful training

### GPU Options

| Flavor | GPU | Memory | Use Case | Cost/hour |
|--------|-----|--------|----------|-----------|
| `t4-small` | NVIDIA T4 | 16GB | <1B models, demos | ~$0.50-1 |
| `t4-medium` | NVIDIA T4 | 16GB | 1-3B models, development | ~$1-2 |
| `l4x1` | NVIDIA L4 | 24GB | 3-7B models, efficient training | ~$2-3 |
| `l4x4` | 4x NVIDIA L4 | 96GB | Multi-GPU training | ~$8-12 |
| `a10g-small` | NVIDIA A10G | 24GB | 3-7B models, production | ~$3-4 |
| `a10g-large` | NVIDIA A10G | 24GB | 7-13B models | ~$4-6 |
| `a10g-largex2` | 2x NVIDIA A10G | 48GB | Multi-GPU, large models | ~$8-12 |
| `a10g-largex4` | 4x NVIDIA A10G | 96GB | Multi-GPU, very large models | ~$16-24 |
| `a100-large` | NVIDIA A100 | 40GB | 13B+ models, fast training | ~$8-12 |
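
The flavor is just a string passed when submitting a job. A minimal sketch, reusing the `hf_jobs` call format shown in the Multi-GPU Training section below (`train.py` is a placeholder script name):

```python
# Run a quick demo on a single T4; swap the flavor for anything in the table above
hf_jobs("uv", {
    "script": "train.py",
    "flavor": "t4-small",
    "timeout": "2h",
    "secrets": {"HF_TOKEN": "$HF_TOKEN"}
})
```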

### TPU Options

| Flavor | Type | Use Case |
|--------|------|----------|
| `v5e-1x1` | TPU v5e | Small TPU workloads |
| `v5e-2x2` | 4x TPU v5e | Medium TPU workloads |
| `v5e-2x4` | 8x TPU v5e | Large TPU workloads |

**Note:** TPUs require TPU-optimized code. Most TRL training uses GPUs.

## Selection Guidelines

### By Model Size

**Tiny Models (<1B parameters)**
- **Recommended:** `t4-small`
- **Example:** Qwen2.5-0.5B, TinyLlama
- **Batch size:** 4-8
- **Training time:** 1-2 hours for 1K examples

**Small Models (1-3B parameters)**
- **Recommended:** `t4-medium` or `a10g-small`
- **Example:** Qwen2.5-1.5B, Phi-2
- **Batch size:** 2-4
- **Training time:** 2-4 hours for 10K examples

**Medium Models (3-7B parameters)**
- **Recommended:** `a10g-small` or `a10g-large`
- **Example:** Qwen2.5-7B, Mistral-7B
- **Batch size:** 1-2 (or LoRA with 4-8)
- **Training time:** 4-8 hours for 10K examples

**Large Models (7-13B parameters)**
- **Recommended:** `a10g-large` or `a100-large`
- **Example:** Llama-3-8B, Mixtral-8x7B (with LoRA)
- **Batch size:** 1 (full fine-tuning) or 2-4 (LoRA)
- **Training time:** 6-12 hours for 10K examples
- **Note:** Always use LoRA/PEFT

**Very Large Models (13B+ parameters)**
- **Recommended:** `a100-large` with LoRA
- **Example:** Llama-2-13B, Llama-3-70B (LoRA only)
- **Batch size:** 1-2 with LoRA
- **Training time:** 8-24 hours for 10K examples
- **Note:** Full fine-tuning not feasible, use LoRA/PEFT

### By Budget

**Minimal Budget (<$5 total)**
- Use `t4-small`
- Train on subset of data (100-500 examples)
- Limit to 1-2 epochs
- Use small model (<1B)

**Small Budget ($5-20)**
- Use `t4-medium` or `a10g-small`
- Train on 1K-5K examples
- 2-3 epochs
- Model up to 3B parameters

**Medium Budget ($20-50)**
- Use `a10g-small` or `a10g-large`
- Train on 5K-20K examples
- 3-5 epochs
- Model up to 7B parameters

**Large Budget ($50-200)**
- Use `a10g-large` or `a100-large`
- Full dataset training
- Multiple epochs
- Model up to 13B parameters with LoRA

### By Training Type

**Quick Demo/Experiment**
- `t4-small`
- 50-100 examples
- 5-10 steps
- ~10-15 minutes

**Development/Iteration**
- `t4-medium` or `a10g-small`
- 1K examples
- 1 epoch
- ~30-60 minutes

**Production Training**
- `a10g-large` or `a100-large`
- Full dataset
- 3-5 epochs
- 4-12 hours

**Research/Experimentation**
- `a100-large`
- Multiple runs
- Various hyperparameters
- Budget for 20-50 hours

## Memory Considerations

### Estimating Memory Requirements

**Full fine-tuning:**

```
Memory (GB) ≈ (Model params in billions) × 20
```

**LoRA fine-tuning:**

```
Memory (GB) ≈ (Model params in billions) × 4
```

**Examples:**
- Qwen2.5-0.5B full: ~10GB ✅ fits t4-small
- Qwen2.5-1.5B full: ~30GB ❌ exceeds most GPUs
- Qwen2.5-1.5B LoRA: ~6GB ✅ fits t4-small
- Qwen2.5-7B full: ~140GB ❌ not feasible
- Qwen2.5-7B LoRA: ~28GB ✅ fits a100-large
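
As a sanity check, the two rules of thumb above can be wrapped in a tiny helper; the multipliers (20 and 4) come directly from the formulas and give only rough estimates:

```python
def estimate_memory_gb(params_billions: float, lora: bool = False) -> float:
    """Rough GPU memory estimate using the rules of thumb above."""
    return params_billions * (4 if lora else 20)

print(estimate_memory_gb(0.5))           # 10.0 -> fits t4-small (16GB)
print(estimate_memory_gb(7, lora=True))  # 28.0 -> needs a100-large (40GB)
```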

### Memory Optimization

If hitting memory limits:

1. **Use LoRA/PEFT**
   ```python
   peft_config=LoraConfig(r=16, lora_alpha=32)
   ```
2. **Reduce batch size**
   ```python
   per_device_train_batch_size=1
   ```
3. **Increase gradient accumulation**
   ```python
   gradient_accumulation_steps=8  # Effective batch size = 1×8
   ```
4. **Enable gradient checkpointing**
   ```python
   gradient_checkpointing=True
   ```
5. **Use mixed precision**
   ```python
   bf16=True  # or fp16=True
   ```
6. **Upgrade to larger GPU**
   - t4 → a10g → a100
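
Taken together, optimizations 1-5 typically land in a single training config. A minimal sketch, assuming TRL's `SFTConfig` and `peft`'s `LoraConfig` (the field names mirror the snippets above; values are illustrative):

```python
from peft import LoraConfig
from trl import SFTConfig

# Memory-conscious settings combining the tips above
training_args = SFTConfig(
    output_dir="outputs",
    per_device_train_batch_size=1,   # tip 2: smallest per-device batch
    gradient_accumulation_steps=8,   # tip 3: effective batch size = 1 x 8
    gradient_checkpointing=True,     # tip 4: trade compute for memory
    bf16=True,                       # tip 5: mixed precision
)

# Tip 1: train LoRA adapters instead of all weights
peft_config = LoraConfig(r=16, lora_alpha=32)
```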

## Cost Estimation

### Formula

```
Total Cost = (Hours of training) × (Cost per hour)
```

### Example Calculations

**Quick demo:**
- Hardware: t4-small ($0.75/hour)
- Time: 15 minutes (0.25 hours)
- Cost: $0.19

**Development training:**
- Hardware: a10g-small ($3.50/hour)
- Time: 2 hours
- Cost: $7.00

**Production training:**
- Hardware: a10g-large ($5/hour)
- Time: 6 hours
- Cost: $30.00

**Large model with LoRA:**
- Hardware: a100-large ($10/hour)
- Time: 8 hours
- Cost: $80.00
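
The same arithmetic as a small helper, using the approximate hourly rates from the examples above:

```python
def estimate_cost(hours: float, cost_per_hour: float) -> float:
    """Total cost = training hours x hourly rate."""
    return hours * cost_per_hour

print(estimate_cost(0.25, 0.75))  # quick demo on t4-small -> ~$0.19
print(estimate_cost(6, 5.00))     # production run on a10g-large -> $30.00
```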

### Cost Optimization Tips

1. **Start small:** Test on t4-small with a subset of your data
2. **Use LoRA:** 4-5x cheaper than full fine-tuning
3. **Optimize hyperparameters:** Fewer epochs if possible
4. **Set appropriate timeout:** Don't waste compute on stalled jobs
5. **Use checkpointing:** Resume if a job fails
6. **Monitor costs:** Check running jobs regularly
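
For tip 5, a minimal checkpointing sketch, assuming TRL's `SFTConfig` (all fields are standard `transformers.TrainingArguments` options; values are illustrative):

```python
from trl import SFTConfig

# Save periodic checkpoints so a failed or timed-out job can be resumed
training_args = SFTConfig(
    output_dir="outputs",
    save_strategy="steps",
    save_steps=100,       # write a checkpoint every 100 steps
    save_total_limit=2,   # keep only the two most recent checkpoints
)

# On a restarted job, pick up from the last checkpoint:
# trainer.train(resume_from_checkpoint=True)
```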

## Multi-GPU Training

TRL automatically handles multi-GPU training with Accelerate when using multi-GPU flavors.

**Multi-GPU flavors:**
- `l4x4` - 4x L4 GPUs
- `a10g-largex2` - 2x A10G GPUs
- `a10g-largex4` - 4x A10G GPUs

**When to use:**
- Models >13B parameters
- Need faster training (near-linear speedup)
- Large datasets (>50K examples)

**Example:**

```python
hf_jobs("uv", {
    "script": "train.py",
    "flavor": "a10g-largex2",  # 2 GPUs
    "timeout": "4h",
    "secrets": {"HF_TOKEN": "$HF_TOKEN"}
})
```

No code changes are needed; TRL/Accelerate handles distribution automatically.
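
To confirm that all GPUs are actually visible inside the job, a one-line sanity check at the top of the training script is enough (plain PyTorch, nothing Jobs-specific):

```python
import torch

# Expect 2 on a10g-largex2, 4 on a10g-largex4 or l4x4
print(f"Visible GPUs: {torch.cuda.device_count()}")
```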

## Choosing Between Options

### a10g vs a100

**Choose a10g when:**
- Model <13B parameters
- Budget conscious
- Training time not critical

**Choose a100 when:**
- Model 13B+ parameters
- Need fastest training
- Memory requirements high
- Budget allows

### Single vs Multi-GPU

**Choose single GPU when:**
- Model <7B parameters
- Budget constrained
- Simpler debugging

**Choose multi-GPU when:**
- Model >13B parameters
- Need faster training
- Large batch sizes required
- Cost-effective for large jobs

## Quick Reference

```python
# Model size → Hardware selection
HARDWARE_MAP = {
    "<1B": "t4-small",
    "1-3B": "a10g-small",
    "3-7B": "a10g-large",
    "7-13B": "a10g-large (LoRA) or a100-large",
    ">13B": "a100-large (LoRA required)"
}
```