# Hardware Selection Guide
Choosing the right hardware (flavor) is critical for cost-effective training.
## Available Hardware
### CPU
- `cpu-basic` - Basic CPU, testing only
- `cpu-upgrade` - Enhanced CPU
**Use cases:** Dataset validation, preprocessing, testing scripts
**Not recommended for training:** Too slow for any meaningful training
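For example, a preprocessing job can be submitted on a CPU flavor with the same `hf_jobs` pattern used for GPU training later in this guide (a minimal sketch; `prepare_dataset.py` is a hypothetical script name):

```python
# Run dataset validation/preprocessing on the cheapest flavor.
# "prepare_dataset.py" stands in for your own script.
hf_jobs("uv", {
    "script": "prepare_dataset.py",
    "flavor": "cpu-basic",   # CPU is sufficient for tokenization and validation
    "timeout": "30m",
    "secrets": {"HF_TOKEN": "$HF_TOKEN"}
})
```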
### GPU Options
| Flavor | GPU | Memory | Use Case | Cost/hour |
|--------|-----|--------|----------|-----------|
| `t4-small` | NVIDIA T4 | 16GB | <1B models, demos | ~$0.50-1 |
| `t4-medium` | NVIDIA T4 | 16GB | 1-3B models, development | ~$1-2 |
| `l4x1` | NVIDIA L4 | 24GB | 3-7B models, efficient training | ~$2-3 |
| `l4x4` | 4x NVIDIA L4 | 96GB | Multi-GPU training | ~$8-12 |
| `a10g-small` | NVIDIA A10G | 24GB | 3-7B models, production | ~$3-4 |
| `a10g-large` | NVIDIA A10G | 24GB | 7-13B models | ~$4-6 |
| `a10g-largex2` | 2x NVIDIA A10G | 48GB | Multi-GPU, large models | ~$8-12 |
| `a10g-largex4` | 4x NVIDIA A10G | 96GB | Multi-GPU, very large models | ~$16-24 |
| `a100-large` | NVIDIA A100 | 40GB | 13B+ models, fast training | ~$8-12 |
### TPU Options
| Flavor | Type | Use Case |
|--------|------|----------|
| `v5e-1x1` | TPU v5e | Small TPU workloads |
| `v5e-2x2` | 4x TPU v5e | Medium TPU workloads |
| `v5e-2x4` | 8x TPU v5e | Large TPU workloads |
**Note:** TPUs require TPU-optimized code. Most TRL training uses GPUs.
## Selection Guidelines
### By Model Size
**Tiny Models (<1B parameters)**
- **Recommended:** `t4-small`
- **Example:** Qwen2.5-0.5B, TinyLlama
- **Batch size:** 4-8
- **Training time:** 1-2 hours for 1K examples
**Small Models (1-3B parameters)**
- **Recommended:** `t4-medium` or `a10g-small`
- **Example:** Qwen2.5-1.5B, Phi-2
- **Batch size:** 2-4
- **Training time:** 2-4 hours for 10K examples
**Medium Models (3-7B parameters)**
- **Recommended:** `a10g-small` or `a10g-large`
- **Example:** Qwen2.5-7B, Mistral-7B
- **Batch size:** 1-2 (or 4-8 with LoRA)
- **Training time:** 4-8 hours for 10K examples
**Large Models (7-13B parameters)**
- **Recommended:** `a10g-large` or `a100-large`
- **Example:** Llama-3-8B, Mixtral-8x7B (with LoRA)
- **Batch size:** 1 (full fine-tuning) or 2-4 (LoRA)
- **Training time:** 6-12 hours for 10K examples
- **Note:** Always use LoRA/PEFT
**Very Large Models (13B+ parameters)**
- **Recommended:** `a100-large` with LoRA
- **Example:** Llama-2-13B, Llama-3-70B (LoRA only)
- **Batch size:** 1-2 with LoRA
- **Training time:** 8-24 hours for 10K examples
- **Note:** Full fine-tuning is not feasible; use LoRA/PEFT (a minimal setup is sketched below)
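A typical LoRA setup in TRL for these larger tiers looks roughly like the following (a minimal sketch assuming a recent TRL release with `SFTConfig`; the model and dataset names are illustrative):

```python
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("trl-lib/Capybara", split="train")  # example dataset

trainer = SFTTrainer(
    model="meta-llama/Meta-Llama-3-8B",  # illustrative 8B model
    train_dataset=dataset,
    args=SFTConfig(output_dir="out", per_device_train_batch_size=2, bf16=True),
    peft_config=LoraConfig(r=16, lora_alpha=32),  # train adapters, not full weights
)
trainer.train()
```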
### By Budget
**Minimal Budget (<$5 total)**
- Use `t4-small`
- Train on subset of data (100-500 examples)
- Limit to 1-2 epochs
- Use small model (<1B)
**Small Budget ($5-20)**
- Use `t4-medium` or `a10g-small`
- Train on 1K-5K examples
- 2-3 epochs
- Model up to 3B parameters
**Medium Budget ($20-50)**
- Use `a10g-small` or `a10g-large`
- Train on 5K-20K examples
- 3-5 epochs
- Model up to 7B parameters
**Large Budget ($50-200)**
- Use `a10g-large` or `a100-large`
- Full dataset training
- Multiple epochs
- Model up to 13B parameters with LoRA
### By Training Type
**Quick Demo/Experiment**
- `t4-small`
- 50-100 examples
- 5-10 steps
- ~10-15 minutes
**Development/Iteration**
- `t4-medium` or `a10g-small`
- 1K examples
- 1 epoch
- ~30-60 minutes
**Production Training**
- `a10g-large` or `a100-large`
- Full dataset
- 3-5 epochs
- 4-12 hours
**Research/Experimentation**
- `a100-large`
- Multiple runs
- Various hyperparameters
- Budget for 20-50 hours
## Memory Considerations
### Estimating Memory Requirements
**Full fine-tuning:**
```
Memory (GB) ≈ (Model params in billions) × 20
```
**LoRA fine-tuning:**
```
Memory (GB) ≈ (Model params in billions) × 4
```
**Examples:**
- Qwen2.5-0.5B full: ~10GB ✅ fits `t4-small` (16GB)
- Qwen2.5-1.5B full: ~30GB ❌ exceeds 24GB GPUs; needs `a100-large` (40GB)
- Qwen2.5-1.5B LoRA: ~6GB ✅ fits `t4-small` (16GB)
- Qwen2.5-7B full: ~140GB ❌ not feasible on single-GPU flavors
- Qwen2.5-7B LoRA: ~28GB ⚠️ exceeds the 24GB on `a10g-large` by this estimate; use `a100-large` (40GB) or the optimizations below
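These rules of thumb are easy to script (a minimal sketch; the function name is ours, not a library API):

```python
def estimate_memory_gb(params_billions: float, lora: bool = False) -> float:
    """Rule of thumb from above: params (B) x 4 for LoRA, x 20 for full fine-tuning."""
    return params_billions * (4 if lora else 20)

print(estimate_memory_gb(0.5))             # 10.0 -> fits t4-small (16GB)
print(estimate_memory_gb(7.0, lora=True))  # 28.0 -> a100-large (40GB)
```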
### Memory Optimization
If you hit memory limits, apply these in order (a combined configuration sketch follows the list):
1. **Use LoRA/PEFT**
```python
peft_config=LoraConfig(r=16, lora_alpha=32)
```
2. **Reduce batch size**
```python
per_device_train_batch_size=1
```
3. **Increase gradient accumulation**
```python
gradient_accumulation_steps=8 # Effective batch size = 1×8
```
4. **Enable gradient checkpointing**
```python
gradient_checkpointing=True
```
5. **Use mixed precision**
```python
bf16=True # or fp16=True
```
6. **Upgrade to larger GPU**
- t4 → a10g → a100
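Combined, steps 1-5 might look like the following `SFTConfig` (a sketch with the values suggested above, not a tuned configuration):

```python
from trl import SFTConfig

args = SFTConfig(
    output_dir="out",
    per_device_train_batch_size=1,  # step 2: smallest per-device batch
    gradient_accumulation_steps=8,  # step 3: effective batch size = 1 x 8
    gradient_checkpointing=True,    # step 4: recompute activations to save memory
    bf16=True,                      # step 5: mixed precision (or fp16=True)
)
# Step 1: pass peft_config=LoraConfig(r=16, lora_alpha=32) to SFTTrainer,
# as in the LoRA sketch earlier in this guide.
```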
## Cost Estimation
### Formula
```
Total Cost = (Hours of training) × (Cost per hour)
```
### Example Calculations
**Quick demo:**
- Hardware: t4-small ($0.75/hour)
- Time: 15 minutes (0.25 hours)
- Cost: $0.19
**Development training:**
- Hardware: a10g-small ($3.50/hour)
- Time: 2 hours
- Cost: $7.00
**Production training:**
- Hardware: a10g-large ($5/hour)
- Time: 6 hours
- Cost: $30.00
**Large model with LoRA:**
- Hardware: a100-large ($10/hour)
- Time: 8 hours
- Cost: $80.00
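The arithmetic is simple enough to script when comparing flavors (a minimal sketch using the rates assumed in the examples above):

```python
def training_cost(hours: float, rate_per_hour: float) -> float:
    """Total cost = hours of training x cost per hour."""
    return hours * rate_per_hour

print(training_cost(0.25, 0.75))  # quick demo on t4-small  -> ~$0.19
print(training_cost(8.0, 10.00))  # LoRA run on a100-large  -> $80.00
```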
### Cost Optimization Tips
1. **Start small:** Test on t4-small with subset
2. **Use LoRA:** 4-5x cheaper than full fine-tuning
3. **Optimize hyperparameters:** Fewer epochs if possible
4. **Set appropriate timeout:** Don't waste compute on stalled jobs
5. **Use checkpointing:** Resume if a job fails (see the sketch below)
6. **Monitor costs:** Check running jobs regularly
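For tip 5, Transformers/TRL trainers can resume from the latest checkpoint in `output_dir` (a fragment, assuming `trainer` is an `SFTTrainer` as sketched earlier):

```python
# Save checkpoints periodically so a failed job doesn't waste spend.
args = SFTConfig(output_dir="out", save_steps=100)

# On a restarted job, resume from the most recent checkpoint in output_dir.
trainer.train(resume_from_checkpoint=True)
```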
## Multi-GPU Training
TRL automatically handles multi-GPU training with Accelerate when using multi-GPU flavors.
**Multi-GPU flavors:**
- `l4x4` - 4x L4 GPUs
- `a10g-largex2` - 2x A10G GPUs
- `a10g-largex4` - 4x A10G GPUs
**When to use:**
- Models >13B parameters
- Need faster training (speedup is close to linear for data-parallel training)
- Large datasets (>50K examples)
**Example:**
```python
hf_jobs("uv", {
"script": "train.py",
"flavor": "a10g-largex2", # 2 GPUs
"timeout": "4h",
"secrets": {"HF_TOKEN": "$HF_TOKEN"}
})
```
No code changes needed—TRL/Accelerate handles distribution automatically.
## Choosing Between Options
### a10g vs a100
**Choose a10g when:**
- Model <13B parameters
- Budget conscious
- Training time not critical
**Choose a100 when:**
- Model 13B+ parameters
- Need fastest training
- Memory requirements high
- Budget allows
### Single vs Multi-GPU
**Choose single GPU when:**
- Model <7B parameters
- Budget constrained
- Simpler debugging
**Choose multi-GPU when:**
- Model >13B parameters
- Need faster training
- Large batch sizes required
- Cost-effective for large jobs
## Quick Reference
```python
# Model size → Hardware selection
HARDWARE_MAP = {
    "<1B": "t4-small",
    "1-3B": "a10g-small",
    "3-7B": "a10g-large",
    "7-13B": "a10g-large (LoRA) or a100-large",
    ">13B": "a100-large (LoRA required)",
}
```
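A lookup by parameter count can be more convenient than string keys (a minimal sketch; the function name and thresholds are ours, derived from the map above):

```python
def pick_flavor(params_billions: float) -> str:
    """Map a parameter count to the suggested flavor from HARDWARE_MAP."""
    if params_billions < 1:
        return "t4-small"
    if params_billions < 3:
        return "a10g-small"
    if params_billions < 7:
        return "a10g-large"
    if params_billions < 13:
        return "a10g-large (LoRA) or a100-large"
    return "a100-large (LoRA required)"

print(pick_flavor(7.6))  # e.g. Qwen2.5-7B -> "a10g-large (LoRA) or a100-large"
```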