# Hardware Selection Guide
Choosing the right hardware (flavor) is critical for cost-effective training.
## Available Hardware
### CPU
- `cpu-basic` - Basic CPU, testing only
- `cpu-upgrade` - Enhanced CPU
**Use cases:** Dataset validation, preprocessing, testing scripts
**Not recommended for training:** Too slow for any meaningful training
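For example, a preprocessing job can be submitted on a CPU flavor with the same `hf_jobs` pattern used for GPU training later in this guide (a minimal sketch; `prepare_dataset.py` is a hypothetical script name):

```python
# Run dataset validation/preprocessing on the cheapest flavor.
# "prepare_dataset.py" stands in for your own script.
hf_jobs("uv", {
    "script": "prepare_dataset.py",
    "flavor": "cpu-basic",   # CPU is sufficient for tokenization and validation
    "timeout": "30m",
    "secrets": {"HF_TOKEN": "$HF_TOKEN"}
})
```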
### GPU Options
| Flavor | GPU | Memory | Use Case | Cost/hour |
|--------|-----|--------|----------|-----------|
| `t4-small` | NVIDIA T4 | 16GB | <1B models, demos | ~$0.50-1 |
| `t4-medium` | NVIDIA T4 | 16GB | 1-3B models, development | ~$1-2 |
| `l4x1` | NVIDIA L4 | 24GB | 3-7B models, efficient training | ~$2-3 |
| `l4x4` | 4x NVIDIA L4 | 96GB | Multi-GPU training | ~$8-12 |
| `a10g-small` | NVIDIA A10G | 24GB | 3-7B models, production | ~$3-4 |
| `a10g-large` | NVIDIA A10G | 24GB | 7-13B models | ~$4-6 |
| `a10g-largex2` | 2x NVIDIA A10G | 48GB | Multi-GPU, large models | ~$8-12 |
| `a10g-largex4` | 4x NVIDIA A10G | 96GB | Multi-GPU, very large models | ~$16-24 |
| `a100-large` | NVIDIA A100 | 40GB | 13B+ models, fast training | ~$8-12 |
### TPU Options
| Flavor | Type | Use Case |
|--------|------|----------|
| `v5e-1x1` | TPU v5e | Small TPU workloads |
| `v5e-2x2` | 4x TPU v5e | Medium TPU workloads |
| `v5e-2x4` | 8x TPU v5e | Large TPU workloads |
**Note:** TPUs require TPU-optimized code. Most TRL training uses GPUs.
## Selection Guidelines
### By Model Size
**Tiny Models (<1B parameters)**
- **Recommended:** `t4-small`
- **Example:** Qwen2.5-0.5B, TinyLlama
- **Batch size:** 4-8
- **Training time:** 1-2 hours for 1K examples
**Small Models (1-3B parameters)**
- **Recommended:** `t4-medium` or `a10g-small`
- **Example:** Qwen2.5-1.5B, Phi-2
- **Batch size:** 2-4
- **Training time:** 2-4 hours for 10K examples
**Medium Models (3-7B parameters)**
- **Recommended:** `a10g-small` or `a10g-large`
- **Example:** Qwen2.5-7B, Mistral-7B
- **Batch size:** 1-2 (or 4-8 with LoRA)
- **Training time:** 4-8 hours for 10K examples
**Large Models (7-13B parameters)**
- **Recommended:** `a10g-large` or `a100-large`
- **Example:** Llama-3-8B, Mixtral-8x7B (with LoRA)
- **Batch size:** 1 (full fine-tuning) or 2-4 (LoRA)
- **Training time:** 6-12 hours for 10K examples
- **Note:** Always use LoRA/PEFT
**Very Large Models (13B+ parameters)**
- **Recommended:** `a100-large` with LoRA
- **Example:** Llama-2-13B, Llama-3-70B (LoRA only)
- **Batch size:** 1-2 with LoRA
- **Training time:** 8-24 hours for 10K examples
- **Note:** Full fine-tuning is not feasible; use LoRA/PEFT (a minimal setup is sketched below)
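A typical LoRA setup in TRL for these larger tiers looks roughly like the following (a minimal sketch assuming a recent TRL release with `SFTConfig`; the model and dataset names are illustrative):

```python
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("trl-lib/Capybara", split="train")  # example dataset

trainer = SFTTrainer(
    model="meta-llama/Meta-Llama-3-8B",  # illustrative 8B model
    train_dataset=dataset,
    args=SFTConfig(output_dir="out", per_device_train_batch_size=2, bf16=True),
    peft_config=LoraConfig(r=16, lora_alpha=32),  # train adapters, not full weights
)
trainer.train()
```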
### By Budget
**Minimal Budget (<$5 total)**
- Use `t4-small`
- Train on subset of data (100-500 examples)
- Limit to 1-2 epochs
- Use small model (<1B)
**Small Budget ($5-20)**
- Use `t4-medium` or `a10g-small`
- Train on 1K-5K examples
- 2-3 epochs
- Model up to 3B parameters
**Medium Budget ($20-50)**
- Use `a10g-small` or `a10g-large`
- Train on 5K-20K examples
- 3-5 epochs
- Model up to 7B parameters
**Large Budget ($50-200)**
- Use `a10g-large` or `a100-large`
- Full dataset training
- Multiple epochs
- Model up to 13B parameters with LoRA
### By Training Type
**Quick Demo/Experiment**
- `t4-small`
- 50-100 examples
- 5-10 steps
- ~10-15 minutes
**Development/Iteration**
- `t4-medium` or `a10g-small`
- 1K examples
- 1 epoch
- ~30-60 minutes
**Production Training**
- `a10g-large` or `a100-large`
- Full dataset
- 3-5 epochs
- 4-12 hours
**Research/Experimentation**
- `a100-large`
- Multiple runs
- Various hyperparameters
- Budget for 20-50 hours
## Memory Considerations
### Estimating Memory Requirements
**Full fine-tuning:**
```
Memory (GB) ≈ (Model params in billions) × 20
```
**LoRA fine-tuning:**
```
Memory (GB) ≈ (Model params in billions) × 4
```
**Examples:**
- Qwen2.5-0.5B full: ~10GB ✅ fits `t4-small` (16GB)
- Qwen2.5-1.5B full: ~30GB ❌ exceeds 24GB GPUs; needs `a100-large` (40GB)
- Qwen2.5-1.5B LoRA: ~6GB ✅ fits `t4-small` (16GB)
- Qwen2.5-7B full: ~140GB ❌ not feasible on single-GPU flavors
- Qwen2.5-7B LoRA: ~28GB ⚠️ exceeds the 24GB on `a10g-large` by this estimate; use `a100-large` (40GB) or the optimizations below
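These rules of thumb are easy to script (a minimal sketch; the function name is ours, not a library API):

```python
def estimate_memory_gb(params_billions: float, lora: bool = False) -> float:
    """Rule of thumb from above: params (B) x 4 for LoRA, x 20 for full fine-tuning."""
    return params_billions * (4 if lora else 20)

print(estimate_memory_gb(0.5))             # 10.0 -> fits t4-small (16GB)
print(estimate_memory_gb(7.0, lora=True))  # 28.0 -> a100-large (40GB)
```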
### Memory Optimization
If you hit memory limits, apply these in order (a combined configuration sketch follows the list):
1. **Use LoRA/PEFT**
```python
peft_config=LoraConfig(r=16, lora_alpha=32)
```
2. **Reduce batch size**
```python
per_device_train_batch_size=1
```
3. **Increase gradient accumulation**
```python
gradient_accumulation_steps=8 # Effective batch size = 1×8
```
4. **Enable gradient checkpointing**
```python
gradient_checkpointing=True
```
5. **Use mixed precision**
```python
bf16=True # or fp16=True
```
6. **Upgrade to larger GPU**
- t4 → a10g → a100
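Combined, steps 1-5 might look like the following `SFTConfig` (a sketch with the values suggested above, not a tuned configuration):

```python
from trl import SFTConfig

args = SFTConfig(
    output_dir="out",
    per_device_train_batch_size=1,  # step 2: smallest per-device batch
    gradient_accumulation_steps=8,  # step 3: effective batch size = 1 x 8
    gradient_checkpointing=True,    # step 4: recompute activations to save memory
    bf16=True,                      # step 5: mixed precision (or fp16=True)
)
# Step 1: pass peft_config=LoraConfig(r=16, lora_alpha=32) to SFTTrainer,
# as in the LoRA sketch earlier in this guide.
```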
## Cost Estimation
### Formula
```
Total Cost = (Hours of training) × (Cost per hour)
```
### Example Calculations
**Quick demo:**
- Hardware: t4-small ($0.75/hour)
- Time: 15 minutes (0.25 hours)
- Cost: $0.19
**Development training:**
- Hardware: a10g-small ($3.50/hour)
- Time: 2 hours
- Cost: $7.00
**Production training:**
- Hardware: a10g-large ($5/hour)
- Time: 6 hours
- Cost: $30.00
**Large model with LoRA:**
- Hardware: a100-large ($10/hour)
- Time: 8 hours
- Cost: $80.00
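The arithmetic is simple enough to script when comparing flavors (a minimal sketch using the rates assumed in the examples above):

```python
def training_cost(hours: float, rate_per_hour: float) -> float:
    """Total cost = hours of training x cost per hour."""
    return hours * rate_per_hour

print(training_cost(0.25, 0.75))  # quick demo on t4-small  -> ~$0.19
print(training_cost(8.0, 10.00))  # LoRA run on a100-large  -> $80.00
```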
### Cost Optimization Tips
1. **Start small:** Test on t4-small with subset
2. **Use LoRA:** 4-5x cheaper than full fine-tuning
3. **Optimize hyperparameters:** Fewer epochs if possible
4. **Set appropriate timeout:** Don't waste compute on stalled jobs
5. **Use checkpointing:** Resume if a job fails (see the sketch below)
6. **Monitor costs:** Check running jobs regularly
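For tip 5, Transformers/TRL trainers can resume from the latest checkpoint in `output_dir` (a fragment, assuming `trainer` is an `SFTTrainer` as sketched earlier):

```python
# Save checkpoints periodically so a failed job doesn't waste spend.
args = SFTConfig(output_dir="out", save_steps=100)

# On a restarted job, resume from the most recent checkpoint in output_dir.
trainer.train(resume_from_checkpoint=True)
```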
## Multi-GPU Training
TRL automatically handles multi-GPU training with Accelerate when using multi-GPU flavors.
**Multi-GPU flavors:**
- `l4x4` - 4x L4 GPUs
- `a10g-largex2` - 2x A10G GPUs
- `a10g-largex4` - 4x A10G GPUs
**When to use:**
- Models >13B parameters
- Need faster training (speedup is close to linear for data-parallel training)
- Large datasets (>50K examples)
**Example:**
```python
hf_jobs("uv", {
"script": "train.py",
"flavor": "a10g-largex2", # 2 GPUs
"timeout": "4h",
"secrets": {"HF_TOKEN": "$HF_TOKEN"}
})
```
No code changes needed—TRL/Accelerate handles distribution automatically.
## Choosing Between Options
### a10g vs a100
**Choose a10g when:**
- Model <13B parameters
- Budget conscious
- Training time not critical
**Choose a100 when:**
- Model 13B+ parameters
- Need fastest training
- Memory requirements high
- Budget allows
### Single vs Multi-GPU
**Choose single GPU when:**
- Model <7B parameters
- Budget constrained
- Simpler debugging
**Choose multi-GPU when:**
- Model >13B parameters
- Need faster training
- Large batch sizes required
- Cost-effective for large jobs
## Quick Reference
```python
# Model size → Hardware selection
HARDWARE_MAP = {
    "<1B": "t4-small",
    "1-3B": "a10g-small",
    "3-7B": "a10g-large",
    "7-13B": "a10g-large (LoRA) or a100-large",
    ">13B": "a100-large (LoRA required)",
}
```
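A lookup by parameter count can be more convenient than string keys (a minimal sketch; the function name and thresholds are ours, derived from the map above):

```python
def pick_flavor(params_billions: float) -> str:
    """Map a parameter count to the suggested flavor from HARDWARE_MAP."""
    if params_billions < 1:
        return "t4-small"
    if params_billions < 3:
        return "a10g-small"
    if params_billions < 7:
        return "a10g-large"
    if params_billions < 13:
        return "a10g-large (LoRA) or a100-large"
    return "a100-large (LoRA required)"

print(pick_flavor(7.6))  # e.g. Qwen2.5-7B -> "a10g-large (LoRA) or a100-large"
```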