	# A100 Large Scale Training Guide

	This guide provides configurations and instructions for running fully-fledged experiments with multiple passes on the full OpenHermes-FR dataset (800k+ datapoints) using A100 GPUs.

	## Available Configurations

	### 1. A100 Large Batch Configuration
	File: `config/train_smollm3_openhermes_fr_a100_large.py`

	Key Features:
	- Effective Batch Size: 128 (8 × 16 gradient accumulation)
	- Training Duration: ~1.3 passes (8,000 steps)
	- Learning Rate: 5e-6 (optimized for large batches)
	- Mixed Precision: bf16 (A100 optimized)
	- Sequence Length: 8192 tokens
	- Gradient Checkpointing: Disabled (uses more memory but avoids recomputing activations, so steps run faster on an 80GB A100)

	Estimated Training Time: ~6-8 hours on A100

	### 2. Multiple Passes Configuration
	File: `config/train_smollm3_openhermes_fr_a100_multiple_passes.py`

	Key Features:
	- Effective Batch Size: 120 (6 × 20 gradient accumulation)
	- Training Duration: ~3.75 passes (25,000 steps)
	- Learning Rate: 3e-6 (conservative for long training)
	- Warmup Steps: 2000 (longer warmup for stability)
	- Checkpoint Strategy: Save every 2000 steps (12 checkpoints over the run)

	Estimated Training Time: ~20-24 hours on A100

	## Training Commands

	### Quick Start - Large Batch Experiment
	```bash
	python run_a100_large_experiment.py \
	--config config/train_smollm3_openhermes_fr_a100_large.py \
	--experiment-name "smollm3_openhermes_fr_large_batch" \
	--output-dir ./outputs/large_batch
	```

	### Multiple Passes Experiment
	```bash
	python run_a100_large_experiment.py \
	--config config/train_smollm3_openhermes_fr_a100_multiple_passes.py \
	--experiment-name "smollm3_openhermes_fr_multiple_passes" \
	--output-dir ./outputs/multiple_passes
	```

	### Dry Run (Check Configuration)
	```bash
	python run_a100_large_experiment.py \
	--config config/train_smollm3_openhermes_fr_a100_large.py \
	--dry-run
	```

	### Resume Training
	```bash
	python run_a100_large_experiment.py \
	--config config/train_smollm3_openhermes_fr_a100_multiple_passes.py \
	--resume ./outputs/multiple_passes/checkpoint-10000 \
	--output-dir ./outputs/multiple_passes
	```
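The resume command above takes an explicit checkpoint path. If you would rather pick the most recent checkpoint automatically, something like the following could work. It is a minimal sketch, assuming checkpoints are saved as `checkpoint-<step>` directories under the output directory (as the example path suggests). The helper name `find_latest_checkpoint` is hypothetical and not part of SmolFactory.

```python
import re
from pathlib import Path
from typing import Optional

# Assumption: checkpoints are directories named "checkpoint-<step>",
# matching the example path ./outputs/multiple_passes/checkpoint-10000.
_CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def find_latest_checkpoint(output_dir: str) -> Optional[Path]:
    """Return the checkpoint directory with the highest step, or None."""
    root = Path(output_dir)
    if not root.is_dir():
        return None
    best_step, best_path = -1, None
    for child in root.iterdir():
        match = _CHECKPOINT_RE.match(child.name)
        if child.is_dir() and match and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), child
    return best_path


latest = find_latest_checkpoint("./outputs/multiple_passes")
print(latest if latest else "no checkpoint found")
```

The printed path can then be passed to `--resume`. Comparing steps numerically matters here: a plain string sort would rank `checkpoint-8000` above `checkpoint-10000`.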

	## Configuration Details

	### Memory Usage Optimization
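	- Gradient Checkpointing: Disabled; this uses more memory but avoids recomputing activations, so steps run faster
	- Flash Attention: Enabled for memory-efficient attention
	- bf16 Mixed Precision: Preferred over fp16 on A100 because bf16 keeps the fp32 exponent range, so no loss scaling is needed
	- Gradient Clipping: Max norm 1.0 for stability
	- Group by Length: Enabled so each batch holds similar-length sequences and needs less padding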

	### Data Loading Optimization
	- Num Workers: 8 for faster data loading
	- Pin Memory: Enabled for GPU transfer efficiency
	- Prefetch Factor: 2 for pipeline optimization
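For orientation, the memory and data-loading settings listed above can be collected in one place. This is only a sketch: the key names follow Hugging Face `transformers.TrainingArguments` keywords as found in recent 4.x releases, and that mapping is an assumption. The SmolFactory config files may expose the same options under different field names.

```python
# Assumption: key names mirror `transformers.TrainingArguments` keywords;
# the actual SmolFactory config fields may be named differently.
a100_training_kwargs = {
    "bf16": True,                     # bf16 mixed precision
    "gradient_checkpointing": False,  # spend memory to save recomputation
    "max_grad_norm": 1.0,             # gradient clipping threshold
    "group_by_length": True,          # batch similar-length samples together
    "dataloader_num_workers": 8,      # parallel data-loading processes
    "dataloader_pin_memory": True,    # page-locked host memory for transfers
    "dataloader_prefetch_factor": 2,  # batches prefetched per worker
}

# Flash Attention is normally selected when the model is loaded, not here,
# e.g. via `attn_implementation="flash_attention_2"` in `from_pretrained`.
for name, value in a100_training_kwargs.items():
    print(f"{name} = {value}")
```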

	### Training Stability
	- Conservative Learning Rate: Lower LR for large effective batch sizes
	- Longer Warmup: More warmup steps for stability
	- Higher Beta2: 0.999 for AdamW stability (PyTorch's AdamW default, and higher than the 0.95 often used in LLM pretraining)
	- Gradient Clipping: Prevents gradient explosion
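To make the warmup concrete, here is a minimal sketch of a linear warmup using the Multiple Passes values (peak learning rate 3e-6, 2000 warmup steps). The linear shape is an assumption, and this guide does not say what the schedule does after warmup, so the sketch simply holds the rate constant. The function name `warmup_lr` is illustrative.

```python
def warmup_lr(step: int, peak_lr: float = 3e-6, warmup_steps: int = 2000) -> float:
    """Linear warmup from 0 to peak_lr, then constant (decay is not modeled)."""
    if warmup_steps <= 0 or step >= warmup_steps:
        return peak_lr
    return peak_lr * step / warmup_steps


for step in (0, 500, 1000, 2000, 25000):
    print(f"step {step:>6}: lr = {warmup_lr(step):.2e}")
```

With these values the learning rate reaches half its peak (1.5e-6) at step 1000 and the full 3e-6 at step 2000.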

	## Expected Results

	### Large Batch Configuration (1.3 passes)
	- Training Steps: 8,000
	- Effective Batch Size: 128
	- Steps per Epoch: ~6,250
	- Epochs: ~1.3
	- Expected Loss: Should converge to ~1.5-2.0

	### Multiple Passes Configuration (~3.75 passes)
	- Training Steps: 25,000
	- Effective Batch Size: 120
	- Steps per Epoch: ~6,667
	- Epochs: ~3.75
	- Expected Loss: Should converge to ~1.2-1.5
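The step and epoch figures above follow from simple arithmetic. The sketch below reproduces them, assuming a dataset of exactly 800,000 examples (the guide only says "800k+", so real numbers will differ slightly). The function name `summarize` is illustrative.

```python
DATASET_SIZE = 800_000  # assumption: the guide states "800k+" datapoints


def summarize(batch_size: int, grad_accum: int, max_steps: int,
              dataset_size: int = DATASET_SIZE):
    """Return (effective batch size, steps per epoch, number of epochs)."""
    effective = batch_size * grad_accum
    steps_per_epoch = round(dataset_size / effective)
    epochs = round(max_steps * effective / dataset_size, 2)
    return effective, steps_per_epoch, epochs


print(summarize(8, 16, 8_000))    # large batch     -> (128, 6250, 1.28)
print(summarize(6, 20, 25_000))   # multiple passes -> (120, 6667, 3.75)
```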

	## Monitoring and Logging

	### Trackio Integration
	Both configurations include Trackio monitoring:
	- Metrics Logging: Every 25-50 steps
	- Artifact Logging: Model checkpoints
	- Config Logging: Training configuration

	### Checkpoint Strategy
	- Large Batch: Save every 1000 steps (8 checkpoints)
	- Multiple Passes: Save every 2000 steps (12 checkpoints)
	- Best Model: The best checkpoint is loaded automatically at the end of training
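The checkpoint counts above can be checked with a short sketch that lists the steps at which a periodic save would fire. It assumes saves happen at exact multiples of the save interval. Whether an extra checkpoint is written at the final step depends on the training script and is not modeled here.

```python
from typing import List


def checkpoint_steps(max_steps: int, save_every: int) -> List[int]:
    """Steps at which a periodic checkpoint would be written."""
    return list(range(save_every, max_steps + 1, save_every))


large = checkpoint_steps(8_000, 1_000)
multi = checkpoint_steps(25_000, 2_000)
print(len(large), large[-1])   # 8 checkpoints, last at step 8000
print(len(multi), multi[-1])   # 12 checkpoints, last at step 24000
```

Note that for the Multiple Passes run the last periodic save lands at step 24,000, so the final 1,000 steps are only preserved if the script also saves at the end of training.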

	## Hardware Requirements

	### Minimum Requirements
	- GPU: A100 80GB (or multiple A100s)
	- RAM: 64GB+ system RAM
	- Storage: 100GB+ for checkpoints and logs
	- Network: Fast internet for dataset download

	### Recommended Setup
	- GPU: 2-4x A100 80GB
	- RAM: 128GB+ system RAM
	- Storage: 500GB+ NVMe SSD
	- Network: 10Gbps+ connection

	## Troubleshooting

	### Out of Memory (OOM)
	If you encounter OOM errors:
	1. Reduce `batch_size` from 8 to 6 or 4
	2. Increase `gradient_accumulation_steps` to maintain effective batch size
	3. Reduce `max_seq_length` from 8192 to 4096
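Steps 1 and 2 go together: when the per-device batch size drops, gradient accumulation has to rise to compensate. The sketch below computes the new accumulation value for a target effective batch size of 128. The helper name `rebalance` is illustrative. It rounds up, so the effective batch size may end slightly above the target when the division is not exact.

```python
import math


def rebalance(target_effective: int, new_batch_size: int):
    """Return (gradient_accumulation_steps, resulting effective batch size)."""
    grad_accum = math.ceil(target_effective / new_batch_size)
    return grad_accum, grad_accum * new_batch_size


for batch_size in (8, 6, 4):
    grad_accum, effective = rebalance(128, batch_size)
    print(f"batch_size={batch_size}: accumulation={grad_accum}, effective={effective}")
```

A batch size of 4 keeps the effective batch size at exactly 128 (4 × 32), while 6 does not divide 128 evenly and yields 132 (6 × 22).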

	### Slow Training
	If training is too slow:
	1. Increase `dataloader_num_workers` to 12-16
	2. Ensure you're using bf16 mixed precision
	3. Check that gradient checkpointing is disabled
	4. Verify flash attention is enabled

	### Convergence Issues
	If loss doesn't converge:
	1. Reduce learning rate by 2x
	2. Increase warmup steps
	3. Check gradient norms in logs
	4. Verify dataset quality
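For step 3, one simple way to scan logged gradient norms is to flag values that jump well above their recent median. The sketch below assumes you have already extracted the norms into a Python list. The function name, the window of 20 values, and the 3x threshold are all illustrative choices, not settings from SmolFactory or Trackio.

```python
from statistics import median
from typing import List


def find_grad_norm_spikes(norms: List[float], window: int = 20,
                          factor: float = 3.0) -> List[int]:
    """Indices whose norm exceeds `factor` x the median of the previous `window` values."""
    spikes = []
    for i in range(window, len(norms)):
        baseline = median(norms[i - window:i])
        if baseline > 0 and norms[i] > factor * baseline:
            spikes.append(i)
    return spikes


logged = [0.5] * 20 + [0.6, 4.0, 0.5]
print(find_grad_norm_spikes(logged))  # [21]
```

Frequent spikes, or a baseline that keeps rising, are a reasonable cue to try the lower learning rate or longer warmup suggested above.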

	## Customization

	### For Different Dataset Sizes
	Adjust `max_iters` based on your dataset size:
	```python
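	# For 1M datapoints with effective batch size 120
	dataset_size = 1_000_000
	effective_batch_size = 120
	desired_epochs = 3  # example value; set to the number of passes you want
	steps_per_epoch = dataset_size // effective_batch_size  # 8,333 steps
	max_iters = steps_per_epoch * desired_epochs  # 24,999 steps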
	```

	### For Different GPU Memory
	Adjust batch size and gradient accumulation:
	```python
	# For 40GB A100
	batch_size = 4
	gradient_accumulation_steps = 32 # Effective batch size = 128

	# For 24GB GPU
	batch_size = 2
	gradient_accumulation_steps = 64 # Effective batch size = 128
	```

	## Performance Tips

	1. Use bf16: Better than fp16 for A100
	2. Disable Gradient Checkpointing: An 80GB A100 has enough memory to skip activation recomputation
	3. Use Flash Attention: Memory efficient attention
	4. Group by Length: Better batching efficiency
	5. Pin Memory: Faster GPU transfers
	6. Multiple Workers: Faster data loading

	## Expected Timeline

	- Large Batch: 6-8 hours for ~1.3 passes
	- Multiple Passes: 20-24 hours for ~3.75 passes
	- Longer Runs (5+ passes): 30+ hours
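These estimates can be turned into a per-step time and reused for other step counts. The sketch below derives the seconds per step implied by the figures above and projects a run length from it. Actual throughput depends on hardware and sequence lengths, so measure your own step time from the logs before relying on the projection. Both function names are illustrative.

```python
def implied_seconds_per_step(max_steps: int, hours: float) -> float:
    """Average step time implied by a total run duration."""
    return hours * 3600 / max_steps


def estimate_hours(max_steps: int, seconds_per_step: float) -> float:
    """Projected run duration for a given average step time."""
    return max_steps * seconds_per_step / 3600


# Large Batch estimate: 8,000 steps in 6-8 hours
low = implied_seconds_per_step(8_000, 6)
high = implied_seconds_per_step(8_000, 8)
print(f"implied step time: {low:.2f}-{high:.2f} s")

# Project a hypothetical 12,000-step run at the same step times
print(f"12,000 steps: {estimate_hours(12_000, low):.1f}-{estimate_hours(12_000, high):.1f} h")
```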

	## Next Steps

	After training completes:
	1. Evaluate on validation set
	2. Test generation quality
	3. Push to Hugging Face Hub
	4. Deploy for inference

	For deployment instructions, see `DEPLOYMENT_GUIDE.md`.