Spaces:
Sleeping
Sleeping
| # A100 Large Scale Training Guide | |
| This guide provides configurations and instructions for running fully-fledged experiments with multiple passes on the full OpenHermes-FR dataset (800k+ datapoints) using A100 GPUs. | |
| ## Available Configurations | |
| ### 1. A100 Large Batch Configuration | |
| **File**: `config/train_smollm3_openhermes_fr_a100_large.py` | |
| **Key Features**: | |
| - **Effective Batch Size**: 128 (8 × 16 gradient accumulation) | |
| - **Training Duration**: ~1.3 passes (8,000 steps) | |
| - **Learning Rate**: 5e-6 (optimized for large batches) | |
| - **Mixed Precision**: bf16 (A100 optimized) | |
| - **Sequence Length**: 8192 tokens | |
| - **Memory Optimizations**: No gradient checkpointing for A100 efficiency | |
| **Estimated Training Time**: ~6-8 hours on A100 | |
| ### 2. Multiple Passes Configuration | |
| **File**: `config/train_smollm3_openhermes_fr_a100_multiple_passes.py` | |
| **Key Features**: | |
| - **Effective Batch Size**: 120 (6 × 20 gradient accumulation) | |
| - **Training Duration**: ~4 passes (25,000 steps) | |
| - **Learning Rate**: 3e-6 (conservative for long training) | |
| - **Warmup Steps**: 2000 (longer warmup for stability) | |
| - **Checkpoint Strategy**: More frequent saves (every 2000 steps) | |
| **Estimated Training Time**: ~20-24 hours on A100 | |
| ## Training Commands | |
| ### Quick Start - Large Batch Experiment | |
| ```bash | |
| python run_a100_large_experiment.py \ | |
| --config config/train_smollm3_openhermes_fr_a100_large.py \ | |
| --experiment-name "smollm3_openhermes_fr_large_batch" \ | |
| --output-dir ./outputs/large_batch | |
| ``` | |
| ### Multiple Passes Experiment | |
| ```bash | |
| python run_a100_large_experiment.py \ | |
| --config config/train_smollm3_openhermes_fr_a100_multiple_passes.py \ | |
| --experiment-name "smollm3_openhermes_fr_multiple_passes" \ | |
| --output-dir ./outputs/multiple_passes | |
| ``` | |
| ### Dry Run (Check Configuration) | |
| ```bash | |
| python run_a100_large_experiment.py \ | |
| --config config/train_smollm3_openhermes_fr_a100_large.py \ | |
| --dry-run | |
| ``` | |
| ### Resume Training | |
| ```bash | |
| python run_a100_large_experiment.py \ | |
| --config config/train_smollm3_openhermes_fr_a100_multiple_passes.py \ | |
| --resume ./outputs/multiple_passes/checkpoint-10000 \ | |
| --output-dir ./outputs/multiple_passes | |
| ``` | |
| ## Configuration Details | |
| ### Memory Usage Optimization | |
| - **Gradient Checkpointing**: Disabled for A100 efficiency | |
| - **Flash Attention**: Enabled for memory efficiency | |
| - **bf16 Mixed Precision**: Better for A100 than fp16 | |
| - **Gradient Clipping**: 1.0 for stability | |
| - **Group by Length**: Enabled for better batching | |
| ### Data Loading Optimization | |
| - **Num Workers**: 8 for faster data loading | |
| - **Pin Memory**: Enabled for GPU transfer efficiency | |
| - **Prefetch Factor**: 2 for pipeline optimization | |
| ### Training Stability | |
| - **Conservative Learning Rate**: Lower LR for large effective batch sizes | |
| - **Longer Warmup**: More warmup steps for stability | |
| - **Higher Beta2**: 0.999 for AdamW stability | |
| - **Gradient Clipping**: Prevents gradient explosion | |
| ## Expected Results | |
| ### Large Batch Configuration (1.3 passes) | |
| - **Training Steps**: 8,000 | |
| - **Effective Batch Size**: 128 | |
| - **Steps per Epoch**: ~6,250 | |
| - **Epochs**: ~1.3 | |
| - **Expected Loss**: Should converge to ~1.5-2.0 | |
| ### Multiple Passes Configuration (4 passes) | |
| - **Training Steps**: 25,000 | |
| - **Effective Batch Size**: 120 | |
| - **Steps per Epoch**: ~6,667 | |
| - **Epochs**: ~3.75 | |
| - **Expected Loss**: Should converge to ~1.2-1.5 | |
| ## Monitoring and Logging | |
| ### Trackio Integration | |
| Both configurations include Trackio monitoring: | |
| - **Metrics Logging**: Every 25-50 steps | |
| - **Artifact Logging**: Model checkpoints | |
| - **Config Logging**: Training configuration | |
| ### Checkpoint Strategy | |
| - **Large Batch**: Save every 1000 steps (8 checkpoints) | |
| - **Multiple Passes**: Save every 2000 steps (12 checkpoints) | |
| - **Best Model**: Automatically load best model at end | |
| ## Hardware Requirements | |
| ### Minimum Requirements | |
| - **GPU**: A100 80GB (or multiple A100s) | |
| - **RAM**: 64GB+ system RAM | |
| - **Storage**: 100GB+ for checkpoints and logs | |
| - **Network**: Fast internet for dataset download | |
| ### Recommended Setup | |
| - **GPU**: 2-4x A100 80GB | |
| - **RAM**: 128GB+ system RAM | |
| - **Storage**: 500GB+ NVMe SSD | |
| - **Network**: 10Gbps+ connection | |
| ## Troubleshooting | |
| ### Out of Memory (OOM) | |
| If you encounter OOM errors: | |
| 1. Reduce `batch_size` from 8 to 6 or 4 | |
| 2. Increase `gradient_accumulation_steps` to maintain effective batch size | |
| 3. Reduce `max_seq_length` from 8192 to 4096 | |
| ### Slow Training | |
| If training is too slow: | |
| 1. Increase `dataloader_num_workers` to 12-16 | |
| 2. Ensure you're using bf16 mixed precision | |
| 3. Check that gradient checkpointing is disabled | |
| 4. Verify flash attention is enabled | |
| ### Convergence Issues | |
| If loss doesn't converge: | |
| 1. Reduce learning rate by 2x | |
| 2. Increase warmup steps | |
| 3. Check gradient norms in logs | |
| 4. Verify dataset quality | |
| ## Customization | |
| ### For Different Dataset Sizes | |
| Adjust `max_iters` based on your dataset size: | |
| ```python | |
| # For 1M datapoints with effective batch size 120 | |
| steps_per_epoch = 1000000 // 120 # ~8,333 steps | |
| max_iters = steps_per_epoch * desired_epochs | |
| ``` | |
| ### For Different GPU Memory | |
| Adjust batch size and gradient accumulation: | |
| ```python | |
| # For 40GB A100 | |
| batch_size = 4 | |
| gradient_accumulation_steps = 32 # Effective batch size = 128 | |
| # For 24GB GPU | |
| batch_size = 2 | |
| gradient_accumulation_steps = 64 # Effective batch size = 128 | |
| ``` | |
| ## Performance Tips | |
| 1. **Use bf16**: Better than fp16 for A100 | |
| 2. **Disable Gradient Checkpointing**: A100 has enough memory | |
| 3. **Use Flash Attention**: Memory efficient attention | |
| 4. **Group by Length**: Better batching efficiency | |
| 5. **Pin Memory**: Faster GPU transfers | |
| 6. **Multiple Workers**: Faster data loading | |
| ## Expected Timeline | |
| - **Large Batch**: 6-8 hours for 1.3 passes | |
| - **Multiple Passes**: 20-24 hours for 4 passes | |
| - **Full Dataset (5+ passes)**: 30+ hours | |
| ## Next Steps | |
| After training completes: | |
| 1. Evaluate on validation set | |
| 2. Test generation quality | |
| 3. Push to Hugging Face Hub | |
| 4. Deploy for inference | |
| For deployment instructions, see `DEPLOYMENT_GUIDE.md`. |