# Flint-1.2B: Scaling Strategy Guide
# Deciding how to spend your TPU hours
## Decision Matrix
| TPU Hours | Strategy | Config | Tokens | Key Focus |
|-----------|----------|--------|--------|-----------|
| **17h** | 3-stage TAP | `flint_17h.yaml` | ~6.5B | Foundation + early reasoning |
| **40h** | 4-stage TAP | `flint_40h.yaml` | ~15B | Balanced reasoning + agency |
| **60h** | 5-stage TAP | `flint_60h.yaml` | ~23B | Deep capability crystallization |
| **100h+** | 6-stage + context ext | `flint_100h.yaml` | ~40B | Full + 8K context |
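The matrix can be read as a lookup: take the largest tier whose hour requirement fits your budget. The sketch below encodes the table rows as-is and adds a tokens-per-parameter figure for each tier. It assumes exactly 1.2B parameters (taken from the model name); `pick_tier` and `tokens_per_param` are illustrative helpers, not part of `train_flint.py`.

```python
# Rows copied from the decision matrix:
# (minimum TPU hours, config file, approximate tokens in billions).
TIERS = [
    (17, "flint_17h.yaml", 6.5),
    (40, "flint_40h.yaml", 15.0),
    (60, "flint_60h.yaml", 23.0),
    (100, "flint_100h.yaml", 40.0),
]

PARAMS_B = 1.2  # assumption: parameter count in billions, from the model name


def pick_tier(budget_hours):
    """Return the largest tier that fits the budget, or None if under 17h."""
    chosen = None
    for min_hours, config, tokens_b in TIERS:
        if budget_hours >= min_hours:
            chosen = (min_hours, config, tokens_b)
    return chosen


def tokens_per_param(tokens_b, params_b=PARAMS_B):
    """Training tokens seen per model parameter."""
    return tokens_b / params_b


if __name__ == "__main__":
    for budget in (10, 17, 45, 60, 120):
        tier = pick_tier(budget)
        if tier is None:
            print(f"{budget}h: below the smallest tier")
        else:
            _, config, tokens_b = tier
            ratio = tokens_per_param(tokens_b)
            print(f"{budget}h: {config}, ~{tokens_b}B tokens, ~{ratio:.1f} tokens/param")
```

A budget that falls between tiers (45h, say) maps to the lower tier. The leftover hours can go to post-training instead.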
## What Each Extra Hour Buys You
### 17h → 40h (+23h)
- **+8.5B tokens** (2.3× total)
- Adds dedicated "Polish" stage for balanced skill refinement
- Full Orca-AgentInstruct (1M examples vs subset)
- FineMath-3plus gets proper multi-epoch exposure
- **Expected gain: +5-8 MMLU, +8-12 GSM8K**
### 40h → 60h (+20h)
- **+8B tokens** (1.5× over 40h)
- Adds dedicated "Agency" stage focused on tool mastery
- 5th "Crystal" annealing stage with premium data only
- Multi-language code (10+ languages vs 5)
- **Expected gain: +3-5 MMLU, +5-8 GSM8K, +10-15 tool-use**
### 60h → 100h+ (+40h)
- **+17B tokens** (1.7× over 60h)
- Context extension to 8K (final 10h)
- arXiv + Wikipedia for factual grounding
- Highest tokens-per-parameter tier (~33 tokens/param, past the ~20 tokens/param Chinchilla-optimal point)
- **Expected gain: +3-5 MMLU, +3-5 GSM8K, 8K context**
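To compare the three upgrades on equal footing, divide the extra tokens by the extra hours. The sketch below does that using only the approximate figures quoted in the headings and bullets above, so treat the results as rough; `tokens_per_extra_hour` is an illustrative helper name.

```python
# (upgrade label, extra TPU hours, extra tokens in billions), from the sections above.
UPGRADES = [
    ("17h -> 40h", 23, 8.5),
    ("40h -> 60h", 20, 8.0),
    ("60h -> 100h", 40, 17.0),
]


def tokens_per_extra_hour(extra_hours, extra_tokens_b):
    """Billions of additional training tokens bought per additional TPU hour."""
    return extra_tokens_b / extra_hours


if __name__ == "__main__":
    for label, hours, tokens_b in UPGRADES:
        rate = tokens_per_extra_hour(hours, tokens_b)
        print(f"{label}: ~{rate:.2f}B tokens per extra hour")
```

The rate stays at roughly 0.4B tokens per hour across all three upgrades, so the expected benchmark gain per hour shrinks mainly because each later token helps less, not because throughput drops.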
## Scaling Principles (from literature)
1. **Overtraining works reliably** (arxiv:2403.08540)
   - Performance scales predictably up to 640× tokens/param
   - Diminishing returns, but never negative
2. **Data repetition is safe up to ~4 epochs** (Muennighoff et al.)
   - OpenThoughts-114k can be repeated 4-5×, at the upper edge of this range
   - Large web data (FineWeb-Edu, DCLM) should not repeat
3. **Quality annealing gives outsized returns** (SmolLM2, OLMo-2)
   - Final 10-15% of training with best data = biggest benchmark jumps
   - FineMath-4+ and hard reasoning problems during LR decay
4. **Inference-optimal training favors overtraining** (arxiv:2401.00448)
   - For deployment (inference cost matters): train a smaller model longer
   - At roughly equal training compute, 1.2B params on 23B tokens is cheaper to serve than 1.7B params on 16B tokens
5. **Muon gives ~2× sample efficiency** (arxiv:2502.16982)
   - Your 23B tokens with Muon ≈ 46B tokens with AdamW
   - Should stack with quality data → roughly multiplicative gains
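Principles 4 and 5 reduce to arithmetic that is easy to check. The sketch below uses the common 6·N·D approximation for training FLOPs and 2·N for per-token inference FLOPs. Both are rules of thumb, not measurements from this project. It also treats Muon's ~2× sample efficiency as a plain multiplier, which is a simplification. All function names are illustrative.

```python
def train_flops(params, tokens):
    """Approximate training compute via the 6 * N * D rule of thumb."""
    return 6.0 * params * tokens


def inference_flops_per_token(params):
    """Approximate forward-pass cost per generated token (2 * N rule of thumb)."""
    return 2.0 * params


def muon_equivalent_tokens(tokens, efficiency=2.0):
    """AdamW-equivalent tokens, assuming a constant sample-efficiency multiplier."""
    return tokens * efficiency


if __name__ == "__main__":
    small = train_flops(1.2e9, 23e9)  # 1.2B params on 23B tokens
    large = train_flops(1.7e9, 16e9)  # 1.7B params on 16B tokens
    print(f"training compute ratio (small / large): {small / large:.3f}")

    serve_ratio = inference_flops_per_token(1.7e9) / inference_flops_per_token(1.2e9)
    print(f"per-token inference cost ratio (large / small): {serve_ratio:.2f}")

    print(f"AdamW-equivalent tokens: {muon_equivalent_tokens(23e9) / 1e9:.0f}B")
```

Under these approximations the two configurations cost almost the same to train, while the 1.7B model costs about 1.4× more per generated token. That gap is the inference-efficiency argument in principle 4.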
## Multi-Session Management
For runs spanning multiple Kaggle sessions:
```bash
# Session 1: Start training
python train_flint.py --config configs/flint_60h.yaml
# Session 2: Auto-resumes from latest checkpoint
python train_flint.py --config configs/flint_60h.yaml --resume
# Resume from specific step (if latest is corrupted)
python train_flint.py --config configs/flint_60h.yaml --resume --checkpoint_step 5000
# List available checkpoints
python train_flint.py --config configs/flint_60h.yaml --list_checkpoints
```
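When the same notebook cell is rerun in every Kaggle session, it can help to build the command automatically: start fresh when no checkpoint exists, otherwise resume. The sketch below is one way to do that. It assumes checkpoints are saved as directories named `step_<N>` under a single folder; that layout is hypothetical, so adjust the pattern to whatever `--list_checkpoints` actually reports. Only the flags shown above are used.

```python
import re
from pathlib import Path

# Hypothetical naming scheme for checkpoint directories, e.g. "step_5000".
STEP_DIR = re.compile(r"^step_(\d+)$")


def saved_steps(ckpt_dir):
    """Return the sorted step numbers of checkpoint directories in ckpt_dir."""
    root = Path(ckpt_dir)
    if not root.is_dir():
        return []
    steps = []
    for child in root.iterdir():
        match = STEP_DIR.match(child.name)
        if child.is_dir() and match:
            steps.append(int(match.group(1)))
    return sorted(steps)


def build_command(config, ckpt_dir, skip_latest=False):
    """Build the train_flint.py command for this session as an argument list."""
    cmd = ["python", "train_flint.py", "--config", config]
    steps = saved_steps(ckpt_dir)
    if not steps:
        return cmd  # first session: nothing to resume from
    cmd.append("--resume")
    if skip_latest and len(steps) > 1:
        # Fall back to the previous checkpoint if the latest one is corrupted.
        cmd += ["--checkpoint_step", str(steps[-2])]
    return cmd


if __name__ == "__main__":
    print(" ".join(build_command("configs/flint_60h.yaml", "checkpoints")))
```

The returned list can be passed directly to `subprocess.run`.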
## Post-Training (after pretraining is complete)
### Recommended sequence:
1. **SFT** (2-4h): SmolTalk full dataset
2. **GRPO** (4-8h): Math/code with verifiable rewards (DeepSeek-R1 recipe)
3. **Tool DPO** (2-4h): Correct vs incorrect tool-call pairs
4. **Quantization**: INT4 AWQ for edge deployment
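Post-training draws on the same TPU budget, so it is worth reserving hours for it before choosing a pretraining tier. The sketch below sums the hour ranges listed above. Quantization is left out because no time estimate is given for it, and the helper names are illustrative.

```python
# (stage, minimum hours, maximum hours), from the recommended sequence above.
STAGES = [
    ("SFT", 2, 4),
    ("GRPO", 4, 8),
    ("Tool DPO", 2, 4),
]


def post_training_hours(stages=STAGES):
    """Return (minimum, maximum) total hours across all post-training stages."""
    low = sum(lo for _, lo, _ in stages)
    high = sum(hi for _, _, hi in stages)
    return low, high


def pretraining_hours_left(total_budget_hours, stages=STAGES):
    """Return (worst case, best case) hours remaining for pretraining."""
    low, high = post_training_hours(stages)
    return total_budget_hours - high, total_budget_hours - low


if __name__ == "__main__":
    low, high = post_training_hours()
    print(f"post-training: {low}-{high}h")
    worst, best = pretraining_hours_left(76)
    print(f"with 76h total: {worst}-{best}h left for pretraining")
```

With the ranges above, the full sequence takes 8 to 16 hours. A 76-hour budget therefore still covers the 60h pretraining tier, even with every post-training stage at its upper bound.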
### If you have unlimited compute:
- Scale to 1.7B with same TAP curriculum
- Add GRPO phase during pretraining (online RL, DeepSeek-R1-Zero style)
- Multi-epoch training with data decontamination between epochs
- Progressive context extension: 2K → 4K → 8K → 16K