Flint-1.2B: Scaling Strategy Guide
Deciding how to spend your TPU hours
Decision Matrix
| TPU Hours | Strategy | Config | Tokens | Key Focus |
|---|---|---|---|---|
| 17h | 3-stage TAP | flint_17h.yaml | ~6.5B | Foundation + early reasoning |
| 40h | 4-stage TAP | flint_40h.yaml | ~15B | Balanced reasoning + agency |
| 60h | 5-stage TAP | flint_60h.yaml | ~23B | Deep capability crystallization |
| 100h+ | 6-stage + context ext | flint_100h.yaml | ~40B | Full + 8K context |
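
The matrix can be read as a lookup from available TPU hours to the largest plan that fits. The following is a minimal sketch of that lookup; the `Plan` type and `pick_plan` helper are hypothetical names for illustration and are not part of `train_flint.py`. Only the row values come from the table.

```python
from typing import NamedTuple, Optional


class Plan(NamedTuple):
    min_hours: int
    stages: str
    config: str
    tokens_b: float  # approximate token budget, in billions


# Rows copied from the decision matrix.
PLANS = [
    Plan(17, "3-stage TAP", "flint_17h.yaml", 6.5),
    Plan(40, "4-stage TAP", "flint_40h.yaml", 15.0),
    Plan(60, "5-stage TAP", "flint_60h.yaml", 23.0),
    Plan(100, "6-stage + context ext", "flint_100h.yaml", 40.0),
]


def pick_plan(tpu_hours: float) -> Optional[Plan]:
    """Return the largest plan whose hour budget fits, or None below 17h."""
    fitting = [p for p in PLANS if p.min_hours <= tpu_hours]
    return fitting[-1] if fitting else None
```

With 45 hours available, for example, `pick_plan(45)` selects the 40h plan rather than rounding up to a budget that cannot be completed.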
What Each Extra Hour Buys You
17h → 40h (+23h)
- +8.5B tokens (2.3× total)
- Adds dedicated "Polish" stage for balanced skill refinement
- Full Orca-AgentInstruct (1M examples vs subset)
- FineMath-3plus gets proper multi-epoch exposure
- Expected gain: +5-8 MMLU, +8-12 GSM8K
40h → 60h (+20h)
- +8B tokens (1.5× over 40h)
- Adds dedicated "Agency" stage focused on tool mastery
- 5th "Crystal" annealing stage with premium data only
- Multi-language code (10+ languages vs 5)
- Expected gain: +3-5 MMLU, +5-8 GSM8K, +10-15 tool-use
60h → 100h+ (+40h)
- +17B tokens (1.7× over 60h)
- Context extension to 8K (final 10h)
- arXiv + Wikipedia for factual grounding
- Maximum overtraining of the four plans (~33 tokens/param, still close to Chinchilla-optimal)
- Expected gain: +3-5 MMLU, +3-5 GSM8K, 8K context
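
The token deltas and multipliers quoted for each step follow directly from the decision matrix. As a quick arithmetic check, the sketch below recomputes them and adds the marginal tokens per extra hour. `step_summary` is a hypothetical helper, and the 100h+ tier is treated as exactly 100 hours, which is an assumption.

```python
# Hour budget -> approximate token budget (billions), from the decision matrix.
budgets = {17: 6.5, 40: 15.0, 60: 23.0, 100: 40.0}


def step_summary(h_from: int, h_to: int) -> dict:
    """Extra tokens, growth factor, and marginal tokens/hour between two budgets."""
    t_from, t_to = budgets[h_from], budgets[h_to]
    extra = t_to - t_from
    return {
        "extra_tokens_b": extra,
        "growth": round(t_to / t_from, 2),
        "marginal_b_per_hour": round(extra / (h_to - h_from), 3),
    }


for a, b in [(17, 40), (40, 60), (60, 100)]:
    print(f"{a}h -> {b}h:", step_summary(a, b))
```

By these figures each step adds roughly 0.37B to 0.43B tokens per extra hour, so the tiers differ mainly in which stages and datasets the additional tokens pay for, not in throughput.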
Scaling Principles (from literature)
Overtraining works reliably (arxiv:2403.08540)
- Performance scales predictably up to 640 tokens/param (32× Chinchilla-optimal)
- Diminishing returns but never negative
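
To place each budget on that scale, the sketch below computes tokens per parameter for the four plans. It assumes the common rule of thumb of about 20 tokens per parameter for Chinchilla-optimal training. That constant and the `token_multiplier` helper are assumptions for illustration, not values taken from the Flint configs.

```python
PARAMS_B = 1.2           # Flint-1.2B parameter count, in billions
CHINCHILLA_RATIO = 20.0  # ASSUMED rule of thumb: ~20 tokens per parameter


def token_multiplier(tokens_b: float, params_b: float = PARAMS_B) -> float:
    """Training tokens per model parameter."""
    return tokens_b / params_b


for hours, tokens_b in [(17, 6.5), (40, 15.0), (60, 23.0), (100, 40.0)]:
    m = token_multiplier(tokens_b)
    print(f"{hours}h: {m:.1f} tokens/param "
          f"({m / CHINCHILLA_RATIO:.2f}x the assumed Chinchilla ratio)")
```

Under this assumption the 60h plan sits at about 19 tokens/param and the 100h plan at about 33, both far below the 640 tokens/param range studied in the cited paper.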
Data repetition is safe up to ~4 epochs (Muennighoff et al.)
- OpenThoughts-114k can be repeated 5× safely
- Large web data (FineWeb-Edu, DCLM) should not repeat
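
One way to enforce these repetition limits when planning a data mix is to attach an epoch cap to each source and check token requests against it. The sketch below does that. The `Source` type, both helpers, and every dataset size in it are hypothetical placeholders; only the caps (1 epoch for web data, 5 for OpenThoughts-114k) come from the text.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Source:
    name: str
    unique_tokens_b: float  # PLACEHOLDER size in billions, not a measured figure
    max_epochs: float       # repetition cap for this source


def epochs_needed(source: Source, tokens_wanted_b: float) -> float:
    """Epochs required to draw tokens_wanted_b from one source; raises past the cap."""
    epochs = tokens_wanted_b / source.unique_tokens_b
    if epochs > source.max_epochs:
        raise ValueError(
            f"{source.name}: {epochs:.1f} epochs exceeds cap of {source.max_epochs}"
        )
    return epochs


def max_usable_tokens_b(sources: List[Source]) -> float:
    """Upper bound on training tokens (billions) with every source at its cap."""
    return sum(s.unique_tokens_b * s.max_epochs for s in sources)


# Caps follow the text; sizes are made-up placeholders.
mix = [
    Source("FineWeb-Edu", unique_tokens_b=10.0, max_epochs=1.0),
    Source("OpenThoughts-114k", unique_tokens_b=0.5, max_epochs=5.0),
]
```

Calling `epochs_needed` for each source before launching a stage turns an accidental over-repetition into an early error instead of a silent quality regression.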
Quality annealing gives outsized returns (SmolLM2, OLMo-2)
- Final 10-15% of training with best data = biggest benchmark jumps
- FineMath-4+ and hard reasoning problems during LR decay
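
The annealing idea pairs a learning-rate decay window with a switch to the highest-quality data. The sketch below shows one way to express that as a warmup-stable-decay schedule. The function name, the peak learning rate, the warmup fraction, and the linear decay shape are all assumptions for illustration; the actual schedule is whatever the Flint configs define.

```python
from typing import Tuple


def lr_and_phase(
    step: int,
    total_steps: int,
    peak_lr: float = 3e-4,      # ASSUMED value, for illustration only
    warmup_frac: float = 0.01,  # ASSUMED warmup length
    anneal_frac: float = 0.15,  # final 10-15% of training, per the text
    min_lr_ratio: float = 0.1,  # ASSUMED floor as a fraction of peak_lr
) -> Tuple[float, str]:
    """Return (learning_rate, data_phase) for a warmup-stable-decay schedule."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    anneal_start = total_steps - int(total_steps * anneal_frac)
    if step < warmup_steps:
        # Linear warmup up to peak_lr.
        return peak_lr * (step + 1) / warmup_steps, "main"
    if step < anneal_start:
        # Stable phase on the main data mix.
        return peak_lr, "main"
    # Linear decay toward min_lr_ratio * peak_lr while training on premium data.
    progress = (step - anneal_start) / max(1, total_steps - anneal_start)
    return peak_lr * (1.0 - (1.0 - min_lr_ratio) * progress), "anneal"
```

A data loader can key off the returned phase string to swap in the premium mix (for example FineMath-4+ and hard reasoning problems) at exactly the step where decay begins.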
Inference-optimal training favors overtraining (arxiv:2401.00448)
- When inference cost matters (deployment), train a smaller model for longer
- 1.2B × 23B tokens beats 1.7B × 16B tokens for inference efficiency
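
The comparison is easier to see with the standard back-of-the-envelope compute estimates: about 6 FLOPs per parameter per training token, and about 2 FLOPs per parameter per generated token. The sketch below applies those approximations to the two configurations. The helper names are hypothetical, and the estimates ignore attention and embedding costs.

```python
def train_flops(params: float, tokens: float) -> float:
    """Approximate training compute: ~6 FLOPs per parameter per token."""
    return 6.0 * params * tokens


def inference_flops_per_token(params: float) -> float:
    """Approximate forward-pass compute: ~2 FLOPs per parameter per token."""
    return 2.0 * params


runs = {"1.2B x 23B": (1.2e9, 23e9), "1.7B x 16B": (1.7e9, 16e9)}
for name, (n_params, n_tokens) in runs.items():
    print(
        name,
        f"train={train_flops(n_params, n_tokens):.3e}",
        f"infer/token={inference_flops_per_token(n_params):.2e}",
    )
```

Under these approximations the two runs cost about the same to train (within roughly 2%), while the 1.2B model needs about 29% fewer FLOPs per generated token. Whether it also matches the larger model on quality is the claim of the cited paper, not something this arithmetic shows.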
Muon gives ~2× sample efficiency (arxiv:2502.16982)
- Your 23B tokens with Muon ≈ 46B tokens with AdamW
- Stacks with quality data → multiplicative gains
Multi-Session Management
For runs spanning multiple Kaggle sessions:
```bash
# Session 1: Start training
python train_flint.py --config configs/flint_60h.yaml

# Session 2: Auto-resumes from latest checkpoint
python train_flint.py --config configs/flint_60h.yaml --resume

# Resume from specific step (if latest is corrupted)
python train_flint.py --config configs/flint_60h.yaml --resume --checkpoint_step 5000

# List available checkpoints
python train_flint.py --config configs/flint_60h.yaml --list_checkpoints
```
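
For notebooks that restart unattended, the choice between a fresh start and a resume can be scripted. The sketch below builds the command using only the flags shown above. It assumes checkpoints are stored as `step_<N>` entries inside a checkpoint directory, which is a hypothetical layout; check what `train_flint.py` actually writes before relying on it. `latest_step` and `build_command` are illustrative helper names.

```python
import re
from pathlib import Path
from typing import List, Optional


def latest_step(ckpt_dir: Path) -> Optional[int]:
    """Highest N among entries named step_<N> (HYPOTHETICAL layout), or None."""
    if not ckpt_dir.is_dir():
        return None
    steps = [
        int(m.group(1))
        for p in ckpt_dir.iterdir()
        if (m := re.fullmatch(r"step_(\d+)", p.name))
    ]
    return max(steps) if steps else None


def build_command(
    config: str, ckpt_dir: Path, pin_step: Optional[int] = None
) -> List[str]:
    """Build the train_flint.py command for this session."""
    cmd = ["python", "train_flint.py", "--config", config]
    if pin_step is not None:
        # Explicit step, e.g. when the latest checkpoint is corrupted.
        cmd += ["--resume", "--checkpoint_step", str(pin_step)]
    elif latest_step(ckpt_dir) is not None:
        cmd.append("--resume")
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` at the top of each session so the same cell works for both the first and every later session.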
Post-Training (after pretraining is complete)
Recommended sequence:
- SFT (2-4h): SmolTalk full dataset
- GRPO (4-8h): Math/code with verifiable rewards (DeepSeek-R1 recipe)
- Tool DPO (2-4h): Correct vs incorrect tool-call pairs
- Quantization: INT4 AWQ for edge deployment
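
For the GRPO step, "verifiable rewards" means the reward is computed by checking the output, not by a learned reward model. GRPO then scores each sample relative to the other samples drawn for the same prompt. The sketch below illustrates both pieces in simplified form. The last-number answer check is a deliberately crude assumption, and neither function is taken from the Flint codebase or from a specific RL library.

```python
import re
import statistics
from typing import List


def math_reward(completion: str, reference: str) -> float:
    """1.0 if the last number in the completion equals the reference answer, else 0.0."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    return 1.0 if numbers and float(numbers[-1]) == float(reference) else 0.0


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one prompt's sample group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled completions for one prompt whose reference answer is 42.
samples = ["... so the answer is 42.", "I get 41", "The result is 42", "unsure"]
rewards = [math_reward(s, "42") for s in samples]
advantages = group_advantages(rewards)
```

When every sample in a group gets the same reward, all advantages are zero and the prompt contributes no gradient signal, which is one reason to pick problems the model solves only some of the time.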
If you have unlimited compute:
- Scale to 1.7B with same TAP curriculum
- Add GRPO phase during pretraining (online RL, DeepSeek-R1-Zero style)
- Multi-epoch training with data decontamination between epochs
- Progressive context extension: 2K → 4K → 8K → 16K