| # Flint-1.2B: Scaling Strategy Guide |
| # Deciding how to spend your TPU hours |
|
|
| ## Decision Matrix |
|
|
| | TPU Hours | Strategy | Config | Tokens | Key Focus | |
| |-----------|----------|--------|--------|-----------| |
| | **17h** | 3-stage TAP | `flint_17h.yaml` | ~6.5B | Foundation + early reasoning | |
| | **40h** | 4-stage TAP | `flint_40h.yaml` | ~15B | Balanced reasoning + agency | |
| | **60h** | 5-stage TAP | `flint_60h.yaml` | ~23B | Deep capability crystallization | |
| | **100h+** | 6-stage + context ext | `flint_100h.yaml` | ~40B | Full curriculum + 8K context | |
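
The matrix can be read as a simple lookup: take the largest tier whose hour requirement fits your budget. The sketch below encodes only the rows above; the helper names `pick_tier` and `implied_throughput` are illustrative and not part of the Flint codebase.

```python
# Rows copied from the decision matrix:
# (minimum TPU hours, config file, approximate tokens in billions).
TIERS = [
    (17, "flint_17h.yaml", 6.5),
    (40, "flint_40h.yaml", 15.0),
    (60, "flint_60h.yaml", 23.0),
    (100, "flint_100h.yaml", 40.0),
]


def pick_tier(tpu_hours):
    """Return the largest tier that fits the budget, or None below 17h."""
    best = None
    for min_hours, config, tokens_b in TIERS:
        if tpu_hours >= min_hours:
            best = (min_hours, config, tokens_b)
    return best


def implied_throughput(min_hours, tokens_b):
    """Billions of tokens per TPU hour implied by one matrix row."""
    return tokens_b / min_hours


if __name__ == "__main__":
    for hours in (10, 17, 45, 60, 120):
        print(hours, pick_tier(hours))
    for min_hours, config, tokens_b in TIERS:
        print(config, round(implied_throughput(min_hours, tokens_b), 3))
```

Every row implies roughly 0.38 to 0.40B tokens per TPU hour, so the token column is essentially the hour budget multiplied by a near-constant throughput.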
|
|
| ## What Each Extra Hour Buys You |
|
|
| ### 17h → 40h (+23h) |
| - **+8.5B tokens** (2.3× total) |
| - Adds dedicated "Polish" stage for balanced skill refinement |
| - Full Orca-AgentInstruct (1M examples vs subset) |
| - FineMath-3plus gets proper multi-epoch exposure |
| - **Expected gain: +5-8 MMLU, +8-12 GSM8K** |
|
|
| ### 40h → 60h (+20h) |
| - **+8B tokens** (1.5× over 40h) |
| - Adds dedicated "Agency" stage focused on tool mastery |
| - 5th "Crystal" annealing stage with premium data only |
| - Multi-language code (10+ languages vs 5) |
| - **Expected gain: +3-5 MMLU, +5-8 GSM8K, +10-15 tool-use** |
|
|
| ### 60h → 100h+ (+40h) |
| - **+17B tokens** (1.7× over 60h) |
| - Context extension to 8K (final 10h) |
| - arXiv + Wikipedia for factual grounding |
| - Maximum overtraining of the four tiers: ~33 tokens/param, past the Chinchilla rule of thumb of ~20 tokens/param |
| - **Expected gain: +3-5 MMLU, +3-5 GSM8K, 8K context** |
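
The token deltas and multipliers quoted for the three upgrade steps follow directly from the token column of the decision matrix. A minimal sketch of that arithmetic, with `upgrade_summary` as an illustrative helper name:

```python
# Approximate token budgets per tier, in billions (from the decision matrix).
TOKENS_B = {"17h": 6.5, "40h": 15.0, "60h": 23.0, "100h+": 40.0}


def upgrade_summary(src, dst):
    """Return (added tokens in billions, multiplier) for moving src -> dst."""
    added = TOKENS_B[dst] - TOKENS_B[src]
    ratio = TOKENS_B[dst] / TOKENS_B[src]
    return added, ratio


if __name__ == "__main__":
    for src, dst in (("17h", "40h"), ("40h", "60h"), ("60h", "100h+")):
        added, ratio = upgrade_summary(src, dst)
        print(f"{src} -> {dst}: +{added:.1f}B tokens, {ratio:.1f}x")
```

The expected benchmark gains are not derivable this way; they remain the guide's estimates.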
|
|
| ## Scaling Principles (from literature) |
|
|
| 1. **Overtraining works reliably** (arxiv:2403.08540) |
| - Performance scales predictably up to 640 tokens/param |
| - Returns diminish but stay positive over that range |
|
|
| 2. **Data repetition is safe up to ~4 epochs** (Muennighoff et al.) |
| - OpenThoughts-114k can be repeated 5× (slightly past that guideline, so expect somewhat smaller gains from the final pass) |
| - Large web data (FineWeb-Edu, DCLM) should not repeat |
|
|
| 3. **Quality annealing gives outsized returns** (SmolLM2, OLMo-2) |
| - Final 10-15% of training with best data = biggest benchmark jumps |
| - FineMath-4plus and hard reasoning problems during LR decay |
|
|
| 4. **Inference-optimal training favors overtraining** (arxiv:2401.00448) |
| - For deployment (inference cost matters): train smaller model longer |
| - 1.2B × 23B tokens beats 1.7B × 16B tokens for inference efficiency at roughly equal training compute |
|
|
| 5. **Muon gives ~2× sample efficiency** (arxiv:2502.16982) |
| - Your 23B tokens with Muon ≈ 46B tokens with AdamW |
| - Stacks with quality data for multiplicative gains |
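
The numbers behind principles 4 and 5 can be checked with a few lines of arithmetic. This sketch assumes the common approximation that training compute is about 6 × parameters × tokens, and takes the ~2× Muon factor at face value from the list above; the function names are illustrative.

```python
PARAMS_FLINT = 1.2e9  # parameter count implied by the model name


def tokens_per_param(tokens, params):
    """Training tokens seen per model parameter."""
    return tokens / params


def train_flops(params, tokens):
    """Approximate training compute using the common C ~ 6 * N * D rule."""
    return 6 * params * tokens


def adamw_equivalent_tokens(tokens, muon_gain=2.0):
    """Tokens AdamW would need to match, assuming the quoted ~2x factor."""
    return tokens * muon_gain


if __name__ == "__main__":
    # Tokens per parameter for each tier of the decision matrix.
    for label, tokens in (("17h", 6.5e9), ("40h", 15e9),
                          ("60h", 23e9), ("100h+", 40e9)):
        print(label, round(tokens_per_param(tokens, PARAMS_FLINT), 1))

    # Principle 4: the two options cost nearly the same training compute.
    small = train_flops(1.2e9, 23e9)
    large = train_flops(1.7e9, 16e9)
    print(f"1.2B x 23B: {small:.3e} FLOPs, 1.7B x 16B: {large:.3e} FLOPs")

    # Principle 5: AdamW-equivalent tokens for the 60h tier.
    print(adamw_equivalent_tokens(23e9) / 1e9, "B tokens")
```

Under these assumptions the two options in principle 4 differ by under 2% in training compute, so the comparison is effectively iso-compute and the smaller model's advantage is its lower per-token inference cost.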
|
|
| ## Multi-Session Management |
|
|
| For runs spanning multiple Kaggle sessions: |
|
|
| ```bash |
| # Session 1: Start training |
| python train_flint.py --config configs/flint_60h.yaml |
| |
| # Session 2: Auto-resumes from latest checkpoint |
| python train_flint.py --config configs/flint_60h.yaml --resume |
| |
| # Resume from specific step (if latest is corrupted) |
| python train_flint.py --config configs/flint_60h.yaml --resume --checkpoint_step 5000 |
| |
| # List available checkpoints |
| python train_flint.py --config configs/flint_60h.yaml --list_checkpoints |
| ``` |
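
How `train_flint.py` locates checkpoints is not shown in this guide. As a rough illustration of what `--resume` and `--checkpoint_step` might do internally, the sketch below assumes a hypothetical layout in which each checkpoint is a directory named `step_<N>`; the directory naming and both function names are assumptions, not the script's actual behavior.

```python
import re
from pathlib import Path

# Hypothetical naming scheme: one directory per checkpoint, e.g. "step_5000".
STEP_PATTERN = re.compile(r"^step_(\d+)$")


def list_checkpoint_steps(ckpt_dir):
    """Return all checkpoint step numbers found in ckpt_dir, ascending."""
    root = Path(ckpt_dir)
    if not root.is_dir():
        return []
    steps = []
    for entry in root.iterdir():
        match = STEP_PATTERN.match(entry.name)
        if match and entry.is_dir():
            steps.append(int(match.group(1)))
    return sorted(steps)


def resolve_resume_step(ckpt_dir, requested_step=None):
    """Pick the step to resume from.

    With no requested step, use the latest checkpoint (or None if there is
    none, meaning training starts from scratch). With a requested step,
    fail loudly if that checkpoint does not exist.
    """
    steps = list_checkpoint_steps(ckpt_dir)
    if requested_step is None:
        return steps[-1] if steps else None
    if requested_step not in steps:
        raise ValueError(
            f"No checkpoint for step {requested_step}; available: {steps}"
        )
    return requested_step
```

Sorting by the parsed integer rather than by directory name matters here: a plain string sort would place `step_10000` before `step_5000`.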
|
|
| ## Post-Training (after pretraining is complete) |
|
|
| ### Recommended sequence: |
| 1. **SFT** (2-4h): SmolTalk full dataset |
| 2. **GRPO** (4-8h): Math/code with verifiable rewards (DeepSeek-R1 recipe) |
| 3. **Tool DPO** (2-4h): Correct vs incorrect tool-call pairs |
| 4. **Quantization**: INT4 AWQ for edge deployment |
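
When planning sessions it helps to total the timed stages up front. A small sketch using the hour ranges listed above; quantization is left out because the guide gives no time estimate for it, and `total_budget` is an illustrative name.

```python
# (stage, minimum hours, maximum hours) from the recommended sequence.
POST_TRAINING_STAGES = [
    ("SFT", 2, 4),
    ("GRPO", 4, 8),
    ("Tool DPO", 2, 4),
]


def total_budget(stages):
    """Return (minimum, maximum) total hours across all timed stages."""
    low = sum(lo for _, lo, _ in stages)
    high = sum(hi for _, _, hi in stages)
    return low, high


if __name__ == "__main__":
    low, high = total_budget(POST_TRAINING_STAGES)
    print(f"Post-training needs roughly {low}-{high} TPU hours")
```

The timed stages add up to 8-16 hours on top of whichever pretraining tier you choose.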
|
|
| ### If you have unlimited compute: |
| - Scale to 1.7B with same TAP curriculum |
| - Add GRPO phase during pretraining (online RL, DeepSeek-R1-Zero style) |
| - Multi-epoch training with data decontamination between epochs |
| - Progressive context extension: 2K → 4K → 8K → 16K |
|
|