# Flint-1.2B: Scaling Strategy Guide
# Deciding how to spend your TPU hours

## Decision Matrix

| TPU Hours | Strategy | Config | Tokens | Key Focus |
|-----------|----------|--------|--------|-----------|
| **17h** | 3-stage TAP | `flint_17h.yaml` | ~6.5B | Foundation + early reasoning |
| **40h** | 4-stage TAP | `flint_40h.yaml` | ~15B | Balanced reasoning + agency |
| **60h** | 5-stage TAP | `flint_60h.yaml` | ~23B | Deep capability crystallization |
| **100h+** | 6-stage TAP + context extension | `flint_100h.yaml` | ~40B | Full + 8K context |

## What Each Extra Hour Buys You

### 17h → 40h (+23h)
- **+8.5B tokens** (2.3× total)
- Adds a dedicated "Polish" stage for balanced skill refinement
- Full Orca-AgentInstruct (1M examples vs. a subset)
- FineMath-3plus gets proper multi-epoch exposure
- **Expected gain: +5-8 MMLU, +8-12 GSM8K**

### 40h → 60h (+20h)
- **+8B tokens** (1.5× over 40h)
- Adds a dedicated "Agency" stage focused on tool mastery
- Adds a 5th "Crystal" annealing stage with premium data only
- Multi-language code (10+ languages vs. 5)
- **Expected gain: +3-5 MMLU, +5-8 GSM8K, +10-15 tool-use**

### 60h → 100h+ (+40h)
- **+17B tokens** (1.7× over 60h)
- Context extension to 8K (final 10h)
- arXiv + Wikipedia for factual grounding
- Most overtrained tier: ~33 tokens/param, past the ~20 tokens/param Chinchilla-optimal point
- **Expected gain: +3-5 MMLU, +3-5 GSM8K, 8K context**

## Scaling Principles (from literature)

1. **Overtraining works reliably** (arxiv:2403.08540)
   - Performance scales predictably up to 640 tokens per parameter
   - Diminishing returns but never negative

2. **Data repetition is safe up to ~4 epochs** (Muennighoff et al.)
   - OpenThoughts-114k can be repeated 5× safely
   - Large web data (FineWeb-Edu, DCLM) should not repeat

3. **Quality annealing gives outsized returns** (SmolLM2, OLMo-2)
   - Final 10-15% of training with best data = biggest benchmark jumps
   - FineMath-4+ and hard reasoning problems during LR decay

4. **Inference-optimal training favors overtraining** (arxiv:2401.00448)
   - For deployment (where inference cost matters): train a smaller model longer
   - 1.2B × 23B tokens beats 1.7B × 16B tokens for inference efficiency at roughly equal training compute

5. **Muon gives ~2× sample efficiency** (arxiv:2502.16982)
   - Your 23B tokens with Muon ≈ 46B tokens with AdamW
   - Stacks with quality data → multiplicative gains

## Multi-Session Management

For runs spanning multiple Kaggle sessions:

```bash
# Session 1: Start training
python train_flint.py --config configs/flint_60h.yaml

# Session 2: Auto-resumes from latest checkpoint
python train_flint.py --config configs/flint_60h.yaml --resume

# Resume from specific step (if latest is corrupted)
python train_flint.py --config configs/flint_60h.yaml --resume --checkpoint_step 5000

# List available checkpoints
python train_flint.py --config configs/flint_60h.yaml --list_checkpoints
```

## Post-Training (after pretraining is complete)

### Recommended sequence:
1. **SFT** (2-4h): SmolTalk full dataset
2. **GRPO** (4-8h): Math/code with verifiable rewards (DeepSeek-R1 recipe)
3. **Tool DPO** (2-4h): Correct vs. incorrect tool-call pairs
4. **Quantization**: INT4 AWQ for edge deployment

### If you have unlimited compute:
- Scale to 1.7B with the same TAP curriculum
- Add a GRPO phase during pretraining (online RL, DeepSeek-R1-Zero style)
- Multi-epoch training with data decontamination between epochs
- Progressive context extension: 2K → 4K → 8K → 16K
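
The token budgets in the Decision Matrix can be sanity-checked with a few lines of arithmetic. The sketch below is not part of the Flint codebase. It assumes 1.2e9 parameters (read off the model name), uses ~20 tokens/param as the usual Chinchilla rule of thumb, and treats the open-ended "100h+" tier as exactly 100 hours. The names `BUDGETS` and `summarize` are hypothetical.

```python
# Back-of-the-envelope check of the Decision Matrix budgets.
# Assumptions (not from the Flint codebase): 1.2e9 parameters taken from the
# model name, ~20 tokens/param as the Chinchilla rule of thumb, and 100 h
# standing in for the open-ended "100h+" tier.

PARAMS = 1.2e9
CHINCHILLA_TOKENS_PER_PARAM = 20.0

# config name -> (TPU hours, approximate training tokens), from the matrix.
BUDGETS = {
    "flint_17h.yaml": (17, 6.5e9),
    "flint_40h.yaml": (40, 15e9),
    "flint_60h.yaml": (60, 23e9),
    "flint_100h.yaml": (100, 40e9),
}


def summarize(hours, tokens, params=PARAMS):
    """Return throughput and overtraining ratios for one budget tier."""
    tokens_per_param = tokens / params
    return {
        "tokens_per_hour": tokens / hours,
        "tokens_per_param": tokens_per_param,
        # > 1.0 means the tier trains past the Chinchilla rule of thumb.
        "chinchilla_multiple": tokens_per_param / CHINCHILLA_TOKENS_PER_PARAM,
    }


if __name__ == "__main__":
    for config, (hours, tokens) in BUDGETS.items():
        s = summarize(hours, tokens)
        print(
            f"{config:<16} "
            f"{s['tokens_per_hour'] / 1e9:.2f}B tok/h  "
            f"{s['tokens_per_param']:.1f} tok/param  "
            f"{s['chinchilla_multiple']:.2f}x Chinchilla"
        )
```

Under these assumptions, every tier implies a throughput of roughly 0.4B tokens per TPU hour, the 60h tier lands just under the Chinchilla point, and only the 100h+ tier is clearly past it. If your measured throughput differs, rescale the token column before choosing a config.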