Flint-1.2B: Scaling Strategy Guide

Deciding how to spend your TPU hours

Decision Matrix

| TPU Hours | Strategy | Config | Tokens | Key Focus |
|---|---|---|---|---|
| 17h | 3-stage TAP | flint_17h.yaml | ~6.5B | Foundation + early reasoning |
| 40h | 4-stage TAP | flint_40h.yaml | ~15B | Balanced reasoning + agency |
| 60h | 5-stage TAP | flint_60h.yaml | ~23B | Deep capability crystallization |
| 100h+ | 6-stage + context ext | flint_100h.yaml | ~40B | Full + 8K context |
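
As a rough illustration, the matrix can be read as a lookup from available TPU hours to a config file. The sketch below copies the tier values from the table; `TIERS` and `pick_config` are hypothetical names used only for this example and are not part of `train_flint.py`.

```python
# Rows copied from the Decision Matrix:
# (minimum TPU hours, config file, approximate tokens in billions).
TIERS = [
    (17, "flint_17h.yaml", 6.5),
    (40, "flint_40h.yaml", 15.0),
    (60, "flint_60h.yaml", 23.0),
    (100, "flint_100h.yaml", 40.0),
]


def pick_config(tpu_hours: float) -> str:
    """Return the config of the largest tier that fits the hour budget.

    Hypothetical helper: the guide does not define how to handle budgets
    that fall between tiers, so this sketch simply rounds down.
    """
    fitting = [config for hours, config, _ in TIERS if hours <= tpu_hours]
    if not fitting:
        raise ValueError("budget is below the smallest tier (17h)")
    return fitting[-1]


if __name__ == "__main__":
    for budget in (17, 45, 60, 120):
        print(f"{budget}h -> {pick_config(budget)}")
```

Rounding down is a conservative choice: a 45h budget runs the 40h recipe and leaves a few hours of slack for restarts.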

What Each Extra Hour Buys You

17h → 40h (+23h)

  • +8.5B tokens (2.3× the 17h token count)
  • Adds dedicated "Polish" stage for balanced skill refinement
  • Full Orca-AgentInstruct (1M examples vs subset)
  • FineMath-3plus gets proper multi-epoch exposure
  • Expected gain: +5-8 MMLU, +8-12 GSM8K

40h → 60h (+20h)

  • +8B tokens (1.5× over 40h)
  • Adds dedicated "Agency" stage focused on tool mastery
  • 5th "Crystal" annealing stage with premium data only
  • Multi-language code (10+ languages vs 5)
  • Expected gain: +3-5 MMLU, +5-8 GSM8K, +10-15 tool-use

60h → 100h+ (+40h)

  • +17B tokens (1.7× over 60h)
  • Context extension to 8K (final 10h)
  • arXiv + Wikipedia for factual grounding
  • Heaviest overtraining of the four tiers: ~40B tokens on 1.2B parameters is about 33 tokens/param, beyond the Chinchilla-optimal ratio of roughly 20
  • Expected gain: +3-5 MMLU, +3-5 GSM8K, 8K context
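
The token deltas and ratios quoted in this section follow directly from the approximate token budgets in the Decision Matrix. A small sketch of that arithmetic, where `TOKENS_B` and `upgrade_summary` are illustrative names and the budgets are the rounded figures from the table:

```python
# Approximate token budgets (billions) per tier, taken from the Decision Matrix.
TOKENS_B = {17: 6.5, 40: 15.0, 60: 23.0, 100: 40.0}


def upgrade_summary(from_h: int, to_h: int) -> dict:
    """Extra hours, extra tokens, and growth ratio for moving between two tiers."""
    extra_hours = to_h - from_h
    extra_tokens = TOKENS_B[to_h] - TOKENS_B[from_h]
    return {
        "extra_hours": extra_hours,
        "extra_tokens_b": extra_tokens,
        "ratio": round(TOKENS_B[to_h] / TOKENS_B[from_h], 1),
        "tokens_b_per_extra_hour": extra_tokens / extra_hours,
    }


if __name__ == "__main__":
    for a, b in [(17, 40), (40, 60), (60, 100)]:
        print(f"{a}h -> {b}h:", upgrade_summary(a, b))
```

On these rounded figures each extra TPU hour buys roughly 0.4B tokens at every step, so the shrinking benchmark gains reflect diminishing returns per token rather than lower throughput.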

Scaling Principles (from literature)

  1. Overtraining works reliably (arxiv:2403.08540)

    • Performance scales predictably up to 640 tokens per parameter
    • Returns diminish but stay positive across that range
  2. Data repetition is safe up to ~4 epochs (Muennighoff et al.)

    • OpenThoughts-114k is small enough to repeat (up to 5×, slightly past the ~4-epoch guideline)
    • Large web data (FineWeb-Edu, DCLM) should not repeat
  3. Quality annealing gives outsized returns (SmolLM2, OLMo-2)

    • Final 10-15% of training with best data = biggest benchmark jumps
    • FineMath-4+ and hard reasoning problems during LR decay
  4. Inference-optimal training favors overtraining (arxiv:2401.00448)

    • For deployment (inference cost matters): train smaller model longer
    • 1.2B params × 23B tokens beats 1.7B params × 16B tokens for inference efficiency: training compute is about equal, but the smaller model is cheaper per generated token
  5. Muon gives ~2× sample efficiency (arxiv:2502.16982)

    • Your 23B tokens with Muon ≈ 46B tokens with AdamW
    • Expected to stack with quality data → roughly multiplicative gains
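
The ratios behind principles 1 and 4 can be checked with back-of-the-envelope arithmetic. The sketch below assumes the common approximation of 6 FLOPs per parameter per training token and takes "1.2B" as exactly 1.2e9 parameters. Both are assumptions for illustration, not figures from the Flint configs.

```python
PARAMS = 1.2e9  # assumption: "1.2B" taken as exactly 1.2e9 parameters
CHINCHILLA_TOKENS_PER_PARAM = 20  # widely quoted rule of thumb, approximate


def tokens_per_param(params: float, tokens: float) -> float:
    """Training tokens seen per model parameter."""
    return tokens / params


def train_flops(params: float, tokens: float) -> float:
    """Approximate training compute using the 6 * N * D rule of thumb."""
    return 6.0 * params * tokens


if __name__ == "__main__":
    # Token budgets per tier, from the Decision Matrix (approximate).
    for hours, tokens in [(17, 6.5e9), (40, 15e9), (60, 23e9), (100, 40e9)]:
        ratio = tokens_per_param(PARAMS, tokens)
        multiple = ratio / CHINCHILLA_TOKENS_PER_PARAM
        print(f"{hours}h: {ratio:.1f} tokens/param ({multiple:.2f}x the rule of thumb)")

    # Principle 4: the two candidate runs cost about the same to train.
    small = train_flops(1.2e9, 23e9)
    large = train_flops(1.7e9, 16e9)
    print(f"1.2B x 23B: {small:.3e} FLOPs")
    print(f"1.7B x 16B: {large:.3e} FLOPs")
    print(f"ratio: {small / large:.3f}")
```

Under these assumptions the 60h tier lands close to 20 tokens per parameter and the 100h+ tier at about 33, while the two runs in principle 4 differ by under 2% in training compute.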

Multi-Session Management

For runs spanning multiple Kaggle sessions:

# Session 1: Start training
python train_flint.py --config configs/flint_60h.yaml

# Session 2: Auto-resumes from latest checkpoint
python train_flint.py --config configs/flint_60h.yaml --resume

# Resume from specific step (if latest is corrupted)
python train_flint.py --config configs/flint_60h.yaml --resume --checkpoint_step 5000

# List available checkpoints
python train_flint.py --config configs/flint_60h.yaml --list_checkpoints
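
The resume behaviour above can be pictured as a search over saved steps. The following is only a sketch under an assumed layout: it supposes checkpoints live in directories named `step_<N>` under a single root, which the guide does not specify, and the function names are hypothetical rather than taken from `train_flint.py`.

```python
import re
from pathlib import Path
from typing import List, Optional

# Assumption: each checkpoint is a directory named "step_<N>".
STEP_DIR = re.compile(r"^step_(\d+)$")


def list_checkpoint_steps(root: Path) -> List[int]:
    """Return all checkpoint steps found under root, in ascending order."""
    if not root.is_dir():
        return []
    steps = []
    for child in root.iterdir():
        match = STEP_DIR.match(child.name)
        if match and child.is_dir():
            steps.append(int(match.group(1)))
    return sorted(steps)


def resolve_resume_step(root: Path, requested: Optional[int] = None) -> Optional[int]:
    """Pick the step to resume from.

    With no explicit request, use the latest step (None means start fresh).
    With a request, insist that the checkpoint exists instead of silently
    falling back to a different one.
    """
    steps = list_checkpoint_steps(root)
    if requested is not None:
        if requested not in steps:
            raise FileNotFoundError(f"no checkpoint for step {requested} in {root}")
        return requested
    return steps[-1] if steps else None


if __name__ == "__main__":
    root = Path("checkpoints")  # hypothetical location
    print("available:", list_checkpoint_steps(root))
    print("resume from:", resolve_resume_step(root))
```

Failing loudly on a missing requested step is deliberate: when the latest checkpoint is corrupted, resuming from the wrong step without notice would waste a session.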

Post-Training (after pretraining is complete)

Recommended sequence:

  1. SFT (2-4h): SmolTalk full dataset
  2. GRPO (4-8h): Math/code with verifiable rewards (DeepSeek-R1 recipe)
  3. Tool DPO (2-4h): Correct vs incorrect tool-call pairs
  4. Quantization: INT4 AWQ for edge deployment
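
To make "verifiable rewards" in step 2 concrete, here is a minimal sketch of the group-relative advantage at the core of GRPO: several completions are sampled per prompt, each is scored by a checker, and rewards are normalized within the group. The exact-match checker, the function names, and the use of the population standard deviation are assumptions for illustration. This is not the Flint training code.

```python
from statistics import mean, pstdev
from typing import List


def exact_match_reward(predicted: str, reference: str) -> float:
    """Verifiable reward: 1.0 if the final answer matches exactly, else 0.0."""
    return 1.0 if predicted.strip() == reference.strip() else 0.0


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one prompt's group of sampled completions.

    Each advantage is (reward - group mean) / (group std + eps), so answers
    that beat their siblings get a positive signal. eps avoids division by
    zero when every sample receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    reference = "42"
    samples = ["42", "41", " 42 ", "7"]  # four sampled answers to one prompt
    rewards = [exact_match_reward(s, reference) for s in samples]
    print("rewards:", rewards)
    print("advantages:", group_relative_advantages(rewards))
```

A group in which every sample is right, or every sample is wrong, yields all-zero advantages, which is why prompts of moderate difficulty tend to carry the most training signal.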

If you have unlimited compute:

  • Scale to 1.7B with same TAP curriculum
  • Add GRPO phase during pretraining (online RL, DeepSeek-R1-Zero style)
  • Multi-epoch training with data decontamination between epochs
  • Progressive context extension: 2K → 4K → 8K → 16K