Flint-1.2B: Scaling Strategy Guide

Deciding how to spend your TPU hours

Decision Matrix

| TPU Hours | Strategy | Config | Tokens | Key Focus |
|---|---|---|---|---|
| 17h | 3-stage TAP | `flint_17h.yaml` | ~6.5B | Foundation + early reasoning |
| 40h | 4-stage TAP | `flint_40h.yaml` | ~15B | Balanced reasoning + agency |
| 60h | 5-stage TAP | `flint_60h.yaml` | ~23B | Deep capability crystallization |
| 100h+ | 6-stage + context ext | `flint_100h.yaml` | ~40B | Full + 8K context |
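
As a sanity check on the matrix, the throughput each tier implies can be computed directly from its hours and token budget. The sketch below uses only the numbers in the table; the helper name `tokens_per_hour` is illustrative and not part of the Flint codebase.

```python
# TPU hours and approximate token budgets copied from the decision matrix.
TIERS = {
    "flint_17h.yaml": (17, 6.5e9),
    "flint_40h.yaml": (40, 15e9),
    "flint_60h.yaml": (60, 23e9),
    "flint_100h.yaml": (100, 40e9),
}


def tokens_per_hour(hours, tokens):
    """Average training throughput implied by a tier, in tokens per hour."""
    return tokens / hours


for config, (hours, tokens) in TIERS.items():
    rate = tokens_per_hour(hours, tokens)
    # Report both per-hour and per-second figures for easier comparison
    # against what your training logs show.
    print(f"{config}: {rate / 1e6:.0f}M tokens/h, {rate / 3600 / 1e3:.0f}K tokens/s")
```

All four tiers land in a narrow band of roughly 375M to 400M tokens per hour, so a run that logs far below that rate is unlikely to reach its token budget in the listed hours.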

What Each Extra Hour Buys You

17h → 40h (+23h)

+8.5B tokens (2.3× the 17h total)
Adds a dedicated "Polish" stage for balanced skill refinement
Uses the full Orca-AgentInstruct set (1M examples) instead of a subset
Gives FineMath-3plus proper multi-epoch exposure
Expected gain: +5-8 points MMLU, +8-12 points GSM8K

40h → 60h (+20h)

+8B tokens (1.5× the 40h total)
Adds a dedicated "Agency" stage focused on tool mastery
Adds a 5th "Crystal" annealing stage that uses premium data only
Broadens code data to 10+ languages (up from 5)
Expected gain: +3-5 points MMLU, +5-8 points GSM8K, +10-15 points tool-use

60h → 100h+ (+40h)

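+17B tokens (1.7× the 60h total)
Context extension to 8K (final 10h)
Adds arXiv + Wikipedia for factual grounding
Maximum overtraining: ~33 tokens/param, past the ~20 tokens/param Chinchilla-optimal point that the 60h run sits near
Expected gain: +3-5 points MMLU, +3-5 points GSM8K, plus 8K context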
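
To compare the three upgrades on equal footing, the expected gains above can be divided by the extra hours each one costs. This is a rough sketch using only the ranges quoted in this section; the gains are the guide's estimates, not measurements, and `gain_per_hour` is an illustrative helper.

```python
# (extra TPU hours, expected MMLU gain range, expected GSM8K gain range),
# all taken from the upgrade descriptions in this section.
UPGRADES = {
    "17h -> 40h": (23, (5, 8), (8, 12)),
    "40h -> 60h": (20, (3, 5), (5, 8)),
    "60h -> 100h": (40, (3, 5), (3, 5)),
}


def gain_per_hour(extra_hours, gain_range):
    """Convert a (low, high) gain estimate into points per extra hour."""
    low, high = gain_range
    return low / extra_hours, high / extra_hours


for name, (hours, mmlu, gsm8k) in UPGRADES.items():
    m_lo, m_hi = gain_per_hour(hours, mmlu)
    g_lo, g_hi = gain_per_hour(hours, gsm8k)
    print(f"{name}: MMLU {m_lo:.2f}-{m_hi:.2f} pts/h, GSM8K {g_lo:.2f}-{g_hi:.2f} pts/h")
```

Under these estimates each successive upgrade buys fewer benchmark points per hour, so the 100h+ tier is mainly justified by the 8K context rather than by benchmark gains.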

Scaling Principles (from literature)

Overtraining works reliably (arxiv:2403.08540)
- Performance scales predictably up to 640 tokens per parameter
- Returns diminish, but more tokens did not hurt within that range
Data repetition is safe up to ~4 epochs (Muennighoff et al.)
- OpenThoughts-114k at 5× sits just past that range, so expect the last pass to add less than the first four
- Large web data (FineWeb-Edu, DCLM) should not be repeated
Quality annealing gives outsized returns (SmolLM2, OLMo-2)
- Spending the final 10-15% of training on the best data tends to give the biggest benchmark jumps
- Use FineMath-4plus and hard reasoning problems during LR decay
Inference-optimal training favors overtraining (arxiv:2401.00448)
- When inference cost matters at deployment, train a smaller model for longer
- At roughly equal training compute, 1.2B × 23B tokens beats 1.7B × 16B tokens for inference efficiency
Muon gives ~2× sample efficiency (arxiv:2502.16982)
- If that holds at this scale, 23B tokens with Muon ≈ 46B tokens with AdamW
- Should stack with quality data for multiplicative gains
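
The token-per-parameter and compute figures behind these principles are easy to reproduce. The sketch below assumes two standard approximations that are not stated in this guide: training cost of about 6·N·D FLOPs and inference cost of about 2·N FLOPs per token, plus the common ~20 tokens/param rule of thumb for Chinchilla-optimal training. Function names are illustrative.

```python
CHINCHILLA_TOKENS_PER_PARAM = 20  # assumption: common rule of thumb
FLINT_PARAMS = 1.2e9


def tokens_per_param(n_params, n_tokens):
    return n_tokens / n_params


def train_flops(n_params, n_tokens):
    """Approximate training compute with the usual 6 * N * D estimate."""
    return 6 * n_params * n_tokens


def inference_flops_per_token(n_params):
    """Approximate forward-pass cost with the usual 2 * N estimate."""
    return 2 * n_params


# Where each tier sits relative to the Chinchilla rule of thumb.
for hours, tokens in [(17, 6.5e9), (40, 15e9), (60, 23e9), (100, 40e9)]:
    ratio = tokens_per_param(FLINT_PARAMS, tokens)
    print(f"{hours}h: {ratio:.1f} tokens/param "
          f"({ratio / CHINCHILLA_TOKENS_PER_PARAM:.2f}x Chinchilla)")

# The 1.2B-vs-1.7B comparison from the list above.
small = train_flops(1.2e9, 23e9)
large = train_flops(1.7e9, 16e9)
print(f"training compute ratio (1.2B / 1.7B): {small / large:.3f}")
print(f"inference cost ratio (1.2B / 1.7B): "
      f"{inference_flops_per_token(1.2e9) / inference_flops_per_token(1.7e9):.3f}")
```

Under these approximations the two configurations cost nearly the same to train, while the 1.2B model needs about 30% fewer FLOPs per generated token.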

Multi-Session Management

For runs spanning multiple Kaggle sessions:

# Session 1: Start training
python train_flint.py --config configs/flint_60h.yaml

# Session 2: Auto-resumes from latest checkpoint
python train_flint.py --config configs/flint_60h.yaml --resume

# Resume from specific step (if latest is corrupted)
python train_flint.py --config configs/flint_60h.yaml --resume --checkpoint_step 5000

# List available checkpoints
python train_flint.py --config configs/flint_60h.yaml --list_checkpoints
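
For readers wiring up something similar, here is a minimal sketch of how auto-resume can pick a checkpoint. It is not the actual logic in `train_flint.py`: the `step_<N>` directory naming and the `DONE` completion marker are assumptions made for illustration.

```python
import re
from pathlib import Path

STEP_PATTERN = re.compile(r"^step_(\d+)$")  # hypothetical naming scheme


def list_checkpoint_steps(ckpt_dir):
    """Return sorted step numbers of checkpoints that finished writing."""
    root = Path(ckpt_dir)
    if not root.is_dir():
        return []
    steps = []
    for entry in root.iterdir():
        match = STEP_PATTERN.match(entry.name)
        # Skip checkpoints without the (hypothetical) DONE marker: a session
        # that timed out mid-save would leave a partial directory behind.
        if match and entry.is_dir() and (entry / "DONE").exists():
            steps.append(int(match.group(1)))
    return sorted(steps)


def pick_resume_step(ckpt_dir, requested_step=None):
    """Choose the step to resume from, or None to start from scratch."""
    steps = list_checkpoint_steps(ckpt_dir)
    if requested_step is not None:
        if requested_step not in steps:
            raise ValueError(f"no complete checkpoint at step {requested_step}")
        return requested_step
    return steps[-1] if steps else None
```

Writing the marker file last means an interrupted save is ignored on the next session, which is the failure mode the `--checkpoint_step` fallback above guards against.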

Post-Training (after pretraining is complete)

Recommended sequence:

1. SFT (2-4h): full SmolTalk dataset
2. GRPO (4-8h): Math/code with verifiable rewards (DeepSeek-R1 recipe)
3. Tool DPO (2-4h): Correct vs incorrect tool-call pairs
4. Quantization: INT4 AWQ for edge deployment
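
To make the GRPO step concrete, the sketch below shows the two pieces that "verifiable rewards" implies: a rule-based reward that checks a final numeric answer, and the group-relative advantage that normalizes rewards across completions sampled for the same prompt. It is a simplified illustration, not a training loop; the answer-extraction rule and function names are assumptions.

```python
import re
import statistics


def verifiable_reward(completion, reference_answer):
    """Return 1.0 if the last number in the completion matches the reference."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    if not numbers:
        return 0.0
    return 1.0 if float(numbers[-1]) == float(reference_answer) else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group: (r - mean) / std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; eps avoids division by zero
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled completions for the prompt "What is 12 * 7?"
completions = [
    "12 * 7 = 84",
    "The answer is 74",
    "Multiplying gives 84.0",
    "I am not sure",
]
rewards = [verifiable_reward(c, "84") for c in completions]
advantages = group_relative_advantages(rewards)
print(rewards)
print([round(a, 3) for a in advantages])
```

When every completion in a group earns the same reward, all advantages are zero and the group contributes no gradient signal, which is why prompts should be neither trivially easy nor impossible for the current model.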

If you have unlimited compute:

Scale to 1.7B with same TAP curriculum
Add GRPO phase during pretraining (online RL, DeepSeek-R1-Zero style)
Multi-epoch training with data decontamination between epochs
Progressive context extension: 2K → 4K → 8K → 16K