# Flint-1.2B: Scaling Strategy Guide
*Deciding how to spend your TPU hours*

## Decision Matrix

| TPU Hours | Strategy | Config | Tokens | Key Focus |
|-----------|----------|--------|--------|-----------|
| **17h** | 3-stage TAP | `flint_17h.yaml` | ~6.5B | Foundation + early reasoning |
| **40h** | 4-stage TAP | `flint_40h.yaml` | ~15B | Balanced reasoning + agency |
| **60h** | 5-stage TAP | `flint_60h.yaml` | ~23B | Deep capability crystallization |
| **100h+** | 6-stage TAP + context extension | `flint_100h.yaml` | ~40B | Full curriculum + 8K context |
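
The matrix can be read as a simple lookup: take the largest tier that fits the budget. The sketch below encodes the rows above and also derives the throughput each tier implies (tokens divided by hours). The helper names `pick_config` and `tokens_per_hour` are illustrative and are not part of the Flint codebase.

```python
# Rows copied from the decision matrix:
# (minimum TPU hours, config file, approximate tokens in billions).
TIERS = [
    (17, "flint_17h.yaml", 6.5),
    (40, "flint_40h.yaml", 15.0),
    (60, "flint_60h.yaml", 23.0),
    (100, "flint_100h.yaml", 40.0),
]


def pick_config(tpu_hours: float) -> str:
    """Return the config of the largest tier that fits within the budget."""
    chosen = None
    for min_hours, config, _tokens_b in TIERS:
        if tpu_hours >= min_hours:
            chosen = config
    if chosen is None:
        raise ValueError("budget is below the smallest tier (17h)")
    return chosen


def tokens_per_hour(tier_hours: int) -> float:
    """Implied throughput of a tier, in billions of tokens per TPU hour."""
    for min_hours, _config, tokens_b in TIERS:
        if min_hours == tier_hours:
            return tokens_b / min_hours
    raise KeyError(tier_hours)


print(pick_config(45))  # a 45h budget falls back to the 40h tier
for hours, _config, _tokens_b in TIERS:
    print(f"{hours}h tier: {tokens_per_hour(hours):.3f}B tokens/hour")
```

All four tiers imply roughly 0.38-0.40B tokens per hour, so the token counts in the matrix scale close to linearly with the hour budget.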

## What Each Extra Hour Buys You

### 17h → 40h (+23h)
- **+8.5B tokens** (2.3× total)
- Adds dedicated "Polish" stage for balanced skill refinement
- Full Orca-AgentInstruct (all 1M examples instead of a subset)
- FineMath-3plus gets proper multi-epoch exposure
- **Expected gain: +5-8 points MMLU, +8-12 points GSM8K**

### 40h → 60h (+20h)
- **+8B tokens** (1.5× over 40h)
- Adds dedicated "Agency" stage focused on tool mastery
- 5th "Crystal" annealing stage with premium data only
- Code data in 10+ programming languages (up from 5)
- **Expected gain: +3-5 points MMLU, +5-8 points GSM8K, +10-15 points on tool-use evals**

### 60h → 100h+ (+40h)
- **+17B tokens** (1.7× over 60h)
- Context extension to 8K (final 10h)
- arXiv + Wikipedia for factual grounding
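- Pushes past Chinchilla-optimal: ~40B tokens on 1.2B parameters is ~33 tokens/param, versus the ~20 tokens/param rule of thumb (the 60h tier, at ~19 tokens/param, is the one that sits near Chinchilla-optimal)
- **Expected gain: +3-5 points MMLU, +3-5 points GSM8K, plus 8K context support**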
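
To compare the three upgrade steps on equal footing, the figures above can be normalized per unit of extra compute. The sketch below recomputes the token deltas and ratios from the decision matrix and expresses the quoted MMLU ranges per 10 extra TPU hours. It only rearranges numbers already stated in this guide; `upgrade_steps` is an illustrative name.

```python
# (label, TPU hours, approximate tokens in billions) from the decision matrix.
TIERS = [("17h", 17, 6.5), ("40h", 40, 15.0), ("60h", 60, 23.0), ("100h", 100, 40.0)]
# Expected MMLU gain ranges (low, high) quoted for each upgrade step, in order.
MMLU_GAIN = [(5, 8), (3, 5), (3, 5)]


def upgrade_steps():
    """Summarize what each tier-to-tier upgrade adds."""
    steps = []
    for (prev, nxt), (lo, hi) in zip(zip(TIERS, TIERS[1:]), MMLU_GAIN):
        prev_name, prev_hours, prev_tokens = prev
        next_name, next_hours, next_tokens = nxt
        extra_hours = next_hours - prev_hours
        steps.append({
            "step": f"{prev_name} -> {next_name}",
            "extra_hours": extra_hours,
            "extra_tokens_b": round(next_tokens - prev_tokens, 1),
            "token_ratio": round(next_tokens / prev_tokens, 2),
            # Quoted MMLU range rescaled to "points per 10 extra hours".
            "mmlu_per_10h": (
                round(10 * lo / extra_hours, 2),
                round(10 * hi / extra_hours, 2),
            ),
        })
    return steps


for step in upgrade_steps():
    print(step)
```

Read this way, the expected MMLU return per 10 hours falls from about 2.2-3.5 points (17h → 40h) to about 0.75-1.25 points (60h → 100h), which is the diminishing-returns pattern the section describes.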

## Scaling Principles (from literature)

1. **Overtraining works reliably** (arxiv:2403.08540)
   - Performance scales predictably up to ~640 tokens per parameter (about 32× the Chinchilla-optimal ratio)
   - Returns diminish but stay positive across that range

2. **Data repetition is safe up to ~4 epochs** (Muennighoff et al.)
   - OpenThoughts-114k is repeated 5× in this plan, slightly past that guideline; treat the fifth pass as lower-value than the first four
   - Large web data (FineWeb-Edu, DCLM) should not be repeated

3. **Quality annealing gives outsized returns** (SmolLM2, OLMo-2)
   - Spending the final 10-15% of training on the best data produces the biggest benchmark jumps
   - Schedule FineMath-4+ and hard reasoning problems during LR decay

4. **Inference-optimal training favors overtraining** (arxiv:2401.00448)
   - When inference cost matters at deployment, train a smaller model for longer
   - At roughly equal training compute, 1.2B params × 23B tokens is cheaper to serve than 1.7B params × 16B tokens

5. **Muon gives ~2× sample efficiency** (arxiv:2502.16982)
   - Your 23B tokens with Muon ≈ 46B tokens with AdamW
   - Should stack with quality data, so the gains are expected to compound rather than merely add
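
The arithmetic behind principles 1, 4, and 5 can be checked in a few lines. The sketch below assumes the common ~20 tokens/param reading of Chinchilla-optimal and the standard C ≈ 6·N·D estimate of training FLOPs; neither constant comes from this guide, and the function names are illustrative.

```python
PARAMS_B = 1.2                    # Flint model size, billions of parameters
CHINCHILLA_TOKENS_PER_PARAM = 20  # rule of thumb (assumption, not from this guide)
MUON_EFFICIENCY = 2.0             # ~2x sample efficiency quoted in principle 5


def tokens_per_param(tokens_b: float, params_b: float = PARAMS_B) -> float:
    """Training tokens seen per model parameter."""
    return tokens_b / params_b


def train_flops(params_b: float, tokens_b: float) -> float:
    """Approximate training compute using C ~= 6 * N * D."""
    return 6 * (params_b * 1e9) * (tokens_b * 1e9)


# Tokens/param for each tier, relative to the Chinchilla rule of thumb.
for tokens_b in (6.5, 15.0, 23.0, 40.0):
    ratio = tokens_per_param(tokens_b)
    print(f"{tokens_b:>5}B tokens: {ratio:5.1f} tokens/param "
          f"({ratio / CHINCHILLA_TOKENS_PER_PARAM:.2f}x Chinchilla)")

# Principle 4: the two configurations cost about the same to train.
small = train_flops(1.2, 23)
large = train_flops(1.7, 16)
print(f"1.2B x 23B: {small:.3e} FLOPs, 1.7B x 16B: {large:.3e} FLOPs")

# Principle 5: AdamW-equivalent tokens under the quoted Muon factor.
print(f"Muon-adjusted tokens at 60h: {23 * MUON_EFFICIENCY:.0f}B")
```

Under these assumptions the 60h tier lands at about 19 tokens/param, and the two configurations in principle 4 differ by under 2% in training compute, so the comparison there is effectively iso-compute.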

## Multi-Session Management

For runs spanning multiple Kaggle sessions:

```bash
# Session 1: Start training
python train_flint.py --config configs/flint_60h.yaml

# Session 2: Auto-resumes from latest checkpoint
python train_flint.py --config configs/flint_60h.yaml --resume

# Resume from specific step (if latest is corrupted)
python train_flint.py --config configs/flint_60h.yaml --resume --checkpoint_step 5000

# List available checkpoints
python train_flint.py --config configs/flint_60h.yaml --list_checkpoints
```
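
For readers wondering what resume logic of this kind typically involves, here is a minimal sketch. It is not the actual `train_flint.py` implementation: the `step_<N>` directory naming, the `DONE` sentinel file, and both function names are hypothetical choices made for illustration.

```python
import re
from pathlib import Path

# Hypothetical layout: <ckpt_root>/step_<N>/ with a DONE file written last.
STEP_DIR = re.compile(r"^step_(\d+)$")


def list_checkpoint_steps(ckpt_root):
    """Return the sorted steps of all fully written checkpoints."""
    root = Path(ckpt_root)
    if not root.is_dir():
        return []
    steps = []
    for child in root.iterdir():
        match = STEP_DIR.match(child.name)
        # Skip checkpoints without the sentinel: a session that was killed
        # mid-save leaves a partial directory that should not be resumed from.
        if match and child.is_dir() and (child / "DONE").exists():
            steps.append(int(match.group(1)))
    return sorted(steps)


def resolve_resume_step(ckpt_root, requested_step=None):
    """Pick the step to resume from, or None if training should start fresh."""
    steps = list_checkpoint_steps(ckpt_root)
    if not steps:
        return None
    if requested_step is None:
        return steps[-1]  # default: latest complete checkpoint
    if requested_step not in steps:
        raise ValueError(f"no complete checkpoint at step {requested_step}; "
                         f"available: {steps}")
    return requested_step


if __name__ == "__main__":
    print(resolve_resume_step("checkpoints"))
```

Writing the sentinel file only after the checkpoint is fully saved is one common way to keep an interrupted session from leaving a checkpoint that looks valid but is not.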

## Post-Training (after pretraining is complete)

### Recommended sequence (roughly 8-16h total before quantization):
1. **SFT** (2-4h): SmolTalk full dataset
2. **GRPO** (4-8h): Math/code with verifiable rewards (DeepSeek-R1 recipe)
3. **Tool DPO** (2-4h): Correct vs incorrect tool-call pairs
4. **Quantization**: INT4 AWQ for edge deployment
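
The core of the GRPO step is easy to illustrate: sample several completions per prompt, score each with a verifiable reward, and normalize the rewards within the group. The sketch below shows only that advantage computation, not a full training loop. The exact-match reward, the use of population standard deviation, and the `eps` value are assumptions made for illustration.

```python
from statistics import mean, pstdev


def verifiable_reward(answer: str, reference: str) -> float:
    """Binary reward: 1.0 if the final answer matches the reference exactly."""
    return 1.0 if answer.strip() == reference.strip() else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group of sampled completions.

    Each advantage is (reward - group mean) / (group std + eps), so a
    completion is credited only for beating its siblings on the same prompt.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one math prompt whose reference answer is "42".
answers = ["42", "41", "42 ", "7"]
rewards = [verifiable_reward(a, "42") for a in answers]
advantages = group_relative_advantages(rewards)
print(rewards)
print([round(a, 3) for a in advantages])
```

When every completion in a group gets the same reward, all advantages are zero and the prompt contributes no gradient signal, which is why prompts of intermediate difficulty tend to be the most useful for this stage.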

### If you have unlimited compute:
- Scale to 1.7B parameters with the same TAP curriculum
- Add GRPO phase during pretraining (online RL, DeepSeek-R1-Zero style)
- Multi-epoch training with data decontamination between epochs
- Progressive context extension: 2K → 4K → 8K → 16K