tekkmaven/flint-1.2B


flint-1.2B / SCALING_GUIDE.md


	# Flint-1.2B: Scaling Strategy Guide
	# Deciding how to spend your TPU hours

	## Decision Matrix

	| TPU Hours | Strategy | Config | Tokens | Key Focus |
	|-----------|----------|--------|--------|-----------|
	| 17h | 3-stage TAP | `flint_17h.yaml` | ~6.5B | Foundation + early reasoning |
	| 40h | 4-stage TAP | `flint_40h.yaml` | ~15B | Balanced reasoning + agency |
	| 60h | 5-stage TAP | `flint_60h.yaml` | ~23B | Deep capability crystallization |
	| 100h+ | 6-stage + context ext | `flint_100h.yaml` | ~40B | Full + 8K context |
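
The matrix can be read as a lookup: take the largest tier whose hours fit your budget. The sketch below is a minimal illustration using only the numbers in the table; the function name `pick_tier` and the tuple layout are hypothetical and not part of the repo.

```python
# Tiers copied from the decision matrix:
# (minimum TPU hours, config file, approximate tokens in billions).
TIERS = [
    (17, "flint_17h.yaml", 6.5),
    (40, "flint_40h.yaml", 15.0),
    (60, "flint_60h.yaml", 23.0),
    (100, "flint_100h.yaml", 40.0),
]


def pick_tier(budget_hours):
    """Return the largest tier that fits the budget, or None if under 17h."""
    chosen = None
    for hours, config, tokens_b in TIERS:
        if budget_hours >= hours:
            chosen = (hours, config, tokens_b)
    return chosen


if __name__ == "__main__":
    for budget in (30, 60, 120):
        hours, config, tokens_b = pick_tier(budget)
        # Implied throughput (tokens / hours) is roughly 0.38 to 0.40B tokens/hour at every tier.
        print(f"{budget}h budget -> {config} (~{tokens_b}B tokens, "
              f"~{tokens_b / hours:.2f}B tokens/hour)")
```

A budget that falls between tiers (for example 30h) maps to the tier below it; the leftover hours are not modeled here.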

	## What Each Extra Hour Buys You

	### 17h → 40h (+23h)
	- +8.5B tokens (2.3× total)
	- Adds dedicated "Polish" stage for balanced skill refinement
	- Full Orca-AgentInstruct (1M examples vs subset)
	- FineMath-3plus gets proper multi-epoch exposure
	- Expected gain: +5 to +8 points on MMLU, +8 to +12 points on GSM8K

	### 40h → 60h (+20h)
	- +8B tokens (1.5× over 40h)
	- Adds dedicated "Agency" stage focused on tool mastery
	- 5th "Crystal" annealing stage with premium data only
	- Multi-language code (10+ languages vs 5)
	- Expected gain: +3 to +5 points on MMLU, +5 to +8 points on GSM8K, +10 to +15 points on tool-use evals

	### 60h → 100h+ (+40h)
	- +17B tokens (1.7× over 60h)
	- Context extension to 8K (final 10h)
	- arXiv + Wikipedia for factual grounding
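	- Most tokens per parameter of any tier: ~33 tokens/param, about 1.7× the ~20 tokens/param Chinchilla-optimal ratio (the 60h tier, at ~19, is the one that sits near Chinchilla-optimal)
	- Expected gain: +3 to +5 points on MMLU, +3 to +5 points on GSM8K, plus a usable 8K context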
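
The token arithmetic in the three steps above can be checked directly from the decision matrix. This is a small sketch, assuming a 1.2B-parameter model (from the model name) and the common rule of thumb of about 20 tokens per parameter for Chinchilla-optimal training; `tier_stats` is an illustrative helper, not part of the repo.

```python
PARAMS_B = 1.2          # model size in billions of parameters (from the model name)
CHINCHILLA_RATIO = 20   # assumed rule-of-thumb tokens per parameter

# (label, approximate tokens in billions) from the decision matrix.
TIERS = [("17h", 6.5), ("40h", 15.0), ("60h", 23.0), ("100h+", 40.0)]


def tier_stats(tiers=TIERS, params_b=PARAMS_B):
    """Per tier: tokens added, multiplier over the previous tier, tokens per parameter."""
    rows = []
    prev = None
    for label, tokens_b in tiers:
        rows.append({
            "tier": label,
            "added_b": None if prev is None else tokens_b - prev,
            "multiplier": None if prev is None else tokens_b / prev,
            "tokens_per_param": tokens_b / params_b,
        })
        prev = tokens_b
    return rows


if __name__ == "__main__":
    for row in tier_stats():
        ratio = row["tokens_per_param"] / CHINCHILLA_RATIO
        added = "n/a" if row["added_b"] is None else f"+{row['added_b']:.1f}B"
        mult = "n/a" if row["multiplier"] is None else f"{row['multiplier']:.1f}x"
        print(f"{row['tier']:>6}: {added:>7} {mult:>5} "
              f"{row['tokens_per_param']:.1f} tok/param ({ratio:.2f}x the 20 tok/param rule)")
```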

	## Scaling Principles (from literature)

	1. Overtraining works reliably (arxiv:2403.08540)
	- Performance scales predictably at token multipliers up to 640 tokens/param
	- Expect diminishing returns rather than regressions within that range

	2. Data repetition is safe up to ~4 epochs (Muennighoff et al.)
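	- OpenThoughts-114k is small enough to repeat; 5× is slightly past the ~4-epoch guideline, so expect the last pass to add less than fresh data would
	- Large web data (FineWeb-Edu, DCLM) should not be repeated; there are far more unique tokens than any tier consumes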
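
To see whether a data mix stays inside the ~4-epoch guideline, divide the tokens a dataset contributes to a stage by its unique token count. The sketch below illustrates that check; the stage size, mix fractions, and dataset sizes in the demo are placeholder values chosen for illustration, not measurements of the real datasets.

```python
MAX_SAFE_EPOCHS = 4  # guideline attributed above to Muennighoff et al.


def epochs_needed(stage_tokens, mix_fraction, dataset_tokens):
    """Passes over a dataset when it supplies `mix_fraction` of a stage's tokens."""
    return stage_tokens * mix_fraction / dataset_tokens


def check_repetition(name, stage_tokens, mix_fraction, dataset_tokens,
                     max_epochs=MAX_SAFE_EPOCHS):
    """Return (name, epochs, status) where status flags runs past the guideline."""
    epochs = epochs_needed(stage_tokens, mix_fraction, dataset_tokens)
    status = "ok" if epochs <= max_epochs else "over guideline"
    return name, round(epochs, 2), status


if __name__ == "__main__":
    # Placeholder numbers: a 4B-token stage with a small and a large source.
    print(check_repetition("small_reasoning_set", 4_000_000_000, 0.25, 200_000_000))
    print(check_repetition("large_web_set", 4_000_000_000, 0.50, 100_000_000_000))
```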

	3. Quality annealing gives outsized returns (SmolLM2, OLMo-2)
	- Final 10-15% of training with best data = biggest benchmark jumps
	- FineMath-4plus and hard reasoning problems during LR decay
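
One way to line up the data switch with the learning-rate decay is a warmup-stable-decay schedule whose final phase covers the last 10-15% of steps. The sketch below assumes that schedule shape; the guide itself does not specify one, and `peak_lr`, `warmup_steps`, and the mix names `main_mix` / `anneal_mix` are hypothetical.

```python
def decay_start_step(total_steps, decay_fraction=0.15):
    """First step of the final decay phase (last `decay_fraction` of training)."""
    return total_steps - round(total_steps * decay_fraction)


def lr_at(step, total_steps, peak_lr=3e-4, warmup_steps=1000,
          decay_fraction=0.15, min_lr=0.0):
    """Warmup-stable-decay: linear warmup, flat plateau, linear decay to min_lr."""
    start = decay_start_step(total_steps, decay_fraction)
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    if step < start:
        return peak_lr
    progress = (step - start) / max(1, total_steps - start)
    return peak_lr + (min_lr - peak_lr) * progress


def data_phase(step, total_steps, decay_fraction=0.15):
    """Switch to the high-quality annealing mix exactly when the LR starts to decay."""
    start = decay_start_step(total_steps, decay_fraction)
    return "anneal_mix" if step >= start else "main_mix"


if __name__ == "__main__":
    total = 10_000
    for step in (0, 999, 5_000, 8_500, 9_250, 10_000):
        print(step, f"{lr_at(step, total):.2e}", data_phase(step, total))
```

Tying both decisions to the same `decay_start_step` keeps the premium data and the decaying learning rate in lockstep if the total step count changes.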

	4. Inference-optimal training favors overtraining (arxiv:2401.00448)
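	- When inference cost matters for deployment, train a smaller model for longer
	- 1.2B × 23B tokens and 1.7B × 16B tokens cost about the same to train (~1.6e20 FLOPs by the 6ND estimate), but the 1.2B model is ~29% cheaper per generated token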
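
The cost side of that comparison can be reproduced with two standard approximations: about 6·N·D FLOPs to train a model with N parameters on D tokens, and about 2·N FLOPs per generated token at inference. This sketch covers cost only; it says nothing about which model reaches better quality.

```python
def train_flops(params, tokens):
    """Approximate training compute with the common 6 * N * D estimate."""
    return 6 * params * tokens


def inference_flops_per_token(params):
    """Approximate forward-pass compute per generated token as 2 * N."""
    return 2 * params


if __name__ == "__main__":
    small = {"params": 1.2e9, "tokens": 23e9}
    large = {"params": 1.7e9, "tokens": 16e9}

    for name, cfg in (("1.2B x 23B", small), ("1.7B x 16B", large)):
        print(f"{name}: train ~{train_flops(cfg['params'], cfg['tokens']):.3e} FLOPs, "
              f"inference ~{inference_flops_per_token(cfg['params']):.1e} FLOPs/token")

    ratio = inference_flops_per_token(large["params"]) / inference_flops_per_token(small["params"])
    print(f"1.7B costs ~{ratio:.2f}x the 1.2B model per generated token")
```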

	5. Muon gives ~2× sample efficiency (arxiv:2502.16982)
	- If that ratio holds at this scale, your 23B tokens with Muon ≈ 46B tokens with AdamW
	- Expected to stack with quality data → multiplicative gains

	## Multi-Session Management

	For runs spanning multiple Kaggle sessions:

	```bash
	# Session 1: Start training
	python train_flint.py --config configs/flint_60h.yaml

	# Session 2: Auto-resumes from latest checkpoint
	python train_flint.py --config configs/flint_60h.yaml --resume

	# Resume from specific step (if latest is corrupted)
	python train_flint.py --config configs/flint_60h.yaml --resume --checkpoint_step 5000

	# List available checkpoints
	python train_flint.py --config configs/flint_60h.yaml --list_checkpoints
	```
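
For intuition, the resume logic behind flags like these can be sketched as "use the requested step if given, otherwise the newest checkpoint that finished writing". The code below is a hypothetical illustration, not the implementation in `train_flint.py`: the `step_<N>` directory naming and the `DONE` marker file are assumptions.

```python
import re
from pathlib import Path

STEP_DIR = re.compile(r"^step_(\d+)$")  # assumed naming scheme, e.g. step_5000


def list_checkpoints(ckpt_root):
    """Return sorted step numbers for directories named step_<N> under ckpt_root."""
    root = Path(ckpt_root)
    if not root.is_dir():
        return []
    steps = []
    for child in root.iterdir():
        match = STEP_DIR.match(child.name)
        if match and child.is_dir():
            steps.append(int(match.group(1)))
    return sorted(steps)


def is_complete(ckpt_root, step):
    """Treat a checkpoint as usable only if it holds a DONE marker (assumed convention)."""
    return (Path(ckpt_root) / f"step_{step}" / "DONE").is_file()


def pick_resume_step(ckpt_root, requested_step=None):
    """Requested step if given, else the newest complete checkpoint, else None."""
    steps = list_checkpoints(ckpt_root)
    if requested_step is not None:
        if requested_step not in steps:
            raise FileNotFoundError(f"no checkpoint for step {requested_step}")
        return requested_step
    for step in reversed(steps):
        if is_complete(ckpt_root, step):
            return step
    return None


if __name__ == "__main__":
    print(list_checkpoints("checkpoints"))
    print(pick_resume_step("checkpoints"))
```

Writing the marker file last means a session that is killed mid-save leaves a directory that the resume logic skips, which is the failure mode that matters on time-limited sessions.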

	## Post-Training (after pretraining is complete)

	### Recommended sequence:
	1. SFT (2-4h): SmolTalk full dataset
	2. GRPO (4-8h): Math/code with verifiable rewards (DeepSeek-R1 recipe)
	3. Tool DPO (2-4h): Correct vs incorrect tool-call pairs
	4. Quantization: INT4 AWQ for edge deployment
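
The core of the GRPO step is its advantage signal: several completions are sampled for the same prompt, each is scored by a verifiable reward, and the rewards are normalized within that group. The sketch below shows only that calculation. The exact-match reward and the use of the population standard deviation are simplifying assumptions, and no policy update is included.

```python
from statistics import mean, pstdev


def verifiable_reward(model_answer, reference_answer):
    """Binary reward: 1.0 if the final answer matches the reference exactly, else 0.0."""
    return 1.0 if model_answer.strip() == reference_answer.strip() else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's sample group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    reference = "42"
    samples = ["42", "41", " 42 ", "forty-two"]  # four completions for one prompt
    rewards = [verifiable_reward(s, reference) for s in samples]
    print(rewards)
    print([round(a, 3) for a in group_relative_advantages(rewards)])
```

When every sample in a group gets the same reward, all advantages are zero and the prompt contributes no learning signal, which is why prompts of intermediate difficulty are the most useful.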

	### If you have unlimited compute:
	- Scale to 1.7B with same TAP curriculum
	- Add GRPO phase during pretraining (online RL, DeepSeek-R1-Zero style)
	- Multi-epoch training with data decontamination between epochs
	- Progressive context extension: 2K → 4K → 8K → 16K