🔥 Flint-1.2B — Reasoning-Native Agentic Small Language Model
"Strikes sparks from nothing. Small, sharp, starts fires."
Flint-1.2B is a small language model pretrained from scratch with a novel training methodology called Thought-Action Pretraining (TAP): the model learns to reason internally (`<think>`) and use tools (`<tool_call>`) as part of its base language modeling objective, not as a post-training bolt-on.
🎯 The Edge
Most models learn reasoning in post-training (RLHF/GRPO) and tool use in fine-tuning. Flint learns both during pretraining, so thinking and acting are as natural to it as punctuation.
| Capability | How It's Trained | Reference |
|---|---|---|
| `<think>...</think>` internal reasoning | LM objective on 114K reasoning traces from step 0 | Phi-4-Mini-Reasoning, Quiet-STaR |
| `<tool_call>...</tool_call>` tool use | Native vocabulary, trained alongside prose | SmolTalk/APIGen |
| `<act>...<observe>...` agent loops | Multi-step trajectories as standard text | Orca-AgentInstruct |
| Think-before-act pattern | Model sees thinking precedes better actions | DeepSeek-R1 |
📐 Architecture
| Parameter | Value |
|---|---|
| Parameters | 1.24B |
| Hidden dim | 2048 |
| FFN dim (SwiGLU) | 5504 |
| Layers | 22 |
| Attention heads | 32 |
| KV heads (GQA) | 4 |
| Activation | SwiGLU |
| Normalization | RMSNorm (pre-norm) |
| Position encoding | RoPE (θ=500,000) |
| Vocab size | 49,216 |
| Context length | 2048 |
| Embedding tying | Yes |
Design philosophy: Wide-and-shallow (22 layers × 5504 FFN vs typical 24 × 4096). Wider MLPs store more factual knowledge per parameter — critical for small models (1.5-Pints, arxiv:2408.03506).
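For reference, the table maps onto a config sketch like the one below. Field names follow common Hugging Face conventions and are illustrative rather than the repo's actual schema; the RMSNorm epsilon is an assumption.

```python
from dataclasses import dataclass

@dataclass
class FlintConfig:
    # Mirrors the architecture table above; names are illustrative.
    vocab_size: int = 49_216
    hidden_size: int = 2048
    intermediate_size: int = 5504       # SwiGLU FFN width
    num_hidden_layers: int = 22
    num_attention_heads: int = 32
    num_key_value_heads: int = 4        # GQA: 8 query heads share each KV head
    max_position_embeddings: int = 2048
    rope_theta: float = 500_000.0
    rms_norm_eps: float = 1e-5          # assumed value, not stated in the card
    tie_word_embeddings: bool = True
```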
⚡ Training
Hardware
- Kaggle TPU v5e-8 (8 chips, 128GB HBM total, ~1,576 TFLOPS BF16)
- Framework: JAX/SPMD with Levanter-style training loop
- Precision: BF16 native
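As a rough orientation (not the repo's actual loop), a JAX SPMD setup for the 8-chip slice can start from a one-axis device mesh as below; the real Levanter-style loop may also shard parameters and optimizer state FSDP-style.

```python
import jax
from jax.experimental import mesh_utils
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Sketch: an 8-way data-parallel mesh over the v5e-8 slice.
devices = mesh_utils.create_device_mesh((jax.device_count(),))   # 8 chips
mesh = Mesh(devices, axis_names=("data",))

batch_sharding = NamedSharding(mesh, P("data"))   # shard the batch dimension across chips
replicated = NamedSharding(mesh, P())             # replicate parameters on every chip
```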
Optimizer: Muon + AdamW Hybrid
Per Moonlight (arxiv:2502.16982) — achieves ~2× sample efficiency over AdamW alone:
| Component | Optimizer | Settings |
|---|---|---|
| Weight matrices (Q/K/V/O, FFN) | Muon | momentum=0.95, 5 Newton-Schulz iters, wd=0.05 |
| Embeddings, norms, LM head | AdamW | β₁=0.9, β₂=0.95, wd=0.1 |
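Below is a hedged sketch of how this split can be wired with optax.multi_transform. The Newton-Schulz step follows the public Muon reference implementation; the parameter-name substrings and the `muon_like` chain are assumptions, not the repo's code, and in practice the constant learning rates would be driven by the WSD schedule described next.

```python
import jax
import jax.numpy as jnp
import optax
from flax import traverse_util

def orthogonalize(g, steps=5):
    # Approximately orthogonalize a 2-D update via Newton-Schulz iteration
    # (quintic coefficients from the Muon reference implementation).
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (jnp.linalg.norm(g) + 1e-7)
    transposed = x.shape[0] > x.shape[1]
    x = x.T if transposed else x
    for _ in range(steps):
        s = x @ x.T
        x = a * x + (b * s + c * (s @ s)) @ x
    return x.T if transposed else x

def muon_like(lr, momentum=0.95, ns_steps=5, wd=0.05):
    # Heavy-ball momentum, then orthogonalized update, then decoupled weight decay.
    return optax.chain(
        optax.trace(decay=momentum),
        optax.stateless(lambda u, p: jax.tree_util.tree_map(
            lambda g: orthogonalize(g, ns_steps) if g.ndim == 2 else g, u)),
        optax.add_decayed_weights(wd),
        optax.scale(-lr),
    )

def label_params(params):
    # Embeddings, norms and the LM head go to AdamW; all other matrices to Muon.
    # Substrings are assumptions about the parameter-tree naming.
    flat = traverse_util.flatten_dict(params)
    labels = {k: "adamw" if any(s in "/".join(k) for s in ("embed", "norm", "lm_head"))
              else "muon" for k in flat}
    return traverse_util.unflatten_dict(labels)

optimizer = optax.multi_transform(
    {"muon": muon_like(5e-4),
     "adamw": optax.adamw(5e-4, b1=0.9, b2=0.95, weight_decay=0.1)},
    label_params,
)
```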
Learning Rate: WSD (Warmup-Stable-Decay)
- Warmup: 2,000 steps (linear 0 → 5e-4)
- Stable: constant 5e-4
- Decay: linear to 0 over final 13-15% of training
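In optax terms the schedule joins three pieces; the total step count and the 15% decay fraction below are placeholders for whichever training strategy is used.

```python
import optax

def wsd_schedule(peak_lr=5e-4, total_steps=60_000, warmup_steps=2_000, decay_frac=0.15):
    # Warmup-Stable-Decay: linear warmup, constant plateau, linear decay to 0.
    decay_steps = int(total_steps * decay_frac)
    stable_end = total_steps - decay_steps
    return optax.join_schedules(
        schedules=[
            optax.linear_schedule(0.0, peak_lr, warmup_steps),
            optax.constant_schedule(peak_lr),
            optax.linear_schedule(peak_lr, 0.0, decay_steps),
        ],
        boundaries=[warmup_steps, stable_end],
    )
```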
Training Strategies
This repo contains configs for multiple compute budgets:
| Strategy | Hours | Tokens | Config File |
|---|---|---|---|
| 🏃 Quick (17h) | 17h | ~6.5B | configs/flint_17h.yaml |
| 🏋️ Standard (40h) | 40h | ~15B | configs/flint_40h.yaml |
| 🚀 Full (60h) | 60h | ~23B | configs/flint_60h.yaml |
| 🔬 Research (100h+) | 100h+ | ~38B+ | configs/flint_100h.yaml |
📚 Data: Thought-Action Pretraining (TAP) Curriculum
17-Hour Recipe (3 Stages, ~6.5B tokens)
Stage 1 — Foundation (0–55%):
| Source | % | Dataset |
|---|---|---|
| FineWeb-Edu (score ≥ 3) | 45% | HuggingFaceFW/fineweb-edu |
| DCLM-Baseline | 30% | mlfoundations/dclm-baseline-1.0-parquet |
| Cosmopedia-v2 | 8% | HuggingFaceTB/smollm-corpus cosmopedia-v2 |
| OpenThoughts-114k 🧠 | 10% | open-thoughts/OpenThoughts-114k |
| SmolTalk apigen-80k 🔧 | 5% | HuggingFaceTB/smoltalk apigen-80k |
| Python-Edu | 2% | HuggingFaceTB/smollm-corpus python-edu |
Stage 2 — Capability Injection (55–85%):
| Source | % | Dataset |
|---|---|---|
| Web blend (FW-Edu + DCLM) | 35% | same |
| OpenThoughts 🧠 | 18% | same (2nd epoch) |
| Orca-AgentInstruct 🤖 | 12% | microsoft/orca-agentinstruct-1M-v1 |
| FineMath-3plus | 12% | HuggingFaceTB/finemath finemath-3plus |
| SmolTalk apigen 🔧 | 8% | same (2nd epoch) |
| OpenWebMath | 5% | open-web-math/open-web-math |
| Multi-lang code | 10% | StarCoderData |
Stage 3 — Annealing (85–100%, during LR decay):
| Source | % | Dataset |
|---|---|---|
| FineMath-4plus | 25% | HuggingFaceTB/finemath finemath-4plus |
| Hard reasoning 🧠 | 20% | OpenThoughts (math/code subset) |
| FineWeb-Edu score ≥ 4 | 15% | filtered |
| AgentInstruct analytical 🤖 | 15% | analytical_reasoning + code_ |
| SmolTalk full | 10% | all config |
| DCLM top (fasttext > 0.5) | 10% | filtered |
| Code (Python + TS) | 5% | Stack-Edu |
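One way to realize these staged mixtures is to stream each source and interleave by probability, swapping the weight table at stage boundaries. The sketch below uses the Stage 1 weights; dataset and config names come from the tables, while splits and field handling are assumptions (in practice each stream is first mapped to a common text column so the schemas match before interleaving).

```python
from datasets import load_dataset, interleave_datasets

# Stage 1 mixture from the table above (name, config, sampling probability).
stage1 = [
    ("HuggingFaceFW/fineweb-edu", None, 0.45),
    ("mlfoundations/dclm-baseline-1.0-parquet", None, 0.30),
    ("HuggingFaceTB/smollm-corpus", "cosmopedia-v2", 0.08),
    ("open-thoughts/OpenThoughts-114k", None, 0.10),
    ("HuggingFaceTB/smoltalk", "apigen-80k", 0.05),
    ("HuggingFaceTB/smollm-corpus", "python-edu", 0.02),
]

streams = [load_dataset(name, config, split="train", streaming=True)
           for name, config, _ in stage1]

mixture = interleave_datasets(
    streams,
    probabilities=[p for *_, p in stage1],
    seed=42,
    stopping_strategy="all_exhausted",   # keep sampling until the largest source runs out
)
```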
60-Hour Recipe (5 Stages, ~23B tokens)
See configs/flint_60h.yaml for the full 5-stage curriculum adding:
- Stage 3: Agency — Heavy tool + planning focus
- Stage 4: Polish — Balanced excellence across all capabilities
- Stage 5: Crystal — Highest-quality annealing
Additional datasets unlocked at 60h:
- allenai/dolmino-mix-1124 (OLMo-2's mid-training mix)
- Wikipedia + Wikibooks
- bigcode/the-stack-v2-train-smol-ids (15 languages)
- arXiv papers (peS2o)
🧠 TAP Data Formatting
Reasoning traces:
```
<think>
Let me break this problem down step by step.
The equation is 2x + 5 = 13.
Subtract 5: 2x = 8. Divide by 2: x = 4.
Verify: 2(4) + 5 = 13 ✓
</think>
The solution is x = 4.
```
Tool calls:
```
User: What's the weather in Tokyo?
<think>
I need real-time data. Let me call the weather API.
</think>
<tool_call>
{"name": "get_weather", "arguments": {"city": "Tokyo"}}
</tool_call>
<tool_response>
{"temperature": 22, "condition": "partly cloudy"}
</tool_response>
It's 22°C and partly cloudy in Tokyo.
```
Agentic loops:
```
<think>
I need to: 1) Search for the paper 2) Read key sections 3) Summarize
</think>
<act>search_papers("quantum error correction 2024")</act>
<observe>Found: [1] "Threshold-free QEC"...</observe>
<think>
Paper [1] is most relevant. Let me read the methods section.
</think>
<act>read_paper("arxiv:2401.xxxxx", section="methods")</act>
<observe>The approach uses surface codes with...</observe>
Here's my summary: ...
```
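Each TAP tag is meant to be a single token in Flint's vocabulary. If you were reproducing the setup on top of an existing tokenizer, registering the tags could look like the sketch below; the base tokenizer is a placeholder, not necessarily what Flint was built from.

```python
from transformers import AutoTokenizer

# TAP control tags, registered so each tag maps to exactly one token id.
TAP_TAGS = [
    "<think>", "</think>",
    "<tool_call>", "</tool_call>",
    "<tool_response>", "</tool_response>",
    "<act>", "</act>",
    "<observe>", "</observe>",
]

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-1.7B")  # placeholder base tokenizer
tokenizer.add_special_tokens({"additional_special_tokens": TAP_TAGS})
# If the tags are added after a model already exists, resize its embeddings:
# model.resize_token_embeddings(len(tokenizer))
```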
🔄 Checkpointing
- Save every 200 steps (~35 min of training)
- Keep last 5 on disk
- Push to Hub every 1000 steps for disaster recovery
- Full state: weights + optimizer + RNG + data position
- Async saving via Orbax (non-blocking)
Designed for Kaggle's session limits — training resumes automatically from latest checkpoint.
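A minimal Orbax sketch of that policy follows; the checkpoint directory and the train-state structure are assumptions, and the in-loop calls are shown as comments.

```python
import orbax.checkpoint as ocp

# Keep 5 checkpoints, save every 200 steps, write asynchronously.
options = ocp.CheckpointManagerOptions(
    max_to_keep=5,
    save_interval_steps=200,
    enable_async_checkpointing=True,
)
mngr = ocp.CheckpointManager("/kaggle/working/checkpoints", options=options)

# Inside the training loop:
#   mngr.save(step, args=ocp.args.StandardSave(train_state))
# On (re)start, resume from the latest checkpoint if one exists:
#   if mngr.latest_step() is not None:
#       train_state = mngr.restore(mngr.latest_step(),
#                                  args=ocp.args.StandardRestore(train_state))
```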
📈 Expected Performance
| Benchmark | Flint 17h (6.5B tok) | Flint 60h (23B tok) | SmolLM2-1.7B (11T tok) | Llama-3.2-1B |
|---|---|---|---|---|
| MMLU | 34-38% | 42-47% | 51.7% | 32.2% |
| HellaSwag | 53-58% | 62-66% | 68.7% | 61.6% |
| ARC-Challenge | 38-43% | 48-53% | 57.1% | 41.6% |
| GSM8K (CoT) | 12-18% | 22-30% | 31.0% | 6.5% |
| HumanEval | 15-22% | 25-32% | 23.2% | ~12% |
| Tool-use 🔧 | 45-55% | 60-70% | N/A | N/A |
| Agentic 🤖 | 40-50% | 50-60% | N/A | N/A |
🚀 Quick Start
```bash
# Training (on Kaggle TPU v5e-8)
!pip install jax[tpu] -f https://storage.googleapis.com/jax-releases/libtpu_releases.html
!pip install optax orbax-checkpoint transformers datasets flax

# Run training with auto-resume
python train_flint.py --config configs/flint_60h.yaml

# Resume from checkpoint (automatic)
python train_flint.py --config configs/flint_60h.yaml --resume
```
🗺️ Post-Training Roadmap
| Phase | Method | Goal |
|---|---|---|
| SFT | SmolTalk full | Instruction following |
| GRPO | DeepSeek-R1 recipe | Self-improving reasoning |
| Tool DPO | Correct vs incorrect tool calls | Reliable tool use |
| Context extension | RoPE scaling | Extend the 2K base context toward 8K-16K |
| Quantization | INT4 GPTQ/AWQ | Edge deployment |
📄 Citation
```bibtex
@misc{flint2025,
  title={Flint-1.2B: Reasoning-Native Agentic Small Language Model},
  author={tekkmaven},
  year={2025},
  note={Thought-Action Pretraining (TAP) on Kaggle TPU v5e-8}
}
```
📖 Key References
- SmolLM2 (arxiv:2502.02737) — Data curriculum & mixing ratios
- Moonlight/Muon (arxiv:2502.16982) — 2× sample-efficient optimizer
- 1.5-Pints (arxiv:2408.03506) — Quality-focused pretraining on limited compute
- DCLM (arxiv:2406.11794) — Classifier-filtered web data
- OLMo-2 (arxiv:2501.00656) — WSD schedule + mid-training annealing
- DeepSeek-R1 (arxiv:2501.12948) — Reasoning via `<think>` tags
- Quiet-STaR (arxiv:2403.09629) — Internal rationale generation
- Phi-4-Mini-Reasoning (arxiv:2504.21233) — Distillation-as-midtraining
- Beyond Chinchilla (arxiv:2401.00448) — Inference-optimal overtraining
- Overtraining Scaling (arxiv:2403.08540) — Reliable overtrained scaling laws
License
Apache 2.0
Generated by ML Intern
This model repository was generated by ML Intern, an agent for machine learning research and development on the Hugging Face Hub.
- Try ML Intern: https://smolagents-ml-intern.hf.space
- Source code: https://github.com/huggingface/ml-intern
Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tekkmaven/flint-1.2B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
```
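Continuing from the snippet above, a quick generation check might look like this; the prompt format is an assumption, since the base model is pretrained rather than instruction-tuned.

```python
# Prompt the base model to open a <think> block (hypothetical prompt format).
prompt = "User: Solve 2x + 5 = 13.\n<think>\n"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=False))
```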