🔥 Flint-1.2B — Reasoning-Native Agentic Small Language Model

"Strikes sparks from nothing. Small, sharp, starts fires."

Flint-1.2B is a small language model pretrained from scratch with a novel training methodology called Thought-Action Pretraining (TAP): the model learns to reason internally (`<think>`) and use tools (`<tool_call>`) as part of its base language-modeling objective, not as a post-training bolt-on.

🎯 The Edge

Most models learn reasoning in post-training (RLHF/GRPO). Most learn tool use in fine-tuning. Flint learns both during pretraining. Thinking and acting are as natural to it as punctuation.

| Capability | How It's Trained | Reference |
|---|---|---|
| `<think>...</think>` internal reasoning | LM objective on 114K reasoning traces from step 0 | Phi-4-Mini-Reasoning, Quiet-STaR |
| `<tool_call>...</tool_call>` tool use | Native vocabulary, trained alongside prose | SmolTalk/APIGen |
| `<act>...<observe>...` agent loops | Multi-step trajectories as standard text | Orca-AgentInstruct |
| Think-before-act pattern | Model sees that thinking precedes better actions | DeepSeek-R1 |

📐 Architecture

| Parameter | Value |
|---|---|
| Parameters | 1.24B |
| Hidden dim | 2048 |
| FFN dim (SwiGLU) | 5504 |
| Layers | 22 |
| Attention heads | 32 |
| KV heads (GQA) | 4 |
| Activation | SwiGLU |
| Normalization | RMSNorm (pre-norm) |
| Position encoding | RoPE (θ=500,000) |
| Vocab size | 49,216 |
| Context length | 2048 |
| Embedding tying | Yes |

Design philosophy: Wide-and-shallow (22 layers × 5504 FFN vs typical 24 × 4096). Wider MLPs store more factual knowledge per parameter — critical for small models (1.5-Pints, arxiv:2408.03506).
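The GQA choice (32 query heads sharing 4 KV heads) also shrinks the inference KV cache 8× relative to full multi-head attention. A back-of-the-envelope sketch, with all numbers taken from the architecture table above:

```python
# Per-token KV-cache size for Flint-1.2B, per the architecture table.
layers = 22
hidden = 2048
q_heads = 32
kv_heads = 4                      # GQA: 32 query heads share 4 KV heads
head_dim = hidden // q_heads      # 64
bytes_per_val = 2                 # BF16

# K and V caches, per token, summed across all layers
kv_cache_per_token = 2 * layers * kv_heads * head_dim * bytes_per_val
mha_cache_per_token = 2 * layers * q_heads * head_dim * bytes_per_val

print(kv_cache_per_token)                         # 22528 bytes (~22 KiB/token)
print(mha_cache_per_token // kv_cache_per_token)  # 8 (reduction vs full MHA)
```

At the 2048-token context limit this is under 50 MB of cache, which matters for the edge-deployment goal in the roadmap below.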

⚡ Training

Hardware

  • Kaggle TPU v5e-8 (8 chips, 128GB HBM total, ~1,576 TFLOPS BF16)
  • Framework: JAX/SPMD with Levanter-style training loop
  • Precision: BF16 native

Optimizer: Muon + AdamW Hybrid

Following Moonlight (arXiv:2502.16982), this hybrid achieves roughly 2× the sample efficiency of AdamW alone:

| Component | Optimizer | Settings |
|---|---|---|
| Weight matrices (Q/K/V/O, FFN) | Muon | momentum=0.95, 5 Newton-Schulz iters, wd=0.05 |
| Embeddings, norms, LM head | AdamW | β₁=0.9, β₂=0.95, wd=0.1 |
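The core of the Muon update is approximately orthogonalizing each weight-matrix gradient with a few Newton-Schulz iterations. A NumPy sketch of that inner step only (the quintic coefficients follow the public Muon reference implementation; momentum and the AdamW split live in the actual training code):

```python
import numpy as np

def newton_schulz(G, steps=5, eps=1e-7):
    """Approximately orthogonalize G (push its singular values toward 1),
    as used inside the Muon update. Coefficients from the public Muon
    reference implementation."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + eps)   # normalize so singular values <= 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:                      # iterate on the short side
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * (A @ A)
        X = a * X + B @ X
    return X.T if transposed else X

rng = np.random.default_rng(0)
O = newton_schulz(rng.standard_normal((64, 256)))
print(np.linalg.svd(O, compute_uv=False))  # singular values cluster near 1
```

Five iterations is enough to get near-orthogonal updates; the result is approximate by design, which is why it runs cheaply in BF16 on TPU.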

Learning Rate: WSD (Warmup-Stable-Decay)

  • Warmup: 2,000 steps (linear 0 → 5e-4)
  • Stable: constant 5e-4
  • Decay: linear to 0 over final 13-15% of training
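The schedule above reduces to a plain step-to-LR function; a sketch (the total step count and the 15% decay fraction here are illustrative, the real values live in the YAML configs):

```python
def wsd_lr(step, total_steps, peak_lr=5e-4, warmup_steps=2000, decay_frac=0.15):
    """Warmup-Stable-Decay: linear warmup, constant plateau,
    linear decay to 0 over the final `decay_frac` of training."""
    decay_start = int(total_steps * (1 - decay_frac))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps          # warmup: 0 -> peak
    if step < decay_start:
        return peak_lr                                # stable plateau
    return peak_lr * max(0.0, (total_steps - step) / (total_steps - decay_start))

total = 40_000
print(wsd_lr(1_000, total))   # 2.5e-4 (halfway through warmup)
print(wsd_lr(20_000, total))  # 5e-4   (stable plateau)
print(wsd_lr(40_000, total))  # 0.0    (fully decayed)
```

Unlike cosine, WSD lets the annealing data mixture (Stage 3 below) be swapped in exactly when the decay phase starts.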

Training Strategies

This repo contains configs for multiple compute budgets:

| Strategy | Hours | Tokens | Config File |
|---|---|---|---|
| 🏃 Quick | 17h | ~6.5B | configs/flint_17h.yaml |
| 🏋️ Standard | 40h | ~15B | configs/flint_40h.yaml |
| 🚀 Full | 60h | ~23B | configs/flint_60h.yaml |
| 🔬 Research | 100h+ | ~38B+ | configs/flint_100h.yaml |

📚 Data: Thought-Action Pretraining (TAP) Curriculum

17-Hour Recipe (3 Stages, ~6.5B tokens)

Stage 1 — Foundation (0–55%):

| Source | % | Dataset |
|---|---|---|
| FineWeb-Edu (score ≥ 3) | 45% | HuggingFaceFW/fineweb-edu |
| DCLM-Baseline | 30% | mlfoundations/dclm-baseline-1.0-parquet |
| Cosmopedia-v2 | 8% | HuggingFaceTB/smollm-corpus cosmopedia-v2 |
| OpenThoughts-114k 🧠 | 10% | open-thoughts/OpenThoughts-114k |
| SmolTalk apigen-80k 🔧 | 5% | HuggingFaceTB/smoltalk apigen-80k |
| Python-Edu | 2% | HuggingFaceTB/smollm-corpus python-edu |

Stage 2 — Capability Injection (55–85%):

| Source | % | Dataset |
|---|---|---|
| Web blend (FW-Edu + DCLM) | 35% | same |
| OpenThoughts 🧠 | 18% | same (2nd epoch) |
| Orca-AgentInstruct 🤖 | 12% | microsoft/orca-agentinstruct-1M-v1 |
| FineMath-3plus | 12% | HuggingFaceTB/finemath finemath-3plus |
| SmolTalk apigen 🔧 | 8% | same (2nd epoch) |
| OpenWebMath | 5% | open-web-math/open-web-math |
| Multi-lang code | 10% | StarCoderData |

Stage 3 — Annealing (85–100%, during LR decay):

| Source | % | Dataset |
|---|---|---|
| FineMath-4plus | 25% | HuggingFaceTB/finemath finemath-4plus |
| Hard reasoning 🧠 | 20% | OpenThoughts (math/code subset) |
| FineWeb-Edu score ≥ 4 | 15% | filtered |
| AgentInstruct analytical 🤖 | 15% | analytical_reasoning + code_ |
| SmolTalk full | 10% | all configs |
| DCLM top (fasttext > 0.5) | 10% | filtered |
| Code (Python + TS) | 5% | Stack-Edu |
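Stage switching is just a progress-indexed lookup of per-source sampling weights. A minimal sketch, with stage boundaries from the 17h curriculum above (the weight dicts are abbreviated to a few sources for illustration; the full mixtures live in the config files):

```python
# (boundary, source -> sampling weight); boundaries are fractions of
# total training tokens, matching the 17h TAP curriculum stages.
STAGES = [
    (0.55, {"fineweb_edu": 0.45, "dclm": 0.30, "other": 0.25}),              # Foundation
    (0.85, {"web_blend": 0.35, "openthoughts": 0.18, "other": 0.47}),        # Capability
    (1.00, {"finemath_4plus": 0.25, "hard_reasoning": 0.20, "other": 0.55}), # Annealing
]

def mixture_for(progress):
    """Return the source->weight dict for training progress in [0, 1]."""
    for boundary, weights in STAGES:
        if progress < boundary:
            return weights
    return STAGES[-1][1]  # progress == 1.0 falls into the final stage

print(mixture_for(0.30))  # Stage 1 weights
print(mixture_for(0.90))  # Stage 3 weights
```

The dataloader then samples the next document source from these weights, so a curriculum change is a pure config edit with no code change.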

60-Hour Recipe (5 Stages, ~23B tokens)

See configs/flint_60h.yaml for the full 5-stage curriculum adding:

  • Stage 3: Agency — Heavy tool + planning focus
  • Stage 4: Polish — Balanced excellence across all capabilities
  • Stage 5: Crystal — Highest-quality annealing

Additional datasets unlocked at 60h:

  • allenai/dolmino-mix-1124 (OLMo2's mid-training mix)
  • Wikipedia + Wikibooks
  • bigcode/the-stack-v2-train-smol-ids (15 languages)
  • arXiv papers (peS2o)

🧠 TAP Data Formatting

Reasoning traces:

```
<think>
Let me break this problem down step by step.
The equation is 2x + 5 = 13.
Subtract 5: 2x = 8. Divide by 2: x = 4.
Verify: 2(4) + 5 = 13 ✓
</think>

The solution is x = 4.
```

Tool calls:

```
User: What's the weather in Tokyo?

<think>
I need real-time data. Let me call the weather API.
</think>

<tool_call>
{"name": "get_weather", "arguments": {"city": "Tokyo"}}
</tool_call>

<tool_response>
{"temperature": 22, "condition": "partly cloudy"}
</tool_response>

It's 22°C and partly cloudy in Tokyo.
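At inference time a `<tool_call>` block like the one above is just JSON to be routed to a registered function. A hedged dispatcher sketch (the `@tool` registry and the `get_weather` stub are illustrative, not part of this repo):

```python
import json
import re

TOOLS = {}

def tool(fn):
    """Register a function so the model can invoke it via <tool_call> JSON."""
    TOOLS[fn.__name__] = fn
    return fn

@tool
def get_weather(city):
    # Stub for illustration; a real tool would call a weather API.
    return {"temperature": 22, "condition": "partly cloudy"}

def run_tool_calls(text):
    """Extract <tool_call> blocks from model output, execute them, and
    return the <tool_response> blocks to feed back into the context."""
    responses = []
    for payload in re.findall(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", text, re.S):
        call = json.loads(payload)
        result = TOOLS[call["name"]](**call["arguments"])
        responses.append(f"<tool_response>\n{json.dumps(result)}\n</tool_response>")
    return responses

out = run_tool_calls(
    '<tool_call>\n{"name": "get_weather", "arguments": {"city": "Tokyo"}}\n</tool_call>'
)
print(out[0])
```

Because the tags are native vocabulary items, no stop-sequence tricks are needed: the harness simply watches for a closing `</tool_call>`.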

Agentic loops:

```
<think>
I need to: 1) Search for the paper 2) Read key sections 3) Summarize
</think>

<act>search_papers("quantum error correction 2024")</act>
<observe>Found: [1] "Threshold-free QEC"...</observe>

<think>
Paper [1] is most relevant. Let me read the methods section.
</think>

<act>read_paper("arxiv:2401.xxxxx", section="methods")</act>
<observe>The approach uses surface codes with...</observe>

Here's my summary: ...
```
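An agent harness serving a Flint-style model needs to detect `<act>` calls, execute them, and append `<observe>` blocks to the context. A minimal parsing sketch (tag names follow the format above; the executor is a stand-in):

```python
import re

ACT_RE = re.compile(r"<act>(.*?)</act>", re.S)

def agent_step(model_output, execute):
    """If the model emitted an <act>...</act>, run it and return the
    <observe> block to append; return None when the turn is finished."""
    match = ACT_RE.search(model_output)
    if match is None:
        return None                            # no action: final answer reached
    observation = execute(match.group(1).strip())
    return f"<observe>{observation}</observe>"

# Stand-in executor mapping one canned action to a canned result.
fake_tools = {'search_papers("quantum error correction 2024")':
              'Found: [1] "Threshold-free QEC"...'}
step = agent_step('<act>search_papers("quantum error correction 2024")</act>',
                  fake_tools.get)
print(step)  # <observe>Found: [1] "Threshold-free QEC"...</observe>
```

The outer loop alternates generate / `agent_step` until `agent_step` returns `None`, at which point the remaining text is the final answer.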

🔄 Checkpointing

  • Save every 200 steps (~35 min of training)
  • Keep last 5 on disk
  • Push to Hub every 1000 steps for disaster recovery
  • Full state: weights + optimizer + RNG + data position
  • Async saving via Orbax (non-blocking)

Designed for Kaggle's session limits — training resumes automatically from latest checkpoint.
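The retention policy above (save every 200 steps, keep the last 5, push every 1,000) reduces to a few predicates. A sketch of that policy logic only; in the real loop Orbax performs the async save itself:

```python
def should_save(step, interval=200):
    """Local checkpoint every `interval` steps (~35 min at this scale)."""
    return step > 0 and step % interval == 0

def should_push_to_hub(step, interval=1000):
    """Hub push every `interval` steps for disaster recovery."""
    return step > 0 and step % interval == 0

def prune_checkpoints(saved_steps, keep=5):
    """Return the checkpoint steps to delete, keeping only the newest `keep`."""
    return sorted(saved_steps)[:-keep]

saved = [200, 400, 600, 800, 1000, 1200, 1400]
print(should_save(1200))         # True
print(should_push_to_hub(1200))  # False
print(prune_checkpoints(saved))  # [200, 400] -> delete, keep the last five
```

Keeping optimizer state, RNG, and the data-stream position in the same checkpoint is what makes resumption bit-exact after a Kaggle session cut.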

📈 Expected Performance

| Benchmark | 17h (6.5B tok) | 60h (23B tok) | SmolLM2-1.7B (11T tok) | Llama-3.2-1B |
|---|---|---|---|---|
| MMLU | 34-38% | 42-47% | 51.7% | 32.2% |
| HellaSwag | 53-58% | 62-66% | 68.7% | 61.6% |
| ARC-Challenge | 38-43% | 48-53% | 57.1% | 41.6% |
| GSM8K (CoT) | 12-18% | 22-30% | 31.0% | 6.5% |
| HumanEval | 15-22% | 25-32% | 23.2% | ~12% |
| Tool-use 🔧 | 45-55% | 60-70% | N/A | N/A |
| Agentic 🤖 | 40-50% | 50-60% | N/A | N/A |

🚀 Quick Start

```bash
# Training (on Kaggle TPU v5e-8)
pip install "jax[tpu]" -f https://storage.googleapis.com/jax-releases/libtpu_releases.html
pip install optax orbax-checkpoint transformers datasets flax

# Run training with auto-resume
python train_flint.py --config configs/flint_60h.yaml

# Resume from checkpoint (automatic)
python train_flint.py --config configs/flint_60h.yaml --resume
```

🗺️ Post-Training Roadmap

| Phase | Method | Goal |
|---|---|---|
| SFT | SmolTalk full | Instruction following |
| GRPO | DeepSeek-R1 recipe | Self-improving reasoning |
| Tool DPO | Correct vs incorrect tool calls | Reliable tool use |
| Context extension | RoPE scaling | 8K → 16K context |
| Quantization | INT4 GPTQ/AWQ | Edge deployment |

📄 Citation

```bibtex
@misc{flint2025,
  title={Flint-1.2B: Reasoning-Native Agentic Small Language Model},
  author={tekkmaven},
  year={2025},
  note={Thought-Action Pretraining (TAP) on Kaggle TPU v5e-8}
}
```

📖 Key References

  • SmolLM2 (arxiv:2502.02737) — Data curriculum & mixing ratios
  • Moonlight/Muon (arxiv:2502.16982) — 2× sample-efficient optimizer
  • 1.5-Pints (arxiv:2408.03506) — Quality-focused pretraining on limited compute
  • DCLM (arxiv:2406.11794) — Classifier-filtered web data
  • OLMo-2 (arxiv:2501.00656) — WSD schedule + mid-training annealing
  • DeepSeek-R1 (arxiv:2501.12948) — Reasoning via <think> tags
  • Quiet-STaR (arxiv:2403.09629) — Internal rationale generation
  • Phi-4-Mini-Reasoning (arxiv:2504.21233) — Distillation-as-midtraining
  • Beyond Chinchilla (arxiv:2401.00448) — Inference-optimal overtraining
  • Overtraining Scaling (arxiv:2403.08540) — Reliable overtrained scaling laws

License

Apache 2.0

Generated by ML Intern

This model repository was generated by ML Intern, an agent for machine learning research and development on the Hugging Face Hub.

Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tekkmaven/flint-1.2B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
```

