🔥 Flint-1.2B — Reasoning-Native Agentic Small Language Model

"Strikes sparks from nothing. Small, sharp, starts fires."

Flint-1.2B is a small language model pretrained from scratch with a novel training methodology called Thought-Action Pretraining (TAP): the model learns to reason internally (`<think>`) and use tools (`<tool_call>`) as part of its base language-modeling objective, not as a post-training bolt-on.

🎯 The Edge

Most models learn reasoning in post-training (RLHF/GRPO). Most learn tool use in fine-tuning. Flint learns both during pretraining. Thinking and acting are as natural to it as punctuation.

| Capability | How It's Trained | Reference |
|---|---|---|
| `<think>...</think>` internal reasoning | LM objective on 114K reasoning traces from step 0 | Phi-4-Mini-Reasoning, Quiet-STaR |
| `<tool_call>...</tool_call>` tool use | Native vocabulary, trained alongside prose | SmolTalk/APIGen |
| `<act>...<observe>...` agent loops | Multi-step trajectories as standard text | Orca-AgentInstruct |
| Think-before-act pattern | Model sees that thinking precedes better actions | DeepSeek-R1 |

📐 Architecture

| Parameter | Value |
|---|---|
| Parameters | 1.24B |
| Hidden dim | 2048 |
| FFN dim (SwiGLU) | 5504 |
| Layers | 22 |
| Attention heads | 32 |
| KV heads (GQA) | 4 |
| Activation | SwiGLU |
| Normalization | RMSNorm (pre-norm) |
| Position encoding | RoPE (θ=500,000) |
| Vocab size | 49,216 |
| Context length | 2048 |
| Embedding tying | Yes |

Design philosophy: Wide-and-shallow (22 layers × 5504 FFN vs typical 24 × 4096). Wider MLPs store more factual knowledge per parameter — critical for small models (1.5-Pints, arxiv:2408.03506).
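The GQA choice (32 query heads sharing 4 KV heads) also shrinks the inference KV cache 8× relative to full multi-head attention. A back-of-the-envelope sketch, with all numbers taken from the architecture table above:

```python
# Per-token KV-cache size for Flint-1.2B, per the architecture table.
layers = 22
hidden = 2048
q_heads = 32
kv_heads = 4                      # GQA: 32 query heads share 4 KV heads
head_dim = hidden // q_heads      # 64
bytes_per_val = 2                 # BF16

# K and V caches, per token, summed across all layers
kv_cache_per_token = 2 * layers * kv_heads * head_dim * bytes_per_val
mha_cache_per_token = 2 * layers * q_heads * head_dim * bytes_per_val

print(kv_cache_per_token)                         # 22528 bytes (~22 KiB/token)
print(mha_cache_per_token // kv_cache_per_token)  # 8 (reduction vs full MHA)
```

At the 2048-token context limit this is under 50 MB of cache, which matters for the edge-deployment goal in the roadmap below.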

⚡ Training

Hardware

  • Kaggle TPU v5e-8 (8 chips, 128GB HBM total, ~1,576 TFLOPS BF16)
  • Framework: JAX/SPMD with Levanter-style training loop
  • Precision: BF16 native

Optimizer: Muon + AdamW Hybrid

Following Moonlight (arXiv:2502.16982), this hybrid achieves roughly 2× the sample efficiency of AdamW alone:

| Component | Optimizer | Settings |
|---|---|---|
| Weight matrices (Q/K/V/O, FFN) | Muon | momentum=0.95, 5 Newton-Schulz iters, wd=0.05 |
| Embeddings, norms, LM head | AdamW | β₁=0.9, β₂=0.95, wd=0.1 |
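The core of the Muon update is approximately orthogonalizing each weight-matrix gradient with a few Newton-Schulz iterations. A NumPy sketch of that inner step only (the quintic coefficients follow the public Muon reference implementation; momentum and the AdamW split live in the actual training code):

```python
import numpy as np

def newton_schulz(G, steps=5, eps=1e-7):
    """Approximately orthogonalize G (push its singular values toward 1),
    as used inside the Muon update. Coefficients from the public Muon
    reference implementation."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + eps)   # normalize so singular values <= 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:                      # iterate on the short side
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * (A @ A)
        X = a * X + B @ X
    return X.T if transposed else X

rng = np.random.default_rng(0)
O = newton_schulz(rng.standard_normal((64, 256)))
print(np.linalg.svd(O, compute_uv=False))  # singular values cluster near 1
```

Five iterations is enough to get near-orthogonal updates; the result is approximate by design, which is why it runs cheaply in BF16 on TPU.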

Learning Rate: WSD (Warmup-Stable-Decay)

  • Warmup: 2,000 steps (linear 0 → 5e-4)
  • Stable: constant 5e-4
  • Decay: linear to 0 over final 13-15% of training
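The schedule above reduces to a plain step-to-LR function; a sketch (the total step count and the 15% decay fraction here are illustrative, the real values live in the YAML configs):

```python
def wsd_lr(step, total_steps, peak_lr=5e-4, warmup_steps=2000, decay_frac=0.15):
    """Warmup-Stable-Decay: linear warmup, constant plateau,
    linear decay to 0 over the final `decay_frac` of training."""
    decay_start = int(total_steps * (1 - decay_frac))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps          # warmup: 0 -> peak
    if step < decay_start:
        return peak_lr                                # stable plateau
    return peak_lr * max(0.0, (total_steps - step) / (total_steps - decay_start))

total = 40_000
print(wsd_lr(1_000, total))   # 2.5e-4 (halfway through warmup)
print(wsd_lr(20_000, total))  # 5e-4   (stable plateau)
print(wsd_lr(40_000, total))  # 0.0    (fully decayed)
```

Unlike cosine, WSD lets the annealing data mixture (Stage 3 below) be swapped in exactly when the decay phase starts.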

Training Strategies

This repo contains configs for multiple compute budgets:

| Strategy | Hours | Tokens | Config File |
|---|---|---|---|
| 🏃 Quick | 17h | ~6.5B | configs/flint_17h.yaml |
| 🏋️ Standard | 40h | ~15B | configs/flint_40h.yaml |
| 🚀 Full | 60h | ~23B | configs/flint_60h.yaml |
| 🔬 Research | 100h+ | ~38B+ | configs/flint_100h.yaml |

📚 Data: Thought-Action Pretraining (TAP) Curriculum

17-Hour Recipe (3 Stages, ~6.5B tokens)

Stage 1 — Foundation (0–55%):

| Source | % | Dataset |
|---|---|---|
| FineWeb-Edu (score ≥ 3) | 45% | HuggingFaceFW/fineweb-edu |
| DCLM-Baseline | 30% | mlfoundations/dclm-baseline-1.0-parquet |
| Cosmopedia-v2 | 8% | HuggingFaceTB/smollm-corpus cosmopedia-v2 |
| OpenThoughts-114k 🧠 | 10% | open-thoughts/OpenThoughts-114k |
| SmolTalk apigen-80k 🔧 | 5% | HuggingFaceTB/smoltalk apigen-80k |
| Python-Edu | 2% | HuggingFaceTB/smollm-corpus python-edu |

Stage 2 — Capability Injection (55–85%):

| Source | % | Dataset |
|---|---|---|
| Web blend (FW-Edu + DCLM) | 35% | same |
| OpenThoughts 🧠 | 18% | same (2nd epoch) |
| Orca-AgentInstruct 🤖 | 12% | microsoft/orca-agentinstruct-1M-v1 |
| FineMath-3plus | 12% | HuggingFaceTB/finemath finemath-3plus |
| SmolTalk apigen 🔧 | 8% | same (2nd epoch) |
| OpenWebMath | 5% | open-web-math/open-web-math |
| Multi-lang code | 10% | StarCoderData |

Stage 3 — Annealing (85–100%, during LR decay):

| Source | % | Dataset |
|---|---|---|
| FineMath-4plus | 25% | HuggingFaceTB/finemath finemath-4plus |
| Hard reasoning 🧠 | 20% | OpenThoughts (math/code subset) |
| FineWeb-Edu score ≥ 4 | 15% | filtered |
| AgentInstruct analytical 🤖 | 15% | analytical_reasoning + code_ |
| SmolTalk full | 10% | all configs |
| DCLM top (fasttext > 0.5) | 10% | filtered |
| Code (Python + TS) | 5% | Stack-Edu |
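Stage switching is just a progress-indexed lookup of per-source sampling weights. A minimal sketch, with stage boundaries from the 17h curriculum above (the weight dicts are abbreviated to a few sources for illustration; the full mixtures live in the config files):

```python
# (boundary, source -> sampling weight); boundaries are fractions of
# total training tokens, matching the 17h TAP curriculum stages.
STAGES = [
    (0.55, {"fineweb_edu": 0.45, "dclm": 0.30, "other": 0.25}),              # Foundation
    (0.85, {"web_blend": 0.35, "openthoughts": 0.18, "other": 0.47}),        # Capability
    (1.00, {"finemath_4plus": 0.25, "hard_reasoning": 0.20, "other": 0.55}), # Annealing
]

def mixture_for(progress):
    """Return the source->weight dict for training progress in [0, 1]."""
    for boundary, weights in STAGES:
        if progress < boundary:
            return weights
    return STAGES[-1][1]  # progress == 1.0 falls into the final stage

print(mixture_for(0.30))  # Stage 1 weights
print(mixture_for(0.90))  # Stage 3 weights
```

The dataloader then samples the next document source from these weights, so a curriculum change is a pure config edit with no code change.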

60-Hour Recipe (5 Stages, ~23B tokens)

See configs/flint_60h.yaml for the full 5-stage curriculum adding:

  • Stage 3: Agency — Heavy tool + planning focus
  • Stage 4: Polish — Balanced excellence across all capabilities
  • Stage 5: Crystal — Highest-quality annealing

Additional datasets unlocked at 60h:

  • allenai/dolmino-mix-1124 (OLMo2's mid-training mix)
  • Wikipedia + Wikibooks
  • bigcode/the-stack-v2-train-smol-ids (15 languages)
  • arXiv papers (peS2o)

🧠 TAP Data Formatting

Reasoning traces:

```
<think>
Let me break this problem down step by step.
The equation is 2x + 5 = 13.
Subtract 5: 2x = 8. Divide by 2: x = 4.
Verify: 2(4) + 5 = 13 ✓
</think>

The solution is x = 4.
```

Tool calls:

```
User: What's the weather in Tokyo?

<think>
I need real-time data. Let me call the weather API.
</think>

<tool_call>
{"name": "get_weather", "arguments": {"city": "Tokyo"}}
</tool_call>

<tool_response>
{"temperature": 22, "condition": "partly cloudy"}
</tool_response>

It's 22°C and partly cloudy in Tokyo.
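At inference time a `<tool_call>` block like the one above is just JSON to be routed to a registered function. A hedged dispatcher sketch (the `@tool` registry and the `get_weather` stub are illustrative, not part of this repo):

```python
import json
import re

TOOLS = {}

def tool(fn):
    """Register a function so the model can invoke it via <tool_call> JSON."""
    TOOLS[fn.__name__] = fn
    return fn

@tool
def get_weather(city):
    # Stub for illustration; a real tool would call a weather API.
    return {"temperature": 22, "condition": "partly cloudy"}

def run_tool_calls(text):
    """Extract <tool_call> blocks from model output, execute them, and
    return the <tool_response> blocks to feed back into the context."""
    responses = []
    for payload in re.findall(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", text, re.S):
        call = json.loads(payload)
        result = TOOLS[call["name"]](**call["arguments"])
        responses.append(f"<tool_response>\n{json.dumps(result)}\n</tool_response>")
    return responses

out = run_tool_calls(
    '<tool_call>\n{"name": "get_weather", "arguments": {"city": "Tokyo"}}\n</tool_call>'
)
print(out[0])
```

Because the tags are native vocabulary items, no stop-sequence tricks are needed: the harness simply watches for a closing `</tool_call>`.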

Agentic loops:

```
<think>
I need to: 1) Search for the paper 2) Read key sections 3) Summarize
</think>

<act>search_papers("quantum error correction 2024")</act>
<observe>Found: [1] "Threshold-free QEC"...</observe>

<think>
Paper [1] is most relevant. Let me read the methods section.
</think>

<act>read_paper("arxiv:2401.xxxxx", section="methods")</act>
<observe>The approach uses surface codes with...</observe>

Here's my summary: ...
```
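An agent harness serving a Flint-style model needs to detect `<act>` calls, execute them, and append `<observe>` blocks to the context. A minimal parsing sketch (tag names follow the format above; the executor is a stand-in):

```python
import re

ACT_RE = re.compile(r"<act>(.*?)</act>", re.S)

def agent_step(model_output, execute):
    """If the model emitted an <act>...</act>, run it and return the
    <observe> block to append; return None when the turn is finished."""
    match = ACT_RE.search(model_output)
    if match is None:
        return None                            # no action: final answer reached
    observation = execute(match.group(1).strip())
    return f"<observe>{observation}</observe>"

# Stand-in executor mapping one canned action to a canned result.
fake_tools = {'search_papers("quantum error correction 2024")':
              'Found: [1] "Threshold-free QEC"...'}
step = agent_step('<act>search_papers("quantum error correction 2024")</act>',
                  fake_tools.get)
print(step)  # <observe>Found: [1] "Threshold-free QEC"...</observe>
```

The outer loop alternates generate / `agent_step` until `agent_step` returns `None`, at which point the remaining text is the final answer.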

🔄 Checkpointing

  • Save every 200 steps (~35 min of training)
  • Keep last 5 on disk
  • Push to Hub every 1000 steps for disaster recovery
  • Full state: weights + optimizer + RNG + data position
  • Async saving via Orbax (non-blocking)

Designed for Kaggle's session limits — training resumes automatically from latest checkpoint.
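The retention policy above (save every 200 steps, keep the last 5, push every 1,000) reduces to a few predicates. A sketch of that policy logic only; in the real loop Orbax performs the async save itself:

```python
def should_save(step, interval=200):
    """Local checkpoint every `interval` steps (~35 min at this scale)."""
    return step > 0 and step % interval == 0

def should_push_to_hub(step, interval=1000):
    """Hub push every `interval` steps for disaster recovery."""
    return step > 0 and step % interval == 0

def prune_checkpoints(saved_steps, keep=5):
    """Return the checkpoint steps to delete, keeping only the newest `keep`."""
    return sorted(saved_steps)[:-keep]

saved = [200, 400, 600, 800, 1000, 1200, 1400]
print(should_save(1200))         # True
print(should_push_to_hub(1200))  # False
print(prune_checkpoints(saved))  # [200, 400] -> delete, keep the last five
```

Keeping optimizer state, RNG, and the data-stream position in the same checkpoint is what makes resumption bit-exact after a Kaggle session cut.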

📈 Expected Performance

| Benchmark | 17h (6.5B tok) | 60h (23B tok) | SmolLM2-1.7B (11T tok) | Llama-3.2-1B |
|---|---|---|---|---|
| MMLU | 34-38% | 42-47% | 51.7% | 32.2% |
| HellaSwag | 53-58% | 62-66% | 68.7% | 61.6% |
| ARC-Challenge | 38-43% | 48-53% | 57.1% | 41.6% |
| GSM8K (CoT) | 12-18% | 22-30% | 31.0% | 6.5% |
| HumanEval | 15-22% | 25-32% | 23.2% | ~12% |
| Tool-use 🔧 | 45-55% | 60-70% | N/A | N/A |
| Agentic 🤖 | 40-50% | 50-60% | N/A | N/A |

🚀 Quick Start

```bash
# Training (on Kaggle TPU v5e-8)
pip install "jax[tpu]" -f https://storage.googleapis.com/jax-releases/libtpu_releases.html
pip install optax orbax-checkpoint transformers datasets flax

# Run training with auto-resume
python train_flint.py --config configs/flint_60h.yaml

# Resume from checkpoint (automatic)
python train_flint.py --config configs/flint_60h.yaml --resume
```

🗺️ Post-Training Roadmap

| Phase | Method | Goal |
|---|---|---|
| SFT | SmolTalk full | Instruction following |
| GRPO | DeepSeek-R1 recipe | Self-improving reasoning |
| Tool DPO | Correct vs incorrect tool calls | Reliable tool use |
| Context extension | RoPE scaling | 8K → 16K context |
| Quantization | INT4 GPTQ/AWQ | Edge deployment |

📄 Citation

```bibtex
@misc{flint2025,
  title={Flint-1.2B: Reasoning-Native Agentic Small Language Model},
  author={tekkmaven},
  year={2025},
  note={Thought-Action Pretraining (TAP) on Kaggle TPU v5e-8}
}
```

📖 Key References

  • SmolLM2 (arxiv:2502.02737) — Data curriculum & mixing ratios
  • Moonlight/Muon (arxiv:2502.16982) — 2× sample-efficient optimizer
  • 1.5-Pints (arxiv:2408.03506) — Quality-focused pretraining on limited compute
  • DCLM (arxiv:2406.11794) — Classifier-filtered web data
  • OLMo-2 (arxiv:2501.00656) — WSD schedule + mid-training annealing
  • DeepSeek-R1 (arxiv:2501.12948) — Reasoning via <think> tags
  • Quiet-STaR (arxiv:2403.09629) — Internal rationale generation
  • Phi-4-Mini-Reasoning (arxiv:2504.21233) — Distillation-as-midtraining
  • Beyond Chinchilla (arxiv:2401.00448) — Inference-optimal overtraining
  • Overtraining Scaling (arxiv:2403.08540) — Reliable overtrained scaling laws

License

Apache 2.0

Generated by ML Intern

This model repository was generated by ML Intern, an agent for machine learning research and development on the Hugging Face Hub.

Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tekkmaven/flint-1.2B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
```

