Abstract
Mellum 2 is an open-weight 12B-parameter Mixture-of-Experts language model with 2.5B active parameters per token, specialized in software engineering tasks and optimized for inference efficiency on commodity GPUs.
We present Mellum 2, an open-weight 12B-parameter Mixture-of-Experts (MoE) language model with 2.5B active parameters per token. Mellum 2 is a general-purpose language model specialized in software engineering, spanning code generation and editing, debugging, multi-step reasoning, tool use and function calling, agentic coding, and conversational programming assistance. It is the successor to the completion-focused 4B dense Mellum model. The architecture is a Mixture-of-Experts design (64 experts, 8 active) that combines Grouped-Query Attention with 4 KV heads, Sliding Window Attention on three of every four layers, and a single Multi-Token Prediction head that serves as both an auxiliary pre-training objective and a built-in draft model for speculative decoding. Each choice was validated by ablation, with inference efficiency on commodity GPUs as a design constraint. Pre-training spans approximately 10.6 trillion tokens in a three-phase curriculum that progressively shifts the mixture from diverse web data toward curated code and mathematical content. It is optimized with Muon under FP8 hybrid precision and uses a Warmup-Hold-Decay schedule with linear decay to zero. The pre-trained base is extended to a 128K context window via layer-selective YaRN and then post-trained in two stages (supervised fine-tuning followed by RLVR), yielding two released variants: an Instruct model that answers directly and a Thinking model that emits an explicit reasoning trace before its final answer. Across code generation, math and reasoning, tool use, knowledge, and safety benchmarks, Mellum 2 is competitive with open-weight baselines in the 4B-14B range while running at the per-token compute of a 2.5B dense model. We release the Base, Instruct, and Thinking checkpoints under the Apache 2.0 license, together with this report on the architecture decisions, data pipeline, and training recipe behind them.
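Two of the design choices named in the abstract are easy to picture in code: the sliding-window pattern on three of every four layers, and the Warmup-Hold-Decay schedule with linear decay to zero. The Python sketch below is illustrative only and is not taken from the report. The function names, the assumption that the fourth layer of each group is the full-attention one, and the warmup and decay lengths are all hypothetical; the abstract states only the 3-in-4 ratio and that the decay is linear and ends at zero.

```python
def attention_kinds(num_layers: int, period: int = 4) -> list[str]:
    """Label each layer as 'sliding' or 'full' attention.

    Assumption (not stated in the abstract): within every group of
    `period` layers, the last one uses full attention and the others
    use sliding-window attention.
    """
    return [
        "full" if (i + 1) % period == 0 else "sliding"
        for i in range(num_layers)
    ]


def whd_lr(step: int, total_steps: int, peak_lr: float,
           warmup_steps: int, decay_steps: int) -> float:
    """Warmup-Hold-Decay learning rate at a given step.

    Linear warmup to `peak_lr`, a constant hold phase, then a linear
    decay that reaches exactly zero at `total_steps`. The phase
    lengths are hypothetical parameters chosen by the caller.
    """
    if step < warmup_steps:
        # Linear warmup: reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    decay_start = total_steps - decay_steps
    if step < decay_start:
        # Hold phase: constant learning rate.
        return peak_lr
    # Linear decay to zero over the final `decay_steps` steps.
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / decay_steps


if __name__ == "__main__":
    print(attention_kinds(8))
    for s in (0, 99, 500, 900, 1000):
        print(s, whd_lr(s, total_steps=1000, peak_lr=1.0,
                        warmup_steps=100, decay_steps=200))
```

Under these assumptions, an 8-layer stack has 6 sliding-window layers and 2 full-attention layers, and the learning rate stays at its peak for most of training before falling linearly to zero.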
Community
SOTA, open-source small coding-focused model!
Get this paper in your agent:
hf papers read 2605.31268
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
Models citing this paper 10
JetBrains/Mellum2-12B-A2.5B-Instruct