Introducing North Mini Code: Cohere’s First Model For Developers


Published June 9, 2026


Cohere Code Agents Team

coherecode

CohereLabs

All co-authors listed below

Today, we are releasing North Mini Code, a 30B-parameter Mixture-of-Experts model with 3B active parameters and powerful agentic coding capabilities, available on Hugging Face under the Apache 2.0 license.

North Mini Code is the first model in Cohere’s new family of models, and is specifically designed and trained for agentic software engineering tasks.

Figure 1: North Mini Code’s performance in agentic coding tasks and complex code generation benchmarks, compared to leading open-source models of similar size. See the Benchmarking Methodology section for the details of our benchmarking methodology.

North Mini Code is optimized for complex software engineering workflows, terminal-based agentic tasks, and high-quality code generation. On Artificial Analysis’ Coding Index, North Mini Code achieves a score of 33.4, outperforming Qwen3.5 (35B-A3B), Gemma 4 (26B-A4B), Devstral Small 2 (24B Dense), and even substantially larger models such as Nemotron 3 Super (120B-A12B), Mistral Small 4 (119B-A6B), and Devstral 2 (123B).¹ It ranks among the strongest open-source coding models in its size class.

Try North Mini Code in OpenCode

Real-world code agents depend on model quality and robustness across agent harnesses. We trained North Mini Code using multiple scaffolds rather than optimizing for a single one. This approach enables North Mini Code to serve as a reliable foundation for coding agents such as OpenCode.

Architecture

Figure 2: North Mini Code is a Mixture-of-Experts Transformer decoder with interleaved sliding-window self-attention and full self-attention.

North Mini Code is a decoder-only Transformer-based sparse Mixture-of-Experts model. It uses our efficient attention implementation, which interleaves sliding-window attention with RoPE and global attention with no positional embeddings in a 3:1 ratio [1]. The feed-forward block is an MoE block with 128 experts, of which 8 are activated per token. Each expert block is an FFN block with SwiGLU activation. The router applies a sigmoid activation function to the logits before the top-k selection. We also use a single dense layer before the sparse layers.
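The routing step and layer interleaving described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the model's implementation: the article only says that a sigmoid is applied to the router logits before top-k selection and that sliding-window and global layers alternate 3:1. Renormalising the selected gate weights, placing the global layer last in each group of four, and all function names are assumptions.

```python
import math

NUM_EXPERTS = 128  # from the article
TOP_K = 8          # experts activated per token, from the article


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def route_token(logits: list[float], top_k: int = TOP_K) -> list[tuple[int, float]]:
    """Apply a sigmoid to every router logit, then keep the top_k experts.

    Returns (expert_index, gate_weight) pairs, highest score first.
    Renormalising the kept gates to sum to 1 is an assumption.
    """
    scores = [sigmoid(z) for z in logits]
    chosen = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    total = sum(scores[i] for i in chosen)
    return [(i, scores[i] / total) for i in chosen]


def attention_pattern(num_layers: int, ratio: int = 3) -> list[str]:
    """Interleave `ratio` sliding-window (RoPE) layers with one global (NoPE) layer.

    Putting the global layer last in each group is an assumption.
    """
    return [
        "global_nope" if (i + 1) % (ratio + 1) == 0 else "sliding_rope"
        for i in range(num_layers)
    ]


# Toy logits that increase with the expert index, so the last 8 experts win.
logits = [0.01 * i for i in range(NUM_EXPERTS)]
selected = route_token(logits)
print([index for index, _ in selected])
print(attention_pattern(8))
```

Because the sigmoid scores each expert independently, the scores do not compete through a shared softmax denominator; only the top-k cut decides which experts run.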

Post-Training for Coding Excellence

Figure 3: The post-training pipeline is made up of two phases of supervised fine-tuning (SFT) and a phase of agentic reinforcement learning with verifiable rewards (RLVR) targeting software engineering and terminal tasks.

We post-train North Mini Code using two-stage cascaded supervised fine-tuning (SFT) followed by reinforcement learning with verifiable rewards (RLVR), focusing on agentic coding. Our first-stage SFT data focuses on coding capabilities integrated within a wider mix for robustness and usability. The data mix includes programming, reasoning, and instruction following across a large variety of domains. Code datasets correspond to 70% of trainable tokens: 43% agentic tool-use data and 27% single-turn competitive or scientific programming data. In the second SFT stage, we use a 4.5-billion-token data mixture drawn only from agentic and reasoning-driven samples, where code data forms 61% of trainable tokens. This mixture comprises our highest-quality data across coding and wider agentic tasks, where tool calls and completions are verified as executable and correct.
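As a quick check on the proportions quoted above, the arithmetic can be written out directly. The percentages and the 4.5B token count come from the text; reading the 43% and 27% figures as the two parts of the 70% code share is an assumption, and the variable names are illustrative.

```python
# Shares of trainable tokens reported in the article, as integer percentages.
STAGE1_AGENTIC_TOOL_USE_PCT = 43
STAGE1_SINGLE_TURN_CODE_PCT = 27
STAGE1_CODE_PCT = 70

STAGE2_TOTAL_TOKENS = 4_500_000_000
STAGE2_CODE_PCT = 61

# Assumed reading: the two code sub-categories add up to the 70% code share.
assert STAGE1_AGENTIC_TOOL_USE_PCT + STAGE1_SINGLE_TURN_CODE_PCT == STAGE1_CODE_PCT

stage1_non_code_pct = 100 - STAGE1_CODE_PCT                        # 30
stage2_code_tokens = STAGE2_TOTAL_TOKENS * STAGE2_CODE_PCT // 100  # 2_745_000_000
stage2_other_tokens = STAGE2_TOTAL_TOKENS - stage2_code_tokens     # 1_755_000_000

print(stage1_non_code_pct, stage2_code_tokens, stage2_other_tokens)
```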

Our internal data pipeline heavily relies on containerised agentic coding environments. We maintain a disjoint subset of these environments for use in synthetic SFT data generation and RLVR. The majority are based on software engineering tasks from real-world repositories, while the rest are terminal-based agentic tasks sourced from open-source and internal datasets. In total, we used over 70k verifiable tasks across ~5k unique repositories. We deduplicate our environments against the repository sources from SWE-Bench [2] and SWE-Bench-Pro [3] to avoid source leakage during evaluation [4].
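Deduplicating environments against benchmark repository sources can be as simple as a set lookup on normalised repository identifiers. The sketch below is a hypothetical illustration: the `repo` field, the normalisation rules, and the example repository names are assumptions, not details of our pipeline.

```python
def normalise_repo(url_or_slug: str) -> str:
    """Reduce a GitHub URL or 'owner/name' slug to a lowercase 'owner/name'."""
    s = url_or_slug.strip().lower()
    for prefix in ("https://github.com/", "http://github.com/", "github.com/"):
        if s.startswith(prefix):
            s = s[len(prefix):]
            break
    s = s.strip("/")
    if s.endswith(".git"):
        s = s[: -len(".git")]
    return s


def drop_benchmark_repos(environments: list[dict], benchmark_repos: list[str]) -> list[dict]:
    """Keep only environments whose source repository is not used by a benchmark."""
    blocked = {normalise_repo(r) for r in benchmark_repos}
    return [env for env in environments if normalise_repo(env["repo"]) not in blocked]


# Hypothetical data: names are placeholders, not real benchmark contents.
environments = [
    {"task_id": "t1", "repo": "https://github.com/Example-Org/widgets.git"},
    {"task_id": "t2", "repo": "other-org/gadgets"},
]
kept = drop_benchmark_repos(environments, ["example-org/widgets"])
print([env["task_id"] for env in kept])
```

Filtering at the repository level, rather than at the task level, removes every task derived from a benchmark's source repository, including tasks that do not appear in the benchmark itself.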

We used 64K and 128K context lengths for the first and second stages of SFT, respectively. This “long-to-longer” cascade approach (similar to [5, 6]) splits training into two parts: training on valuable shorter data first, which establishes a robust performance baseline, followed by targeted long-context training only on high-quality verified samples. Without multi-stage training, the 20B non-code tokens during the initial training stage often dominated the 1.5B tokens of high-quality code data in later training, producing poorer performance and more behavioral conflicts because data trends differ between stages. Anecdotally, training on a near-complete length distribution of samples produced shorter final trajectories during evaluation than training on a truncated distribution up to 64K only.

Instead of optimising North Mini Code towards quantitative metrics during SFT, we adopted an approach strictly using SFT as priming for RLVR. The data mixture optimises sampling diversity and pass@K (for high K) in downstream stages. We use sample-level filtering to remove any pathologies such as invalid tool calls, erroneous whitespace generation, malformed special tokens, or hallucinated citations. Artifacts or hyperparameters producing undesirable RLVR behaviours (e.g., low entropy, invalid structured generations) were pruned via ablations. The final SFT model achieves 80.2% pass@10 on SWE-Bench Verified [2] and 55.1% pass@10 on Terminal-Bench v2 [7].
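For readers unfamiliar with pass@k, a commonly used unbiased estimator computes, from n sampled attempts of which c are correct, the probability that at least one of k randomly chosen attempts succeeds. The article does not state which estimator produced the numbers above, so the sketch below shows the standard formulation as an assumption.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k).

    n: attempts sampled per task, c: attempts that passed, k: attempt budget.
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # comb(n - c, k) is 0 when fewer than k attempts failed, giving pass@k = 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over tasks, given (n, c) per task."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


print(pass_at_k(n=10, c=5, k=1))
print(benchmark_pass_at_k([(10, 0), (10, 1)], k=10))
```

A high pass@K for large K indicates that correct solutions exist somewhere in the sampling distribution, which is the property RLVR needs in order to find positive reward signal.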

Robustness Across Harnesses

Harness robustness improves model usability in realistic software development settings, where agents encounter diverse and unpredictable tooling environments. These environments differ not just in prompting but in fundamental tool-use modality. For instance, SWE-agent [8] exposes a relatively rich agent-CLI interface with specialized commands (bash, str_replace_editor, and submit tools) and templated observations; mini-SWE-agent [9] strips this down to a single bash tool, with raw stdout from the shell as the only feedback; and OpenCode [10] uses fine-grained, individually typed tools (edit, grep, todowrite, task, etc.) returning structured JSON responses.

Figure 4: To support a variety of agentic coding harnesses, North Mini Code is exposed to several of them during the second SFT stage.

We address cross-harness generalization by introducing a small amount of additional benchmark harness data (6% of the SFT mix, compared to 50% for the chosen SWE-agent harness) during the second SFT stage. Specifically, this data mix yields a 10% gain on the evaluation with the OpenCode harness while maintaining performance with SWE-agent on SWE-Bench Verified, demonstrating that cross-harness transfer can be cheaply acquired without degrading benchmark performance. Notably, North Mini Code achieves 61.0% pass@1 using mini-SWE-agent, where the improvement emerged for free in the cross-task, cross-harness settings, suggesting that harnesses with overlapping tool capabilities share enough representational structure for positive transfer. We also observe minimal data conflict when training on hybrid harness data, indicating that skills required by different harnesses are usually complementary rather than contradictory.

Similarly, the official Terminal-Bench uses its own Terminus 2 harness, where all agent-CLI interactions are communicated via plain-text chat turns (instead of native tool calling). To prime our models on Terminus 2, we include a small amount of data (less than 20%) in a plain-text format in the data mixture, which has proved sufficient for the model to generalise across interaction formats. Interestingly, we also find it crucial to introduce sufficient variation within the harnesses (akin to data augmentation). This forces the model to properly establish the link between instructions and behaviours rather than simply regurgitating a fixed template without understanding, which is especially important when harnesses appear similar to each other.

Asynchronous RL for Agentic Coding

Coding-agent rollouts are long and highly variable in length, with the slowest trajectories routinely an order of magnitude longer than the median. A synchronous RL loop would idle the trainer waiting for those trials to be generated for every batch, so we decouple sampling from learning: a trainer runs alongside a vLLM sidecar that serves rollouts continuously. Policy weights are exported into vLLM every few learner steps (K=4), so the sampler is at most slightly off-policy at any moment. The residual mismatch is then corrected at the loss level.
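The bound on how stale the sampler's weights can be follows directly from the export interval. The sketch below makes the simplifying assumptions that weight export is instantaneous and that staleness is measured when a rollout starts; a long rollout that spans several exports would be more off-policy than this count suggests. Function names are illustrative.

```python
EXPORT_EVERY = 4  # K=4 learner steps between weight exports, from the article


def sampler_version(learner_step: int, export_every: int = EXPORT_EVERY) -> int:
    """Learner step whose weights the sampler is serving (assumes instant export)."""
    return (learner_step // export_every) * export_every


def staleness(learner_step: int, export_every: int = EXPORT_EVERY) -> int:
    """Number of learner updates the sampler is behind at a given step."""
    return learner_step - sampler_version(learner_step, export_every)


lags = [staleness(step) for step in range(8)]
print(lags)       # lag per learner step
print(max(lags))  # never exceeds EXPORT_EVERY - 1 under these assumptions
```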

To keep the learner from blocking on the longest rollouts while avoiding an imbalanced data distribution across tasks, we used a windowed First-in-First-Out (FIFO) queue (trainer↔sampler) [11]: a small fraction at the head of the queue is consumed in completion order to drain stragglers, with the rest staying in input order. Empirically, this recovers most of the throughput of a completion-order scheme without measurably hindering training stability.
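One way to read the windowed FIFO scheme is sketched below: rollouts are tracked in input order, and the consumer may take any finished rollout that lies within the first `window` positions, preferring the one that finished earliest. Rollouts outside the window must wait until they move into it. This is an interpretation for illustration only; the class name, method names, and window size are assumptions rather than the actual trainer-sampler interface.

```python
from typing import Any, Optional


class WindowedFifo:
    """Input-ordered queue whose head window is drained in completion order."""

    def __init__(self, window: int) -> None:
        if window < 1:
            raise ValueError("window must be at least 1")
        self.window = window
        self._order: list[str] = []          # unconsumed rollout ids, input order
        self._finished: dict[str, Any] = {}  # id -> result, in completion order

    def submit(self, rollout_id: str) -> None:
        self._order.append(rollout_id)

    def complete(self, rollout_id: str, result: Any) -> None:
        self._finished[rollout_id] = result

    def pop_ready(self) -> Optional[tuple[str, Any]]:
        """Return the earliest-finished rollout inside the head window, if any."""
        head = set(self._order[: self.window])
        # Dicts preserve insertion order, so this scans in completion order.
        rollout_id = next((r for r in self._finished if r in head), None)
        if rollout_id is None:
            return None
        self._order.remove(rollout_id)
        return rollout_id, self._finished.pop(rollout_id)


queue = WindowedFifo(window=2)
for name in ["r0", "r1", "r2", "r3", "r4"]:
    queue.submit(name)

queue.complete("r4", "late-position result")  # finished, but outside the window
queue.complete("r1", "result 1")              # finished and inside the window
first = queue.pop_ready()   # r1 is taken ahead of the straggler r0
second = queue.pop_ready()  # window is now {r0, r2}; neither is finished
print(first, second)
```

With `window=1` this reduces to a strict FIFO, and with a window covering the whole queue it becomes a pure completion-order scheme, so the window size trades throughput against preserving the input-order task distribution.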

We train using CISPO [12], a log-likelihood objective with token-level importance sampling correction. Unlike PPO and GRPO, CISPO applies the importance weight to a log-likelihood term rather than optimising a probability ratio directly, and it enhances RLOO [13] with stronger regularization. We aggregate the loss at the token level rather than the prompt level, so the gradient signal scales with trajectory length and long agentic traces (where most of the credit-assignment signal lives) are not down-weighted relative to short ones.
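A minimal numeric sketch of such an objective follows, in plain Python so that it runs without an autodiff library. It computes the loss value only; in a real implementation the clipped importance weight would be wrapped in a stop-gradient so that gradients flow solely through the log-likelihood term. The leave-one-out advantage, the clipping bounds, and all names are assumptions for illustration, not our training configuration.

```python
import math


def rloo_advantages(rewards: list[float]) -> list[float]:
    """Leave-one-out baseline: each reward minus the mean of the others in its group."""
    n, total = len(rewards), sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]


def clipped_is_weight(logp_new: float, logp_old: float,
                      eps_low: float, eps_high: float) -> float:
    """Token-level importance ratio, clipped to [1 - eps_low, 1 + eps_high]."""
    ratio = math.exp(logp_new - logp_old)
    return min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)


def cispo_style_loss(trajectories: list[list[tuple[float, float]]],
                     advantages: list[float],
                     eps_low: float = 1.0,     # hypothetical bound
                     eps_high: float = 0.2) -> float:  # hypothetical bound
    """Token-level mean of -weight * advantage * log-prob.

    trajectories[i] is a list of (logp_new, logp_old) pairs, one per token.
    The weight is treated as a constant (stop-gradient) by an autodiff system.
    """
    total, num_tokens = 0.0, 0
    for tokens, advantage in zip(trajectories, advantages):
        for logp_new, logp_old in tokens:
            weight = clipped_is_weight(logp_new, logp_old, eps_low, eps_high)
            total += -weight * advantage * logp_new
            num_tokens += 1
    # Dividing by the total token count (not per trajectory, then per prompt)
    # lets long trajectories contribute in proportion to their length.
    return total / num_tokens


advantages = rloo_advantages([1.0, 0.0, 0.0, 0.0])
on_policy = [[(-0.5, -0.5), (-1.0, -1.0)], [(-2.0, -2.0)]]
loss = cispo_style_loss(on_policy, advantages=[1.0, -1.0])
print(advantages, loss)
```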

A single multi-environment RL training run – We run a single multi-environment online RL training run spanning two task environments: terminal-based tasks and software engineering tasks. Each training batch consists of 512 rollouts with a group size of 8 rollouts sampled per prompt. All rollouts share a global context window of 128K tokens. To account for differing task complexity, each task is assigned a distinct agentic-step budget. These per-task budgets were set based on pass@k filtering performed prior to RLVR, ensuring the budgets are appropriately calibrated to the difficulty of each task distribution. We observe that granting the model a turn budget substantially larger than necessary encourages unnecessary verbosity and hoppiness in its rollouts.
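The batch shape implied by these numbers is 512 / 8 = 64 prompts per batch, each sampled 8 times. The sketch below expands a list of prompts into rollout requests with a per-task step budget. The budget values, the task-type labels, and the `RolloutRequest` structure are hypothetical; the article does not report the actual budgets.

```python
from dataclasses import dataclass

ROLLOUTS_PER_BATCH = 512  # from the article
GROUP_SIZE = 8            # rollouts sampled per prompt, from the article
PROMPTS_PER_BATCH = ROLLOUTS_PER_BATCH // GROUP_SIZE  # 64

# Hypothetical agentic-step budgets per task type (not reported in the article).
STEP_BUDGET = {"terminal": 50, "swe": 100}


@dataclass(frozen=True)
class RolloutRequest:
    prompt_id: str
    task_type: str
    sample_index: int
    max_steps: int


def build_batch(prompts: list[tuple[str, str]]) -> list[RolloutRequest]:
    """Expand (prompt_id, task_type) pairs into GROUP_SIZE rollout requests each."""
    if len(prompts) != PROMPTS_PER_BATCH:
        raise ValueError(f"expected {PROMPTS_PER_BATCH} prompts, got {len(prompts)}")
    return [
        RolloutRequest(prompt_id, task_type, i, STEP_BUDGET[task_type])
        for prompt_id, task_type in prompts
        for i in range(GROUP_SIZE)
    ]


prompts = [(f"term-{i}", "terminal") for i in range(32)] + [(f"swe-{i}", "swe") for i in range(32)]
batch = build_batch(prompts)
print(PROMPTS_PER_BATCH, len(batch))
```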

For terminal-based tasks, we configure the agent with a simple ReAct harness employing a single terminal-use tool based on Harbor’s Tmux session implementation [14], whereas for SWE tasks, we employ the SWE-agent [8] harness. Both environments provide the agent with a pre-built Docker image encoding the environment state, a natural language user prompt, and a set of unit tests used for verification. We train on a combination of internal and open-source datasets, filtered to retain only problems with an acceptable pass@k rate, i.e., excluding trivially solved and completely unsolvable instances. We use binary rewards derived from the unit-test-based verifier. In addition, the model receives a reward of 0 for generating invalid tool calls or unparseable outputs, enabling a sharp drop in the rate of hallucinated or malformed tool calls within the first training steps.
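The reward rule and the task filter described above can be sketched as follows. Two details are assumptions: that any invalid tool call or unparseable output zeroes the reward for the whole trajectory, and that "acceptable" means a pass rate strictly between the two thresholds shown. The task names and rates are made up for the example.

```python
def trajectory_reward(tests_passed: bool, has_invalid_output: bool) -> float:
    """Binary reward from the unit-test verifier, forced to 0 on invalid output."""
    if has_invalid_output:
        return 0.0
    return 1.0 if tests_passed else 0.0


def filter_tasks(pass_rates: dict[str, float],
                 low: float = 0.0, high: float = 1.0) -> list[str]:
    """Keep tasks that are neither never solved (<= low) nor always solved (>= high)."""
    return [task for task, rate in pass_rates.items() if low < rate < high]


# Hypothetical pre-RLVR pass rates per task.
pass_rates = {"task-a": 0.0, "task-b": 0.25, "task-c": 1.0, "task-d": 0.875}
print(filter_tasks(pass_rates))
print(trajectory_reward(tests_passed=True, has_invalid_output=True))
```

Tasks at either extreme give every rollout in a group the same reward, so a group-relative advantage is zero and the task contributes no learning signal; filtering them avoids spending rollouts on such prompts.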

Figure 5: The multi-environment RL training run improves model performance on benchmarks like SWE-Bench Verified and Terminal-Bench v2. Learning curves are displayed on the left across the RLVR training process.

Higher performance and robustness with online RL – RLVR training improved the performance of the final model from the SFT initialization by 7.9% (absolute) pass@1 in Terminal-Bench v2 and 3.0% (absolute) in SWE-Bench. We observe that joint training across both environments yields stronger results than training on each separately, and also generalizes better to out-of-distribution tasks. Beyond correctness scores, we observe significant improvements in agent robustness where the RLVR model produces shorter trajectories and fewer invalid or failing tool calls. The final model also exhibits less repetitive tool-call looping, reliably concluding its trajectory by submitting a solution or responding to the user.

Internal Human Evaluation Benchmark

Complementary to existing coding benchmarks, we also developed our own internal benchmark suite to measure model performance on out-of-distribution problems in pairwise evaluation with human annotators. In line with other benchmark setups, we evaluated the iterations of our models harnessed in OpenCode through Harbor. To understand model performance, we benchmark on four distinct functionalities:

Code Explanation: Models are asked to explain particular technical aspects of a given code repository within a README file, or directly to the user.
Code Editing: Models are tasked to implement a feature based on an existing code base.
Data Visualization: Given data samples, models are tasked to create certain visualizations with a particular framework; no additional code is given.
Implementation from Scratch: Given only design specifications and the packages to use, models are tasked to create a project from scratch, focused primarily on front-end design.

Evaluators are provided with rubric-based scoring questions to help them assess individual response criteria and rate individual attempts first, before giving a final preference rating between the two model trajectories.² We share evaluation results of North Mini Code, comparing the SFT checkpoint with the final model release checkpoint.

Figure 6: Pairwise preference results for human evaluation comparing the final North Mini Code checkpoint after RLVR against the SFT-only checkpoint across 85 samples.

Our evaluations show that RLVR especially improves model performance on code editing tasks, resulting in an aggregate win rate of 66.1% across subsets for the final model against its SFT-only counterpart.

Get Started

North Mini Code models are available in OpenCode, through the Cohere API, and on Hugging Face with BF16 and FP8 (quantized) weights.

Extended Author List

Code Agents Team and North Mini Code Group:
Jay Alammar, Sophia Althammer, Dennis Aumiller, Leon Engländer, Yannis Flet-Berliac, Eden Gilbert, Sarra Habchi, Kylie He, Dhruti Joshi, Jozef Mokrý, David Mora, Josh Netto-Rosen, Deniz Qian, Lawrence Rodgers, Willem Röpke, Tom Sherborne, Ahmet Üstün, Minjie Xu

Pre-training and Inference Team:
Diana Abagyan, Sammie Bae, Björn Bebensee, Walter Beller-Morales, Sepideh Shaterian Bidgoli, Bas Büller, David Cairuz, Kris Cao, Roman Castagné, Giannis Chatziveroglou, Tim Chung, Felipe Cruz, Rishit Dholakia, Ali Edalati, Nikolas Gritsch, Kilian Haefeli, Prashant Kumar, Simon Lehnerer, Tony Liu, Alex McKinney, Ekagra Ranjan, Dev Shah, Zewen Shen, Sylvie Shi, Dwarak Talupuru, Komal Teru, Robin Vaaler, Bharat Venkitesh, Donglu Wang, Terrence Zhao, Leo Zhou, Conway Zhu

Management and Leadership:
Phil Blunsom, Nick Frosst, Aidan Gomez, Manoj Govindassamy, Nick Jakobi, Patrick Lewis, Acyr Locatelli, Joelle Pineau, Ivan Zhang

Benchmarking Methodology

Our core agentic capabilities are measured using SWE-Bench Verified, SWE-Bench Pro, Terminal-Bench v2, and Terminal-Bench Hard. North Mini Code was evaluated using the SWE-agent harness v1.1.0 for SWE-Bench, and a simple ReAct harness employing a single terminal-use tool based on Harbor’s Tmux session implementation for Terminal-Bench v2. For Terminal-Bench Hard, we directly used Terminus 2, following the same methodology as the Artificial Analysis Intelligence Index to compare North Mini Code with the other models. We follow the benchmarks’ official timeout and hardware resource limit settings wherever specified. We additionally track code generation capabilities in SciCode [15], which measures coding performance for scientific problems, and LiveCodeBench v6 [16], which requires strong algorithmic reasoning capabilities for coding performance outside of tool use. We run each benchmark with 3 different seeds and report the average benchmark performance, using temperature=1.0 and top_p=0.95.
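The reporting protocol amounts to averaging one score per seed under fixed sampling parameters. In the sketch below, the sampling values come from the text, while the seed values, the `run_benchmark` callable, and the example scores are hypothetical stand-ins for an actual evaluation harness.

```python
from statistics import mean
from typing import Callable

SAMPLING = {"temperature": 1.0, "top_p": 0.95}  # from the article
SEEDS = (0, 1, 2)                               # three seeds; the values are hypothetical


def average_over_seeds(run_benchmark: Callable[..., float]) -> float:
    """Run the benchmark once per seed and report the mean score."""
    return mean(run_benchmark(seed=seed, **SAMPLING) for seed in SEEDS)


def fake_benchmark(seed: int, temperature: float, top_p: float) -> float:
    """Stand-in for a real evaluation; returns made-up scores per seed."""
    return {0: 60.0, 1: 61.0, 2: 62.0}[seed]


print(average_over_seeds(fake_benchmark))
```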

Competitor results – We used publicly reported scores for competitor models, either from original reports or the Artificial Analysis Intelligence Index, where available. Additionally, Gemma 4’s scores for agentic coding tasks were reported by the Qwen team [17]. For benchmark results missing from public reports, denoted by (*) in Figure 1, we ran the evaluations internally using the recommended model configuration.

Citation

@misc{cohere_north_code_mini,
    title = {Introducing {North Mini Code}: Cohere's First Model For Developers},
    url = {cohere.com/blog/north-mini-code},
    author = {{Team Cohere}},
    month = {June},
    year = {2026}
}

References

[1] RoPE to NoPE and Back Again: A New Hybrid Attention Strategy

[2] SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

[3] SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?

[4] On Leakage of Code Generation Evaluation Datasets

[5] Nemotron-Cascade: Scaling Cascaded Reinforcement Learning for General-Purpose Reasoning Models

[6] Nemotron-Cascade 2: Post-Training LLMs with Cascade RL and Multi-Domain On-Policy Distillation

[7] Terminal-Bench: A Benchmark for AI Agents in Terminal Environments

[8] SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering

[9] https://github.com/SWE-agent/mini-swe-agent

[10] https://github.com/anomalyco/opencode

[11] Forge: Scalable Agent RL Framework and Algorithm

[12] MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention

[13] Back to Basics: Revisiting REINFORCE-Style Optimization for Learning from Human Feedback in LLMs

[14] Harbor: A Framework for Evaluating and Optimizing Agents and Models in Container Environments

[15] SciCode: A Research Coding Benchmark Curated by Scientists

[16] LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

[17] Qwen3.6-35B-A3B: Agentic Coding Power, Now Open to All

Footnotes

1. The AAII Coding Index includes Terminal-Bench Hard as an agentic coding task and SciCode as a code generation benchmark for scientific problems.

2. Both individual ratings and preferences are assessed on a five-point Likert scale.

Community

Svngoku

Jun 9

Congrats !!!

yiakwy-xpu-team

Jun 12

@coherecode Thank you for sharing! In our internal test DSV4 is really good as baseline , so it will be good that we could sft/RL over DSV4 (we have ran though whole pipeline in week 0 support, and it will take 2 days to finish trainining over 10k+ tool calls/coding corpus) ?

I wonder if dsv4+ coding corpus can generate new record in coding (for agents).

kishl

Jun 25

this model is too good! Trying it out on opencode and it kills! it's super fast too.
