North-Mini-Code-1.0 — GGUF

GGUF quantizations of CohereLabs/North-Mini-Code-1.0, a sparse MoE code model by Cohere with 30.5B total and ~2.9B active parameters. The cohere2moe architecture combines Command-R7B-style hybrid sliding-window/full attention (NoPE on the global layers), parallel residual blocks, and 128 fine-grained experts with sigmoid top-8 routing; the chat format reasons by default. The model was trained with SFT followed by RL with verifiable rewards and is aimed at agentic coding and terminal/tool-use work; see the release blog post.

Status / requirements: these files need a llama.cpp build with cohere2moe support, which currently lives in PR #24260 (not yet merged); build from that PR's branch until it lands. The original weights are released under Apache 2.0, and these files inherit that license.

Quants

All quality numbers use the bf16 model as ground truth: mean KLD and top-1 % are computed against its output distribution, and its PPL is the baseline for the PPL column. The headline table uses wikitext-2 (test), the only evaluation set that is fully held out from the imatrix calibration data, plus HumanEval/HumanEval+ (pass@1, greedy, thinking on, 6k-token budget).

| file | size | PPL | mean KLD | top-1 % | HumanEval | HumanEval+ |
|---|---|---|---|---|---|---|
| BF16 (2 shards) | 61.0 GB | 7.7126 | | | | |
| Q8_0 | 32.4 GB | 7.7356 | 0.007010 | 96.458 | 92.07 | 89.02 |
| Q6_K | 25.1 GB | 7.7558 | 0.015611 | 94.602 | 93.29 | 88.41 |
| Q5_K_M | 21.7 GB | 7.8333 | 0.020963 | 93.811 | 95.73 | 92.68 |
| Q4_K_M | 18.6 GB | 7.9468 | 0.041855 | 91.342 | 93.29 | 90.24 |
| IQ4_XS | 16.4 GB | 7.9794 | 0.049137 | 90.705 | 92.68 | 88.41 |
| IQ3_M | 13.6 GB | 8.2776 | 0.112035 | 85.919 | 90.85 | 87.20 |
| IQ2_M | 10.3 GB | 9.9756 | 0.283656 | 77.616 | 84.15 | 79.88 |
| IQ2_XS | 9.2 GB | 11.0666 | 0.426120 | 73.339 | 79.88 | 77.44 |
| IQ2_XXS | 8.3 GB | 12.6780 | 0.549859 | 69.743 | 59.15 | 59.15 |
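For readers unfamiliar with the two distribution metrics, here is a minimal sketch of how mean KLD and top-1 agreement can be computed from per-position logits. It assumes the columns follow the usual definitions (KL divergence from the bf16 distribution to the quantized one, and the share of positions where both models pick the same most likely token); the function names are illustrative and are not taken from llama.cpp.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one position's logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def compare_positions(ref_logits, quant_logits):
    """Return (mean KLD, top-1 agreement in %) over aligned positions.

    ref_logits / quant_logits: lists of per-position logit lists,
    reference (bf16) model first, quantized model second.
    """
    kld_sum = 0.0
    agree = 0
    for ref, quant in zip(ref_logits, quant_logits):
        lp = log_softmax(ref)
        lq = log_softmax(quant)
        # KL(P_ref || Q_quant) = sum_i p_i * (log p_i - log q_i)
        kld_sum += sum(math.exp(a) * (a - b) for a, b in zip(lp, lq))
        # top-1 agreement: both models rank the same token first
        agree += ref.index(max(ref)) == quant.index(max(quant))
    n = len(ref_logits)
    return kld_sum / n, 100.0 * agree / n
```

Identical logits give a KLD of 0 and 100 % agreement; any reordering of the top token lowers agreement even when the KLD stays small, which is why the two columns are reported side by side.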

HumanEval is pass@1 over 164 problems, so single-token greedy flips on a handful of problems move the score by a few points. Read it as a sanity check, not a fine-grained ranking. The Q4-through-Q8 quants are statistically interchangeable on it (the spread is noise); mean KLD and top-1 % give the reliable quality ordering. The slope only becomes clear lower down: IQ3_M holds up, the IQ2 tier degrades visibly, and IQ2_XXS falls off a cliff. Its identical HumanEval and HumanEval+ scores are the giveaway: it produces enough malformed code that the extra tests prune almost nothing further.
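To put a rough number on that noise, the sketch below computes the score step per problem and a binomial standard error for a pass@1 score over 164 problems. Treating problems as independent trials is a simplifying assumption (the quants are evaluated on the same problems, so a paired test would be stricter); the function name is illustrative.

```python
import math

N_PROBLEMS = 164  # HumanEval problem count, as stated in the text

def pass_at_1_noise(score_pct, n=N_PROBLEMS):
    """Return (points per problem, standard error, 95% half-width), all in points."""
    p = score_pct / 100.0
    step = 100.0 / n                            # one problem flipping
    se = 100.0 * math.sqrt(p * (1.0 - p) / n)   # binomial standard error
    return step, se, 1.96 * se                  # normal-approximation interval

step, se, half_width = pass_at_1_noise(93.29)
print(f"step={step:.2f} se={se:.2f} 95% half-width={half_width:.2f}")
```

Under this approximation the standard error at a score around 93 is about 2 points, while the gap between Q8_0 (92.07) and Q5_K_M (95.73) is 3.66 points, which corresponds to six problems.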

Recommendations: Q5_K_M if you have the memory (effectively lossless on these metrics), IQ4_XS for the best size/quality ratio (close to Q4_K_M while 2.2 GB smaller), and IQ3_M as the smallest quant that is still reasonable for code. The IQ2 tier exists for memory-constrained setups and degrades noticeably; use it with expectations set accordingly. The token embeddings are tied with the output head and are kept at q6_K on Q4_K_M and below.

Per-domain breakdown

The three sets below are also part of the imatrix calibration corpus, so their numbers carry a mild in-distribution bias - read them as domain comparisons rather than held-out scores. All corpora are included in eval-corpora.tar.zst for reproduction.

General / multilingual (calibration_datav3)

bartowski's calibration_datav3: the de-facto community calibration mix - short English prose, multilingual snippets, code fragments, technical text and deliberate noise sections (~275 kB).

| file | PPL | mean KLD | top-1 % |
|---|---|---|---|
| BF16 | 9.0079 | | |
| Q8_0 | 9.0261 | 0.008424 | 96.788 |
| Q6_K | 9.0351 | 0.014500 | 95.286 |
| Q5_K_M | 9.0491 | 0.019470 | 94.506 |
| Q4_K_M | 9.1607 | 0.036786 | 92.031 |
| IQ4_XS | 9.1125 | 0.039540 | 91.882 |
| IQ3_M | 9.4710 | 0.087992 | 87.714 |
| IQ2_M | 10.2735 | 0.208782 | 80.580 |
| IQ2_XS | 11.1268 | 0.319906 | 76.376 |
| IQ2_XXS | 12.3083 | 0.427367 | 72.173 |

Code

A seeded random sample of real source files from the llama.cpp tree (MIT): C/C++ core and ggml, Python conversion tooling, shell scripts; capped at 25 kB per file, ~400 kB total. Note how confident the model is on code (PPL ~2.4) - and that top-1 agreement holds up better here than on prose at every quant level.

| file | PPL | mean KLD | top-1 % |
|---|---|---|---|
| BF16 | 2.4043 | | |
| Q8_0 | 2.4108 | 0.005231 | 98.512 |
| Q6_K | 2.4123 | 0.008321 | 97.731 |
| Q5_K_M | 2.4155 | 0.012198 | 97.145 |
| Q4_K_M | 2.4314 | 0.025947 | 95.898 |
| IQ4_XS | 2.4452 | 0.030205 | 95.472 |
| IQ3_M | 2.4996 | 0.072891 | 92.991 |
| IQ2_M | 2.7561 | 0.186894 | 88.646 |
| IQ2_XS | 3.0247 | 0.290555 | 85.260 |
| IQ2_XXS | 3.2342 | 0.368478 | 83.263 |
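A code corpus like the one described above (seeded random sample, 25 kB per-file cap, ~400 kB total) could be assembled along the following lines. This is a sketch only: the suffix list, the seed value, and the function name are assumptions, and the actual corpus is the one shipped in eval-corpora.tar.zst.

```python
import random
from pathlib import Path

def sample_corpus(root, suffixes=(".c", ".cpp", ".h", ".py", ".sh"),
                  per_file_cap=25_000, total_cap=400_000, seed=0):
    """Concatenate a seeded random sample of source files under `root`."""
    # Sort first so the shuffle depends only on the seed, not on directory order.
    files = sorted(p for p in Path(root).rglob("*")
                   if p.is_file() and p.suffix in suffixes)
    random.Random(seed).shuffle(files)
    parts, total = [], 0
    for path in files:
        data = path.read_bytes()[:per_file_cap]  # cap each file's contribution
        if total + len(data) > total_cap:        # stop once the budget is used up
            break
        parts.append(data)
        total += len(data)
    return b"".join(parts)
```

Calling `sample_corpus("path/to/llama.cpp")` twice with the same seed returns the same bytes, which is what makes a sampled corpus reproducible.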

Chat (model-native format)

Hand-written for this release: 13 short programming conversations (Python/SQL/C/Rust/git topics, two in German), each with a thinking block, plus one complete tool-call round trip - rendered in the model's raw turn-token dialect (<|START_OF_TURN_TOKEN|>, <|START_THINKING|>, <|START_ACTION|>, ...). This exercises the control-token and expert-routing paths that real chat traffic hits and plain text never does. Small set (~7 chunks) - treat the numbers as indicative.

| file | PPL | mean KLD | top-1 % |
|---|---|---|---|
| BF16 | 1.9660 | | |
| Q8_0 | 1.9866 | 0.022651 | 98.431 |
| Q6_K | 1.9906 | 0.031189 | 98.170 |
| Q5_K_M | 1.9820 | 0.025972 | 97.778 |
| Q4_K_M | 1.9641 | 0.070232 | 96.993 |
| IQ4_XS | 1.9866 | 0.058722 | 96.601 |
| IQ3_M | 2.0809 | 0.081966 | 94.902 |
| IQ2_M | 2.1412 | 0.173477 | 92.288 |
| IQ2_XS | 2.1742 | 0.251918 | 89.412 |
| IQ2_XXS | 2.2247 | 0.297151 | 87.974 |

Reasoning / chat template

These GGUFs embed a chat template that has been normalized additively (also in this repo as chat_template.jinja): the standard enable_thinking / reasoning_content conventions are mapped onto Cohere's native reasoning / reasoning_effort / thinking variables without removing anything. As a result, llama.cpp detects reasoning support automatically (thinking = 1), separates reasoning_content from content, and supports thinking toggles. All Cohere-native variables keep working, and rendering is byte-identical for native invocations.

llama-server -m North-Mini-Code-1.0-Q5_K_M.gguf --jinja
  • thinking on (default): response arrives as reasoning_content + content
  • disable thinking per request: "chat_template_kwargs": {"enable_thinking": false} (or Cohere-native: {"reasoning_effort": "none"})
  • tool calling works through the OpenAI-compatible API (parallel calls included)
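As an illustration of the per-request toggle, here is a minimal client sketch using only the Python standard library. It assumes llama-server is running locally on its default port 8080 and serving the OpenAI-compatible /v1/chat/completions route; the helper names are hypothetical.

```python
import json
import urllib.request

def build_payload(prompt, thinking=True):
    """Build a chat request body; thinking is on by default, so only 'off' needs a flag."""
    payload = {"messages": [{"role": "user", "content": prompt}]}
    if not thinking:
        payload["chat_template_kwargs"] = {"enable_thinking": False}
    return payload

def post_chat(payload, base_url="http://localhost:8080"):
    """POST the payload and return (reasoning_content, content)."""
    req = urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        message = json.load(resp)["choices"][0]["message"]
    # reasoning_content is expected to be absent or empty when thinking is disabled
    return message.get("reasoning_content"), message.get("content")
```

With a server running, `post_chat(build_payload("Reverse a string in C", thinking=False))` should return only the final answer, while the default call returns the reasoning and the answer separately.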

imatrix

North-Mini-Code-1.0.imatrix (included) was computed on the bf16 model over the v3 + code + chat mix described above (326 chunks of 512 tokens) and reaches full coverage of all 128 experts in every layer.

Validation

  • f32 logit-level parity vs HF transformers on a truncated-expert variant of the checkpoint (full-vocab comparison at every position): top-1 agreement 26/27, mean |dlogprob| 0.012 - the only disagreement a 0.013 near-tie.
  • Tool calling, parallel calls, multi-turn with reasoning passback, and a live agentic tool-execution loop verified end to end via llama-server.
  • The official model card states 256K input / 64K output context; the config's max_position_embeddings is 500k. KV cache at long context stays small thanks to iSWA (only 13 of 49 layers are global; ~13.6 GB KV at 500k).
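The KV-cache claim can be sanity-checked with back-of-the-envelope arithmetic. The sketch below takes the 13 global and 49 total layers from the text; the per-layer KV width (512 values each for K and V), the 4096-token sliding window, and the f16 cache are assumptions chosen for illustration and are not confirmed by the source.

```python
def kv_cache_bytes(n_ctx, n_global=13, n_layers=49,
                   kv_width=512, window=4096, bytes_per_elem=2):
    """Estimate KV cache size for interleaved sliding-window attention (iSWA).

    kv_width (n_kv_heads * head_dim) and window are ASSUMED values.
    """
    n_swa = n_layers - n_global
    per_token = 2 * kv_width * bytes_per_elem             # K and V for one layer
    global_bytes = n_global * n_ctx * per_token           # grows with context
    swa_bytes = n_swa * min(n_ctx, window) * per_token    # bounded by the window
    return global_bytes + swa_bytes

iswa = kv_cache_bytes(500_000)
# Same model with every layer treated as global, for comparison
all_global = kv_cache_bytes(500_000, n_global=49)
print(f"iSWA: {iswa / 1e9:.1f} GB, all-global: {all_global / 1e9:.1f} GB")
```

Under these assumed dimensions the estimate lands near the quoted ~13.6 GB, against roughly 50 GB if all 49 layers cached the full context; the sliding-window layers contribute only about 0.3 GB regardless of context length.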