North-Mini-Code-1.0 GGUF

GGUF quantizations of CohereLabs/North-Mini-Code-1.0 (30B-A3B MoE, cohere2_moe architecture).

Requires speedy-llama until cohere2_moe support merges into upstream llama.cpp (PR #24260). These files use the standard GGUF keys and are expected to load on upstream once that PR lands.

| File | Size | Wikitext-2 PPL | Same top token as bf16 | 24GB card |
|---|---|---|---|---|
| Q4_K_M | 18.6 GB | 8.34 (+3.2% vs bf16) | 90.4% (mean KLD 0.049) | ~230 tok/s, fully offloaded |
| Q5_K_M | 21.7 GB | 8.19 (+1.3% vs bf16) | 93.4% (mean KLD 0.023) | ~211 tok/s, fully offloaded (8K ctx) |

bf16 baseline PPL is 8.09. Perplexity was measured over 64 chunks of 512 tokens from the wikitext-2-raw test set; KL divergence was computed against bf16 logits on the same tokens.
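
For readers unfamiliar with the three metrics in the table, here is a minimal sketch of how perplexity, mean KL divergence, and top-token agreement can be computed from per-position logits. It is not the tooling used to produce the numbers above. The function names (`log_softmax`, `compare_to_baseline`) are hypothetical, and the KL direction, KL(bf16 || quantized), is an assumption about the convention used.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one position's logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def argmax(values):
    """Index of the largest value (first index wins on ties)."""
    return max(range(len(values)), key=values.__getitem__)

def compare_to_baseline(base_logits, quant_logits, targets):
    """Return (quant_ppl, mean_kld, top1_agreement).

    base_logits / quant_logits: one list of vocabulary logits per position,
    from the baseline (e.g. bf16) and the quantized model respectively.
    targets: the actual next-token id at each position.
    """
    nll = 0.0      # summed negative log-likelihood of the quantized model
    kld = 0.0      # summed KL(baseline || quantized), assumed direction
    same_top = 0   # positions where both models pick the same top token
    for base, quant, target in zip(base_logits, quant_logits, targets):
        log_p = log_softmax(base)
        log_q = log_softmax(quant)
        nll -= log_q[target]
        kld += sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
        same_top += argmax(base) == argmax(quant)
    n = len(targets)
    return math.exp(nll / n), kld / n, same_top / n
```

Perplexity is the exponential of the mean negative log-likelihood, so a model that is uniformly unsure between two tokens scores exactly 2. Mean KLD is zero only when the quantized distribution matches the baseline at every position, which makes it more sensitive to quantization damage than perplexity alone.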

llama-cli -m north-mini-code-Q4_K_M.gguf --jinja -ngl 99 --temp 1.0 --top-p 0.95

Sampling params per the model card. Converted from the bf16 safetensors release; chat template embedded.
