Instructions to use Arki05/North-Mini-Code-1.0-GGUF with libraries, inference providers, notebooks, and local apps.

Libraries

How to use Arki05/North-Mini-Code-1.0-GGUF with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

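llm = Llama.from_pretrained(
	repo_id="Arki05/North-Mini-Code-1.0-GGUF",
	filename="North-Mini-Code-1.0-BF16-00001-of-00002.gguf",
	# The BF16 weights are split into two shards, so the second one must be fetched too.
	additional_files=["North-Mini-Code-1.0-BF16-00002-of-00002.gguf"],
)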

llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": "What is the capital of France?"
		}
	]
)

Local Apps

llama.cpp

How to use Arki05/North-Mini-Code-1.0-GGUF with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf Arki05/North-Mini-Code-1.0-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf Arki05/North-Mini-Code-1.0-GGUF:Q4_K_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf Arki05/North-Mini-Code-1.0-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf Arki05/North-Mini-Code-1.0-GGUF:Q4_K_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf Arki05/North-Mini-Code-1.0-GGUF:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf Arki05/North-Mini-Code-1.0-GGUF:Q4_K_M

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf Arki05/North-Mini-Code-1.0-GGUF:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf Arki05/North-Mini-Code-1.0-GGUF:Q4_K_M

Use Docker

docker model run hf.co/Arki05/North-Mini-Code-1.0-GGUF:Q4_K_M

vLLM

How to use Arki05/North-Mini-Code-1.0-GGUF with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "Arki05/North-Mini-Code-1.0-GGUF"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Arki05/North-Mini-Code-1.0-GGUF",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Ollama
How to use Arki05/North-Mini-Code-1.0-GGUF with Ollama:
```
ollama run hf.co/Arki05/North-Mini-Code-1.0-GGUF:Q4_K_M
```

Unsloth Studio

How to use Arki05/North-Mini-Code-1.0-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for Arki05/North-Mini-Code-1.0-GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for Arki05/North-Mini-Code-1.0-GGUF to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for Arki05/North-Mini-Code-1.0-GGUF to start chatting

Pi

How to use Arki05/North-Mini-Code-1.0-GGUF with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf Arki05/North-Mini-Code-1.0-GGUF:Q4_K_M

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "Arki05/North-Mini-Code-1.0-GGUF:Q4_K_M"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use Arki05/North-Mini-Code-1.0-GGUF with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf Arki05/North-Mini-Code-1.0-GGUF:Q4_K_M

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default Arki05/North-Mini-Code-1.0-GGUF:Q4_K_M

Run Hermes

hermes

Docker Model Runner
How to use Arki05/North-Mini-Code-1.0-GGUF with Docker Model Runner:
```
docker model run hf.co/Arki05/North-Mini-Code-1.0-GGUF:Q4_K_M
```

Lemonade

How to use Arki05/North-Mini-Code-1.0-GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull Arki05/North-Mini-Code-1.0-GGUF:Q4_K_M

Run and chat with the model

lemonade run user.North-Mini-Code-1.0-GGUF-Q4_K_M

List all available models

lemonade list

North-Mini-Code-1.0 — GGUF

GGUF quantizations of CohereLabs/North-Mini-Code-1.0, a sparse MoE code model by Cohere with 30.5B total and ~2.9B active parameters. It uses the cohere2moe architecture: Command-R7B-style hybrid SWA/full attention with NoPE on global layers, parallel residual blocks, 128 fine-grained experts with sigmoid top-8 routing, and a reasoning-by-default chat format. The model was trained with SFT followed by RL with verifiable rewards and is aimed at agentic coding and terminal/tool-use work; see the release blog post.

Status / requirements: these files need a llama.cpp build with cohere2moe support, which currently lives in PR #24260 (not yet merged); build that branch until it lands. Weights are released under Apache 2.0, and these files inherit that license.

Quants

All quality numbers are measured against the bf16 model as ground truth. The headline table uses wikitext-2 (test) — the only evaluation set that is fully held out from the imatrix calibration data — plus HumanEval/HumanEval+ (pass@1, greedy, thinking on, 6k token budget).

| file | size | PPL | mean KLD | top-1 % | HumanEval | HumanEval+ |
|---|---|---|---|---|---|---|
| BF16 (2 shards) | 61.0 GB | 7.7126 | n/a | n/a | | |
| Q8_0 | 32.4 GB | 7.7356 | 0.007010 | 96.458 | 92.07 | 89.02 |
| Q6_K | 25.1 GB | 7.7558 | 0.015611 | 94.602 | 93.29 | 88.41 |
| Q5_K_M | 21.7 GB | 7.8333 | 0.020963 | 93.811 | 95.73 | 92.68 |
| Q4_K_M | 18.6 GB | 7.9468 | 0.041855 | 91.342 | 93.29 | 90.24 |
| IQ4_XS | 16.4 GB | 7.9794 | 0.049137 | 90.705 | 92.68 | 88.41 |
| IQ3_M | 13.6 GB | 8.2776 | 0.112035 | 85.919 | 90.85 | 87.20 |
| IQ2_M | 10.3 GB | 9.9756 | 0.283656 | 77.616 | 84.15 | 79.88 |
| IQ2_XS | 9.2 GB | 11.0666 | 0.426120 | 73.339 | 79.88 | 77.44 |
| IQ2_XXS | 8.3 GB | 12.6780 | 0.549859 | 69.743 | 59.15 | 59.15 |
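
To make the mean KLD and top-1 % columns concrete, here is a minimal sketch of how such numbers are typically derived from two sets of logits. It assumes the usual definitions (KL divergence of the quantized distribution from the BF16 reference, and the share of positions where both models pick the same most-likely token); the function names and the toy logits are illustrative and are not taken from the tooling used for this release.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one position's vocabulary logits."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_z for x in logits]


def kl_divergence(ref_logits, quant_logits):
    """KL(reference || quantized) in nats for a single token position."""
    ref_lp = log_softmax(ref_logits)
    quant_lp = log_softmax(quant_logits)
    return sum(math.exp(r) * (r - q) for r, q in zip(ref_lp, quant_lp))


def argmax(values):
    """Index of the largest value (first one wins on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def compare(ref_positions, quant_positions):
    """Return (mean KLD, top-1 agreement in %) over aligned token positions."""
    pairs = list(zip(ref_positions, quant_positions))
    klds = [kl_divergence(r, q) for r, q in pairs]
    same = sum(argmax(r) == argmax(q) for r, q in pairs)
    return sum(klds) / len(klds), 100.0 * same / len(klds)


# Toy data: three positions over a three-token vocabulary (made up for illustration).
ref = [[2.0, 1.0, 0.1], [0.5, 0.4, 3.0], [1.0, 1.1, 0.0]]
quant = [[2.1, 0.9, 0.1], [0.4, 0.5, 2.9], [1.1, 1.0, 0.0]]
mean_kld, top1 = compare(ref, quant)
print(f"mean KLD {mean_kld:.6f}, top-1 agreement {top1:.3f} %")
```

In the toy data the third position is a near-tie that flips under the perturbed logits, which is the kind of event that lowers top-1 % while barely moving the KLD.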

HumanEval is pass@1 over 164 problems, so each problem is worth about 0.6 points and single-token greedy flips on a handful of problems move the score by a few points; read it as a sanity check, not a fine-grained ranking. The Q4-through-Q8 quants are statistically interchangeable on it (the spread is noise); mean KLD and top-1 % give the reliable quality ordering. The slope only becomes clear lower down: IQ3_M holds up, the IQ2 tier degrades visibly, and IQ2_XXS falls off a cliff. Its identical HumanEval and HumanEval+ scores are the giveaway: it produces enough malformed code that the extra tests prune almost nothing further.
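
The granularity argument can be checked with a little arithmetic. The sketch below converts the HumanEval scores from the headline table back into solved-problem counts and attaches a binomial standard error. Treating the 164 problems as independent trials is a simplifying assumption (the quants are evaluated on the same problems, so a paired analysis would be more precise), and the helper names are illustrative.

```python
import math

N_PROBLEMS = 164  # HumanEval problem count, as stated above

# HumanEval pass@1 scores (%) copied from the headline table.
scores = {
    "Q8_0": 92.07,
    "Q6_K": 93.29,
    "Q5_K_M": 95.73,
    "Q4_K_M": 93.29,
    "IQ4_XS": 92.68,
}


def solved(score_pct, n=N_PROBLEMS):
    """Number of problems solved that corresponds to a pass@1 percentage."""
    return round(score_pct / 100 * n)


def std_error_pct(score_pct, n=N_PROBLEMS):
    """Binomial standard error of a pass@1 score, in percentage points."""
    p = score_pct / 100
    return 100 * math.sqrt(p * (1 - p) / n)


points_per_problem = 100 / N_PROBLEMS
print(f"one problem = {points_per_problem:.2f} points")
for name, score in scores.items():
    print(f"{name:8s} {solved(score):3d}/{N_PROBLEMS} solved, "
          f"s.e. about {std_error_pct(score):.1f} points")
```

Under this approximation the Q4-through-Q8 scores span 151 to 157 solved problems, a gap of six problems, while one standard error is already roughly two points (about three problems).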

Recommendations: Q5_K_M if you have the memory (effectively lossless), IQ4_XS for the best size/quality ratio (matches Q4_K_M while being 2.2 GB smaller), and IQ3_M as the smallest quant still reasonable for code. The IQ2 tier exists for memory-constrained setups and degrades noticeably, so set expectations accordingly. The embeddings are tied with the output head and kept at q6_K on Q4_K_M and below.
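
As a rough aid for applying these recommendations, the sketch below picks the largest quant whose file fits a given memory budget. The file sizes come from the headline table; the `headroom_gb` parameter is a hypothetical allowance for KV cache and runtime overhead, not a measured value, so adjust it for your context length and backend.

```python
# File sizes in GB, copied from the headline table, ordered largest first.
QUANT_SIZES_GB = [
    ("Q8_0", 32.4),
    ("Q6_K", 25.1),
    ("Q5_K_M", 21.7),
    ("Q4_K_M", 18.6),
    ("IQ4_XS", 16.4),
    ("IQ3_M", 13.6),
    ("IQ2_M", 10.3),
    ("IQ2_XS", 9.2),
    ("IQ2_XXS", 8.3),
]


def pick_quant(memory_gb, headroom_gb=2.0):
    """Return the largest quant that fits in memory_gb minus headroom_gb, or None.

    headroom_gb is an assumed reserve for KV cache and runtime buffers.
    """
    budget = memory_gb - headroom_gb
    for name, size_gb in QUANT_SIZES_GB:
        if size_gb <= budget:
            return name
    return None


for memory in (48, 24, 16, 8):
    print(f"{memory} GB available -> {pick_quant(memory)}")
```

This only ranks by size; where Q4_K_M would be selected, the text above suggests IQ4_XS may be the better trade.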

Per-domain breakdown

The three sets below are also part of the imatrix calibration corpus, so their numbers carry a mild in-distribution bias - read them as domain comparisons rather than held-out scores. All corpora are included in eval-corpora.tar.zst for reproduction.

General / multilingual (calibration_datav3)

bartowski's calibration_datav3: the de-facto community calibration mix - short English prose, multilingual snippets, code fragments, technical text and deliberate noise sections (~275 kB).

| file | PPL | mean KLD | top-1 % |
|---|---|---|---|
| BF16 | 9.0079 | n/a | n/a |
| Q8_0 | 9.0261 | 0.008424 | 96.788 |
| Q6_K | 9.0351 | 0.014500 | 95.286 |
| Q5_K_M | 9.0491 | 0.019470 | 94.506 |
| Q4_K_M | 9.1607 | 0.036786 | 92.031 |
| IQ4_XS | 9.1125 | 0.039540 | 91.882 |
| IQ3_M | 9.4710 | 0.087992 | 87.714 |
| IQ2_M | 10.2735 | 0.208782 | 80.580 |
| IQ2_XS | 11.1268 | 0.319906 | 76.376 |
| IQ2_XXS | 12.3083 | 0.427367 | 72.173 |

Code

A seeded random sample of real source files from the llama.cpp tree (MIT): C/C++ core and ggml, Python conversion tooling, shell scripts; capped at 25 kB per file, ~400 kB total. Note how confident the model is on code (PPL ~2.4) - and that top-1 agreement holds up better here than on prose at every quant level.

| file | PPL | mean KLD | top-1 % |
|---|---|---|---|
| BF16 | 2.4043 | n/a | n/a |
| Q8_0 | 2.4108 | 0.005231 | 98.512 |
| Q6_K | 2.4123 | 0.008321 | 97.731 |
| Q5_K_M | 2.4155 | 0.012198 | 97.145 |
| Q4_K_M | 2.4314 | 0.025947 | 95.898 |
| IQ4_XS | 2.4452 | 0.030205 | 95.472 |
| IQ3_M | 2.4996 | 0.072891 | 92.991 |
| IQ2_M | 2.7561 | 0.186894 | 88.646 |
| IQ2_XS | 3.0247 | 0.290555 | 85.260 |
| IQ2_XXS | 3.2342 | 0.368478 | 83.263 |
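
For readers who want to build a comparable code corpus, the following is one possible sketch of a seeded sample with a per-file cap and a total budget, as described above. It is not the script used for this release: the function name, the suffix list, and the seed are assumptions, and the exact corpus is the one shipped in eval-corpora.tar.zst.

```python
import random
from pathlib import Path


def sample_corpus(root, suffixes=(".c", ".cpp", ".h", ".py", ".sh"),
                  per_file_cap=25_000, total_budget=400_000, seed=0):
    """Concatenate a reproducible random sample of source files under root.

    Each file contributes at most per_file_cap bytes and sampling stops once
    total_budget bytes of file content have been collected.
    """
    rng = random.Random(seed)
    # Sort first so the shuffle does not depend on filesystem ordering.
    files = sorted(p for p in Path(root).rglob("*")
                   if p.is_file() and p.suffix in suffixes)
    rng.shuffle(files)

    chunks, used = [], 0
    for path in files:
        if used >= total_budget:
            break
        limit = min(per_file_cap, total_budget - used)
        data = path.read_bytes()[:limit]  # byte cap; may cut a file mid-line
        chunks.append(data)
        used += len(data)
    return b"\n".join(chunks)
```

A call such as `sample_corpus("llama.cpp")` on a local checkout would return roughly 400 kB of bytes; the same seed reproduces the same sample as long as the tree is unchanged.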

Chat (model-native format)

Hand-written for this release: 13 short programming conversations (Python/SQL/C/Rust/git topics, two in German), each with a thinking block, plus one complete tool-call round trip - rendered in the model's raw turn-token dialect (<|START_OF_TURN_TOKEN|>, <|START_THINKING|>, <|START_ACTION|>, ...). This exercises the control-token and expert-routing paths that real chat traffic hits and plain text never does. Small set (~7 chunks) - treat the numbers as indicative.

| file | PPL | mean KLD | top-1 % |
|---|---|---|---|
| BF16 | 1.9660 | n/a | n/a |
| Q8_0 | 1.9866 | 0.022651 | 98.431 |
| Q6_K | 1.9906 | 0.031189 | 98.170 |
| Q5_K_M | 1.9820 | 0.025972 | 97.778 |
| Q4_K_M | 1.9641 | 0.070232 | 96.993 |
| IQ4_XS | 1.9866 | 0.058722 | 96.601 |
| IQ3_M | 2.0809 | 0.081966 | 94.902 |
| IQ2_M | 2.1412 | 0.173477 | 92.288 |
| IQ2_XS | 2.1742 | 0.251918 | 89.412 |
| IQ2_XXS | 2.2247 | 0.297151 | 87.974 |

Reasoning / chat template

These GGUFs embed an additively normalized chat template (also in this repo as chat_template.jinja): the standard enable_thinking / reasoning_content conventions are mapped onto Cohere's native reasoning / reasoning_effort / thinking variables, so llama.cpp detects reasoning support automatically (thinking = 1), separates reasoning_content from content, and supports thinking toggles. All Cohere-native variables keep working; rendering is byte-identical for native invocations.

llama-server -m North-Mini-Code-1.0-Q5_K_M.gguf --jinja

thinking on (default): response arrives as reasoning_content + content
disable thinking per request: "chat_template_kwargs": {"enable_thinking": false} (or Cohere-native: {"reasoning_effort": "none"})
tool calling works through the OpenAI-compatible API (parallel calls included)
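
The per-request toggle above can be exercised from Python with only the standard library. This is a minimal sketch that assumes llama-server is running locally on its default port 8080 with `--jinja`; the helper names are illustrative, and the response is read through the `reasoning_content` and `content` fields described above.

```python
import json
import urllib.request

# Assumed address: llama-server's default port; change it if you passed --port.
BASE_URL = "http://localhost:8080/v1"


def build_chat_request(prompt, enable_thinking=True):
    """Build an OpenAI-style chat body, optionally turning thinking off."""
    body = {"messages": [{"role": "user", "content": prompt}]}
    if not enable_thinking:
        body["chat_template_kwargs"] = {"enable_thinking": False}
    return body


def split_reply(response):
    """Return (reasoning_content, content) from a chat completion response."""
    message = response["choices"][0]["message"]
    return message.get("reasoning_content"), message.get("content")


def chat(prompt, enable_thinking=True, base_url=BASE_URL):
    """POST one chat turn to the server and split reasoning from the answer."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(build_chat_request(prompt, enable_thinking)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as resp:
        return split_reply(json.load(resp))
```

With a server running, `reasoning, answer = chat("Reverse a string in Rust.", enable_thinking=False)` should return an answer with no reasoning text. The Cohere-native form would put `{"reasoning_effort": "none"}` in `chat_template_kwargs` instead.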

imatrix

North-Mini-Code-1.0.imatrix (included) was computed on the bf16 model over the v3 + code + chat mix described above (326 chunks of 512 tokens each), reaching full coverage of all 128 experts in every layer.

Validation

f32 logit-level parity vs HF transformers on a truncated-expert variant of the checkpoint (full-vocab comparison at every position): top-1 agreement 26/27, mean |dlogprob| 0.012 - the only disagreement a 0.013 near-tie.
Tool calling, parallel calls, multi-turn with reasoning passback, and a live agentic tool-execution loop verified end to end via llama-server.
The official model card states 256K input / 64K output context; the config's max_position_embeddings is 500k. KV cache at long context stays small thanks to iSWA (only 13 of 49 layers are global; ~13.6 GB KV at 500k).
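
The reason iSWA keeps the KV cache small can be shown with a back-of-the-envelope estimate: global layers store keys and values for the whole context, while sliding-window layers store at most one window. Only the 13-of-49 layer split comes from the text above. The window size, KV head count, head dimension, and f16 cache type in the sketch are placeholder assumptions chosen for illustration, not values read from this model's config.

```python
def kv_cache_bytes(context_tokens, n_layers=49, n_global_layers=13,
                   swa_window=4096,    # ASSUMED sliding-window size
                   n_kv_heads=4,       # ASSUMED number of KV heads
                   head_dim=128,       # ASSUMED head dimension
                   bytes_per_elem=2):  # ASSUMED f16 cache
    """Estimate KV cache size when only global layers keep the full context."""
    per_token_per_layer = 2 * n_kv_heads * head_dim * bytes_per_elem  # K and V
    n_swa_layers = n_layers - n_global_layers
    global_bytes = n_global_layers * context_tokens * per_token_per_layer
    swa_bytes = n_swa_layers * min(context_tokens, swa_window) * per_token_per_layer
    return global_bytes + swa_bytes


context = 500_000
iswa = kv_cache_bytes(context)
# For comparison: the same model if all 49 layers used full attention.
full = kv_cache_bytes(context, n_global_layers=49)
print(f"iSWA: {iswa / 1e9:.1f} GB, all-global: {full / 1e9:.1f} GB")
```

With these placeholder values the estimate comes out close to the ~13.6 GB quoted above, but that agreement does not confirm the placeholders; read the real values from the GGUF metadata before relying on the numbers.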
