How to use pirola/context-1-mxfp4-gguf with libraries, inference servers, and local apps.

Libraries

How to use pirola/context-1-mxfp4-gguf with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="pirola/context-1-mxfp4-gguf",
	filename="context1-mxfp4.gguf",
)

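response = llm.create_chat_completion(
	messages=[
		{
			"role": "user",
			"content": "What is the capital of France?"
		}
	]
)

# The result follows the OpenAI chat-completion schema; print the reply text.
print(response["choices"][0]["message"]["content"])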

Local Apps

llama.cpp

How to use pirola/context-1-mxfp4-gguf with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf pirola/context-1-mxfp4-gguf
# Run inference directly in the terminal:
llama-cli -hf pirola/context-1-mxfp4-gguf

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf pirola/context-1-mxfp4-gguf
# Run inference directly in the terminal:
llama-cli -hf pirola/context-1-mxfp4-gguf

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf pirola/context-1-mxfp4-gguf
# Run inference directly in the terminal:
./llama-cli -hf pirola/context-1-mxfp4-gguf

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf pirola/context-1-mxfp4-gguf
# Run inference directly in the terminal:
./build/bin/llama-cli -hf pirola/context-1-mxfp4-gguf

Use Docker

docker model run hf.co/pirola/context-1-mxfp4-gguf

vLLM

How to use pirola/context-1-mxfp4-gguf with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "pirola/context-1-mxfp4-gguf"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "pirola/context-1-mxfp4-gguf",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/pirola/context-1-mxfp4-gguf

Ollama
How to use pirola/context-1-mxfp4-gguf with Ollama:
```
ollama run hf.co/pirola/context-1-mxfp4-gguf
```

Unsloth Studio

How to use pirola/context-1-mxfp4-gguf with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for pirola/context-1-mxfp4-gguf to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for pirola/context-1-mxfp4-gguf to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for pirola/context-1-mxfp4-gguf to start chatting

Pi

How to use pirola/context-1-mxfp4-gguf with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf pirola/context-1-mxfp4-gguf

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "pirola/context-1-mxfp4-gguf"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use pirola/context-1-mxfp4-gguf with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf pirola/context-1-mxfp4-gguf

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default pirola/context-1-mxfp4-gguf

Run Hermes

hermes

Docker Model Runner
How to use pirola/context-1-mxfp4-gguf with Docker Model Runner:
```
docker model run hf.co/pirola/context-1-mxfp4-gguf
```

Lemonade

How to use pirola/context-1-mxfp4-gguf with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull pirola/context-1-mxfp4-gguf

Run and chat with the model

lemonade run user.context-1-mxfp4-gguf-{{QUANT_TAG}}

List all available models

lemonade list

Chroma Context-1 — APEX MXFP4 (GGUF)

MoE-aware mixed-precision MXFP4 quantization of chromadb/context-1 (an agentic-search model built on the 20B-parameter gpt-oss-20b), produced with APEX (Adaptive Precision for EXpert models) and an importance matrix (imatrix). It runs the full 128k context on a single 16 GB GPU.

Base: chromadb/context-1 (gpt-oss-20b: 24 layers, 32 experts, top-4, hidden 2880, 128k ctx) → built on openai/gpt-oss-20b
This file: context1-mxfp4.gguf, ~12 GB (about 69% smaller than the 39 GiB F16 model)
Runtime: llama.cpp (llama-server / llama-cli)

Chroma's own card noted "MXFP4 quantized checkpoint coming soon" — this is an independent community GGUF build in the meantime.

Why APEX (not a uniform quant)

MoE models hold most of their parameters in routed experts that are only sparsely active (here 4 of 32 per token). APEX exploits that structure: instead of applying one global quantization type, it assigns precision per tensor type and per layer, keeping the always-hot, heavy-tailed paths at high precision and compressing the sparse experts hard.

Quant recipe (this build)

| Tensor | Precision | Rationale |
| --- | --- | --- |
| Routed experts `ffn_{gate,up,down}_exps` | MXFP4 (imatrix) | bulk of params, sparsely active → compress hard |
| Attention `attn_q/k/v/output` | Q8_0 | hot path |
| `token_embd`, `output` (lm_head) | Q8_0 | |
| Router `ffn_gate_inp`, `attn_sinks`, norms | F32 | tiny, routing/normalization-critical |

The MXFP4 expert quantization is guided by an imatrix (importance matrix). The non-expert tensors are 2880 wide, which is not divisible by 256, so K-quants don't apply to them and Q8_0 is the precision sweet spot. We empirically tested Q6_K (where eligible) and BF16 for the non-expert paths; both were equal to or worse than Q8_0 at equal or larger size, so Q8_0 was kept.
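As an illustration only, the recipe table can be read as a small rule set keyed on tensor names. The sketch below is not APEX's actual code: `pick_precision` and `k_quant_eligible` are hypothetical helpers, and the tensor names assume llama.cpp's usual GGUF naming (for example `blk.0.ffn_up_exps.weight`).

```python
import re

# Hypothetical rule set mirroring the recipe table. Order matters: the router
# (ffn_gate_inp) and the norms must be matched before the broader attention
# and output patterns.
RULES = [
    (re.compile(r"ffn_(gate|up|down)_exps"), "MXFP4"),     # routed experts
    (re.compile(r"ffn_gate_inp|attn_sinks|norm"), "F32"),  # router, sinks, norms
    (re.compile(r"attn_(q|k|v|output)"), "Q8_0"),          # attention projections
    (re.compile(r"^(token_embd|output)\."), "Q8_0"),       # embeddings, lm_head
]


def pick_precision(tensor_name: str) -> str:
    """Return the precision the recipe assigns to a tensor name."""
    for pattern, precision in RULES:
        if pattern.search(tensor_name):
            return precision
    raise ValueError(f"no rule for tensor: {tensor_name}")


def k_quant_eligible(row_width: int, super_block: int = 256) -> bool:
    """K-quants pack weights in 256-wide super-blocks, so rows must divide evenly."""
    return row_width % super_block == 0


if __name__ == "__main__":
    for name in ("blk.0.ffn_up_exps.weight", "blk.0.attn_q.weight",
                 "blk.0.ffn_gate_inp.weight", "output.weight"):
        print(name, "->", pick_precision(name))
    print("2880 eligible for K-quants:", k_quant_eligible(2880))
```

Q8_0 works in 32-wide blocks, and 2880 = 90 × 32, which is why it stays applicable to the 2880-wide tensors where K-quants are not.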

imatrix calibration & dataset

The imatrix was generated layer by layer, a memory-frugal approach in which one decoder layer is streamed to the GPU at a time.

Calibration dataset: a diverse, agentic-leaning mix (REAM domains) — not Wikipedia — re-tokenized with context-1's own tokenizer:

| Domain | Sequences |
| --- | --- |
| math | 1,015 |
| swe | 915 |
| terminal_agent | 514 |
| science | 514 |
| chat | 414 |
| instruction_following | 314 |
| conversational_agent | 151 |
| total | 3,837 sequences / 29.86M tokens (max_len 12288) |
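To make the calibration mix easier to compare, the short sketch below recomputes the total, each domain's share, and the mean sequence length from the figures in the table. Per-domain token counts are not given, so the mean is an overall average only.

```python
# Sequence counts copied from the calibration table.
DOMAIN_SEQUENCES = {
    "math": 1015,
    "swe": 915,
    "terminal_agent": 514,
    "science": 514,
    "chat": 414,
    "instruction_following": 314,
    "conversational_agent": 151,
}
TOTAL_TOKENS = 29.86e6  # reported total across all sequences

total = sum(DOMAIN_SEQUENCES.values())
shares = {domain: n / total for domain, n in DOMAIN_SEQUENCES.items()}
mean_tokens = TOTAL_TOKENS / total

for domain, share in shares.items():
    print(f"{domain:<24}{share:6.1%}")
print(f"total sequences: {total}")
print(f"mean tokens per sequence: {mean_tokens:,.0f}")
```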

Generated on a single RTX 5080 (16 GB).

Files in this repo

| File | What |
| --- | --- |
| `context1-mxfp4.gguf` | the quantized model (~12 GB) |
| `imatrix_context1.dat` | the importance matrix (llama.cpp legacy format); reuse it to re-quantize |
| `calibration_context1.pt` | the exact tokenized calibration set used to build the imatrix (`{"sequences", "domains"}`), for full reproducibility |

MXFP4 vs NVFP4 — why MXFP4

Both MXFP4-expert and NVFP4-expert builds were produced from the same imatrix and evaluated head-to-head on wikitext perplexity (PPL, lower = better):

| KV cache | Context | MXFP4 PPL | NVFP4 PPL |
| --- | --- | --- | --- |
| F16 | 512 | 80.22 ± 0.69 | 86.21 ± 0.75 |
| q8_0 | 512 | 80.07 ± 0.69 | 86.29 ± 0.75 |
| q8_0 | 32k | 655.26 ± 6.75 | 921.32 ± 9.64 |

MXFP4 wins at every setting, and its lead widens with context (−7% at 512 → −29% at 32k) — it degrades far more gracefully on long sequences. q8_0 KV vs F16 KV is essentially free (80.07 vs 80.22).

⚠️ The absolute PPL is high because context-1 is an agentic-search fine-tune and raw wikitext is out-of-domain (and the 32k rows are only ~8 chunks). These numbers are a relative quant-selection metric, not a statement of the model's task quality. For task quality, evaluate on retrieval/agentic data.
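The relative gaps quoted above (roughly −7% at 512 and −29% at 32k) can be recomputed from the table. This is a small sketch using only the reported mean values; it ignores the ± error bars.

```python
# (KV cache type, context) -> (MXFP4 PPL, NVFP4 PPL), copied from the table.
RESULTS = {
    ("F16", "512"): (80.22, 86.21),
    ("q8_0", "512"): (80.07, 86.29),
    ("q8_0", "32k"): (655.26, 921.32),
}


def relative_gap(mxfp4: float, nvfp4: float) -> float:
    """Percent difference of MXFP4 PPL relative to NVFP4 (negative = MXFP4 lower)."""
    return (mxfp4 / nvfp4 - 1.0) * 100.0


for (kv, ctx), (mx, nv) in RESULTS.items():
    print(f"KV {kv:<5} ctx {ctx:<4} MXFP4 vs NVFP4: {relative_gap(mx, nv):+.1f}%")
```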

Long-context proof (128k, needle-in-a-haystack)

Verified end-to-end on the RTX 5080 with F16 KV cache at the full 131,072-token context:

Document: 130,497-token haystack with a secret passphrase planted at 50% depth.
Result: 130,564 prompt tokens ingested, 14.8 GB / 16 GB VRAM (no OOM), finish_reason: stop, and the model returned the exact planted passphrase.

So the full 128k window is usable with F16 KV on 16 GB. q8_0 KV (~13.4 GB) leaves more headroom at equal quality.
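For readers who want to run a similar check, here is a minimal sketch of how a needle-in-a-haystack document can be assembled. The filler text, the passphrase, and both helper names are hypothetical; the exact document and prompt used for the run above are not published here, and depth is measured in sentences rather than tokens.

```python
def build_haystack(filler: str, needle: str, n_fillers: int, depth: float = 0.5) -> str:
    """Repeat `filler` n_fillers times and insert `needle` at the given relative depth."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0 and 1")
    position = round(n_fillers * depth)
    sentences = [filler] * n_fillers
    sentences.insert(position, needle)
    return " ".join(sentences)


def found_needle(answer: str, passphrase: str) -> bool:
    """Exact-substring check of the model's reply against the planted passphrase."""
    return passphrase in answer


if __name__ == "__main__":
    # Hypothetical passphrase and filler, for illustration only.
    needle = "The secret passphrase is HYPOTHETICAL-42."
    document = build_haystack("The sky is blue.", needle, n_fillers=1000)
    print(len(document), "characters; needle at offset", document.index(needle))
```

To reach a specific token budget, grow `n_fillers` until the tokenized prompt lands just under the server's `--ctx-size`.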

Serving (llama.cpp)

llama-server \
  --model context1-mxfp4.gguf --alias context-1 \
  --n-gpu-layers 99 --ctx-size 131072 --parallel 1 \
  --cache-type-k f16 --cache-type-v f16 \
  --flash-attn on --jinja \
  --host 0.0.0.0 --port 8080

`--jinja` enables the embedded harmony chat template. gpt-oss puts chain-of-thought in a separate `reasoning_content` channel, so allow `max_tokens` ≥ ~128 or the visible content can come back empty.
For more VRAM headroom, use `--cache-type-k q8_0 --cache-type-v q8_0` (≈ identical quality, ~13.4 GB @128k).
Max context is 131,072 (native; no rope scaling needed).

OpenAI-compatible API at http://localhost:8080/v1.
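A minimal client sketch for that endpoint, using only the Python standard library. It assumes the server was started with the command above (port 8080, `--alias context-1`); the helper names and the default `max_tokens` of 512 are illustrative choices, not part of llama.cpp.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080/v1"  # address from the llama-server command
MODEL = "context-1"                    # matches the --alias flag


def build_payload(prompt: str, max_tokens: int = 512) -> dict:
    """Build a chat-completion request body; keep max_tokens well above ~128."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def extract_reply(response: dict) -> tuple[str, str]:
    """Return (visible content, reasoning) from a chat-completion response."""
    message = response["choices"][0]["message"]
    return message.get("content") or "", message.get("reasoning_content") or ""


def chat(prompt: str, max_tokens: int = 512) -> tuple[str, str]:
    """POST the prompt to the local server and return (content, reasoning)."""
    request = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(build_payload(prompt, max_tokens)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as resp:
        return extract_reply(json.load(resp))
```

With the server running, `content, reasoning = chat("What is the capital of France?")` returns the visible answer and the reasoning text separately. If `content` comes back empty while `reasoning` is populated, raise `max_tokens`.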

Hardware

Fits a single 16 GB GPU at full 128k context: ~12 GB weights + ~1.4 GB (q8_0) or ~2.8 GB (f16) KV.
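The budget above is simple addition, sketched here with the card's approximate figures. It leaves out compute buffers and other runtime overhead, so treat the headroom as an upper bound.

```python
GPU_GB = 16.0
WEIGHTS_GB = 12.0                  # approximate size of context1-mxfp4.gguf
KV_GB = {"f16": 2.8, "q8_0": 1.4}  # approximate KV cache at 131,072 tokens


def vram_budget(kv_type: str) -> tuple[float, float]:
    """Return (estimated usage, remaining headroom) in GB for a KV cache type."""
    used = WEIGHTS_GB + KV_GB[kv_type]
    return used, GPU_GB - used


for kv_type in KV_GB:
    used, free = vram_budget(kv_type)
    print(f"{kv_type:<5} ~{used:.1f} GB used, ~{free:.1f} GB headroom")
```

The two totals (about 14.8 GB with F16 KV and 13.4 GB with q8_0 KV) match the figures reported in the long-context and serving sections.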

Credits & license

Base model: chromadb/context-1 (technical report), on openai/gpt-oss-20b.
Quantization: APEX MoE-aware mixed precision + imatrix, via a fork of llama.cpp with first-class MXFP4/NVFP4 support.
License: Apache-2.0 (inherited from the base model).

Format: GGUF
Model size: 21B params
Architecture: gpt-oss

Model tree for pirola/context-1-mxfp4-gguf

Base model: openai/gpt-oss-20b
Finetuned: chromadb/context-1
Quantized (8): this model