How to use from the llama-cpp-python library
# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="pirola/context-1-mxfp4-gguf",
	filename="context1-mxfp4.gguf",
)
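response = llm.create_chat_completion(
	messages=[
		{
			"role": "user",
			"content": "What is the capital of France?"
		}
	]
)
# The reply text is under choices[0].message.content (OpenAI-style response dict).
print(response["choices"][0]["message"]["content"])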

Chroma Context-1: APEX MXFP4 (GGUF)

MoE-aware mixed-precision MXFP4 quantization of chromadb/context-1 (a 20B gpt-oss-20b agentic-search model), produced with APEX (Adaptive Precision for EXpert models) and an importance matrix (imatrix). Runs the full 128k context on a single 16 GB GPU.

  • Base: chromadb/context-1 (gpt-oss-20b: 24 layers, 32 experts, top-4, hidden 2880, 128k ctx), built on openai/gpt-oss-20b
  • This file: context1-mxfp4.gguf, ~12 GB (vs 39 GiB for F16, ≈ 69% smaller)
  • Runtime: llama.cpp (llama-server / llama-cli)

Chroma's own card noted "MXFP4 quantized checkpoint coming soon"; this is an independent community GGUF build in the meantime.

Why APEX (not a uniform quant)

MoE models spend most of their parameters in routed experts that are only sparsely active (here 4 of 32 per token). APEX exploits that structure: it assigns precision per tensor type and per layer instead of one global type, keeping the always-hot, heavy-tailed paths at high precision and compressing the sparse experts hard.

Quant recipe (this build)

| Tensor | Precision | Rationale |
| --- | --- | --- |
| Routed experts ffn_{gate,up,down}_exps | MXFP4 (imatrix) | bulk of params, sparsely active → compress hard |
| Attention attn_q/k/v/output | Q8_0 | hot path |
| token_embd, output (lm_head) | Q8_0 | |
| Router ffn_gate_inp, attn_sinks, norms | F32 | tiny, routing/normalization-critical |
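
As a rough illustration of the recipe table above, the per-tensor assignment can be sketched as an ordered list of name patterns. This is not APEX's actual code: the `pick_precision` helper and the `blk.N.<name>.weight` naming are assumptions made for the example, and only the tensor families listed in the table are covered.

```python
import re

# Ordered (pattern, precision) rules transcribed from the recipe table.
# Tensor names of the form "blk.N.<name>.weight" are an assumption here.
RULES = [
    (re.compile(r"ffn_(gate|up|down)_exps"), "MXFP4"),     # routed experts
    (re.compile(r"ffn_gate_inp|attn_sinks|norm"), "F32"),  # router, sinks, norms
    (re.compile(r"attn_(q|k|v|output)"), "Q8_0"),          # attention projections
    (re.compile(r"^(token_embd|output)\."), "Q8_0"),       # embeddings and lm_head
]


def pick_precision(tensor_name: str) -> str:
    """Return the precision of the first rule whose pattern matches the name."""
    for pattern, precision in RULES:
        if pattern.search(tensor_name):
            return precision
    raise ValueError(f"no rule for tensor: {tensor_name}")


if __name__ == "__main__":
    for name in ("blk.0.ffn_up_exps.weight", "blk.0.attn_q.weight",
                 "blk.0.ffn_gate_inp.weight", "output.weight"):
        print(name, "->", pick_precision(name))
```

Rule order matters in this sketch: the norm and router patterns are checked before the attention and output patterns so that names such as `attn_norm` or `output_norm` resolve to F32 rather than Q8_0.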

The MXFP4 experts are steered by an imatrix (importance matrix). The non-expert tensors are 2880-wide (not divisible by 256), so K-quants don't apply; Q8_0 is the precision sweet spot here. We empirically tested Q6_K (where eligible) and BF16 for the non-expert paths; both were equal-or-worse than Q8_0 at equal or larger size, so Q8_0 was kept.
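
The divisibility argument can be checked with a few lines of arithmetic. The block sizes below (256-element super-blocks for K-quants, 32-element blocks for Q8_0) are stated as assumptions about llama.cpp's formats rather than taken from this card; the `eligible` helper is hypothetical.

```python
HIDDEN = 2880  # row width of the non-expert tensors, from the text above

# Assumed block sizes: K-quants use 256-element super-blocks, Q8_0 uses 32.
BLOCK_SIZES = {"Q6_K": 256, "Q8_0": 32}


def eligible(width: int, block: int) -> bool:
    """A row can be quantized with a block format only if it splits evenly."""
    return width % block == 0


for name, block in BLOCK_SIZES.items():
    print(f"{name}: block={block}, remainder={HIDDEN % block}, "
          f"eligible={eligible(HIDDEN, block)}")
```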

imatrix calibration & dataset

  • Layer-by-layer generation (memory-frugal, one decoder layer streamed to GPU at a time).

  • Calibration dataset: a diverse, agentic-leaning mix (REAM domains), not Wikipedia, re-tokenized with context-1's own tokenizer:

    | domain | sequences |
    | --- | --- |
    | math | 1015 |
    | swe | 915 |
    | terminal_agent | 514 |
    | science | 514 |
    | chat | 414 |
    | instruction_following | 314 |
    | conversational_agent | 151 |
    | total | 3,837 sequences / 29.86M tokens (max_len 12288) |
  • Generated on a single RTX 5080 (16 GB).
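
To make the calibration mix easier to read, the short sketch below recomputes the total, each domain's share, and the average sequence length from the figures listed above. Nothing here comes from the author's tooling; it is only arithmetic on the published counts.

```python
# Per-domain sequence counts and total token count, copied from the list above.
DOMAIN_SEQUENCES = {
    "math": 1015,
    "swe": 915,
    "terminal_agent": 514,
    "science": 514,
    "chat": 414,
    "instruction_following": 314,
    "conversational_agent": 151,
}
TOTAL_TOKENS = 29.86e6

total_sequences = sum(DOMAIN_SEQUENCES.values())
shares = {d: n / total_sequences for d, n in DOMAIN_SEQUENCES.items()}
mean_tokens = TOTAL_TOKENS / total_sequences

print(f"total sequences: {total_sequences}")
for domain, share in shares.items():
    print(f"{domain:>22}: {share:6.1%}")
print(f"mean tokens per sequence: {mean_tokens:.0f}")
```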

Files in this repo

| File | What |
| --- | --- |
| context1-mxfp4.gguf | the quantized model (~12 GB) |
| imatrix_context1.dat | the importance matrix (llama.cpp legacy format); reuse it to re-quantize |
| calibration_context1.pt | the exact tokenized calibration set used to build the imatrix ({"sequences", "domains"}), for full reproducibility |

MXFP4 vs NVFP4: why MXFP4

Both MXFP4-expert and NVFP4-expert builds were produced from the same imatrix and evaluated head-to-head on wikitext perplexity (PPL; lower is better):

| KV cache | ctx | MXFP4 | NVFP4 |
| --- | --- | --- | --- |
| F16 | 512 | 80.22 ± 0.69 | 86.21 ± 0.75 |
| q8_0 | 512 | 80.07 ± 0.69 | 86.29 ± 0.75 |
| q8_0 | 32k | 655.26 ± 6.75 | 921.32 ± 9.64 |

MXFP4 wins at every setting, and its lead widens with context (−7% at 512 → −29% at 32k): it degrades far more gracefully on long sequences. q8_0 KV vs F16 KV is essentially free (80.07 vs 80.22).
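
The percentage gaps quoted above follow directly from the table. A minimal sketch of that arithmetic, using the mean PPL values only (the error bars are ignored, and `relative_gap` is a helper named for this example):

```python
# (KV cache, context, MXFP4 PPL, NVFP4 PPL) rows copied from the table above.
RESULTS = [
    ("F16", "512", 80.22, 86.21),
    ("q8_0", "512", 80.07, 86.29),
    ("q8_0", "32k", 655.26, 921.32),
]


def relative_gap(mxfp4: float, nvfp4: float) -> float:
    """Percent difference of MXFP4 relative to NVFP4 (negative = MXFP4 lower)."""
    return (mxfp4 - nvfp4) / nvfp4 * 100.0


gaps = [relative_gap(m, n) for _, _, m, n in RESULTS]
for (kv, ctx, _, _), gap in zip(RESULTS, gaps):
    print(f"KV={kv:<5} ctx={ctx:<4} gap={gap:+.1f}%")
```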

⚠️ The absolute PPL is high because context-1 is an agentic-search fine-tune and raw wikitext is out-of-domain (and the 32k rows are only ~8 chunks). These numbers are a relative quant-selection metric, not a statement of the model's task quality. For task quality, evaluate on retrieval/agentic data.

Long-context proof (128k, needle-in-a-haystack)

Verified end-to-end on the RTX 5080 with F16 KV cache at the full 131,072-token context:

  • Document: 130,497-token haystack with a secret passphrase planted at 50% depth.
  • Result: 130,564 prompt tokens ingested, 14.8 GB / 16 GB VRAM (no OOM), finish_reason: stop, and the model returned the exact planted passphrase.

So the full 128k window is usable with F16 KV on 16 GB. q8_0 KV (~13.4 GB) leaves more headroom at equal quality.
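
For readers who want to run a similar check, the sketch below shows one way to plant a passphrase at a given depth and score the reply. It is not the harness used for the result above: `build_haystack` and `found` are hypothetical helpers, and depth is measured in filler sentences rather than tokens.

```python
def build_haystack(filler: list[str], needle: str, depth: float = 0.5) -> str:
    """Insert `needle` after the first `depth` fraction of filler sentences."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0 and 1")
    cut = int(len(filler) * depth)
    return " ".join(filler[:cut] + [needle] + filler[cut:])


def found(answer: str, passphrase: str) -> bool:
    """Case-insensitive check that the model's answer contains the passphrase."""
    return passphrase.lower() in answer.lower()


if __name__ == "__main__":
    filler = [f"Filler sentence number {i}." for i in range(10)]
    doc = build_haystack(filler, "The secret passphrase is 'blue-heron-42'.")
    print(doc)
```

The resulting document would then be sent as the prompt together with a question asking for the passphrase, and `found` applied to the visible answer.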

Serving (llama.cpp)

llama-server \
  --model context1-mxfp4.gguf --alias context-1 \
  --n-gpu-layers 99 --ctx-size 131072 --parallel 1 \
  --cache-type-k f16 --cache-type-v f16 \
  --flash-attn on --jinja \
  --host 0.0.0.0 --port 8080
  • --jinja enables the embedded harmony chat template. gpt-oss puts chain-of-thought in a separate reasoning_content channel, so allow max_tokens ≥ ~128 or the visible content can come back empty.
  • For more VRAM headroom use --cache-type-k q8_0 --cache-type-v q8_0 (≈ identical quality, ~13.4 GB @128k).
  • Max context is 131,072 (native; no rope scaling needed).

OpenAI-compatible API at http://localhost:8080/v1.
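
A minimal client sketch for the server started above, using only the Python standard library. It assumes the server is reachable at the URL shown, that the model name matches the `--alias` value, and that the reasoning text, when present, arrives in a `reasoning_content` field of the message; `build_payload` and `chat` are names chosen for this example.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080/v1"  # from the serving command above


def build_payload(prompt: str, max_tokens: int = 512) -> dict:
    """Chat-completions request body; max_tokens stays well above ~128."""
    return {
        "model": "context-1",  # matches --alias
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def chat(prompt: str, max_tokens: int = 512) -> tuple[str, str]:
    """Send one prompt and return (visible content, reasoning content)."""
    request = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(build_payload(prompt, max_tokens)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    message = body["choices"][0]["message"]
    # Either field may be missing or null, so fall back to an empty string.
    return message.get("content") or "", message.get("reasoning_content") or ""
```

With the server running, `answer, reasoning = chat("What is the capital of France?")` should return the visible reply and, separately, any reasoning text. An empty `answer` alongside a non-empty `reasoning` usually suggests that `max_tokens` was too small.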

Hardware

Fits a single 16 GB GPU at full 128k context: ~12 GB weights + ~1.4 GB (q8_0) or ~2.8 GB (f16) KV.
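
The budget above can be tallied in a few lines. The figures are the approximate ones quoted in this card; the `budget` helper is only a sketch and ignores compute buffers and other runtime overhead.

```python
GPU_GB = 16.0      # single 16 GB GPU
WEIGHTS_GB = 12.0  # approximate size of context1-mxfp4.gguf
KV_GB = {"q8_0": 1.4, "f16": 2.8}  # approximate KV cache at 128k context


def budget(kv_type: str) -> tuple[float, float]:
    """Return (total GB used, GB of headroom) for the given KV cache type."""
    total = WEIGHTS_GB + KV_GB[kv_type]
    return total, GPU_GB - total


for kv_type in KV_GB:
    total, headroom = budget(kv_type)
    print(f"{kv_type}: ~{total:.1f} GB used, ~{headroom:.1f} GB headroom")
```

The totals line up with the figures reported elsewhere in this card: roughly 14.8 GB with F16 KV and roughly 13.4 GB with q8_0 KV.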

Credits & license

  • Base model: chromadb/context-1 (technical report), on openai/gpt-oss-20b.
  • Quantization: APEX MoE-aware mixed precision + imatrix, via a fork of llama.cpp with first-class MXFP4/NVFP4.
  • License: Apache-2.0 (inherited from the base model).