Chroma Context-1: APEX MXFP4 (GGUF)
MoE-aware mixed-precision MXFP4 quantization of chromadb/context-1
(a 20B agentic-search model built on gpt-oss-20b), produced with APEX (Adaptive Precision for EXpert models)
and an importance matrix (imatrix). Runs the full 128k context on a single 16 GB GPU.
- Base: `chromadb/context-1` (gpt-oss-20b: 24 layers, 32 experts, top-4, hidden 2880, 128k ctx), built on `openai/gpt-oss-20b`
- This file: `context1-mxfp4.gguf`, ~12 GB (vs 39 GiB F16, 69% smaller)
- Runtime: llama.cpp (`llama-server` / `llama-cli`)
Chroma's own card noted "MXFP4 quantized checkpoint coming soon"; this is an independent community GGUF build in the meantime.
Why APEX (not a uniform quant)
MoE models spend most of their parameters in routed experts that are only sparsely active (here 4 of 32 per token). APEX exploits that structure: it assigns precision per tensor type and per layer instead of one global type, keeping the always-hot, heavy-tailed paths at high precision and compressing the sparse experts hard.
Quant recipe (this build)
| Tensor | Precision | Rationale |
|---|---|---|
| Routed experts `ffn_{gate,up,down}_exps` | MXFP4 (imatrix) | bulk of params, sparsely active; compress hard |
| Attention `attn_q/k/v/output` | Q8_0 | hot path |
| `token_embd`, `output` (lm_head) | Q8_0 | |
| Router `ffn_gate_inp`, `attn_sinks`, norms | F32 | tiny, routing/normalization-critical |
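The recipe in the table can be read as a first-match rule list over tensor names. The sketch below is only an illustration of that idea, not APEX's actual code: the `RULES` list, the `pick_precision` helper, the GGUF-style names such as `blk.0.attn_q.weight`, and the F32 fallback for unmatched tensors are all assumptions.

```python
import re

# Hypothetical rule list mirroring the recipe table; the first match wins.
# Order matters: "norm" must be checked before the attention rule so that
# "attn_norm" is kept in F32 rather than treated as an attention projection.
RULES = [
    (r"ffn_(gate|up|down)_exps", "MXFP4"),     # routed experts
    (r"ffn_gate_inp|attn_sinks|norm", "F32"),  # router, attention sinks, norms
    (r"attn_(q|k|v|output)", "Q8_0"),          # attention projections
    (r"^token_embd|^output\.", "Q8_0"),        # embeddings and lm_head
]


def pick_precision(tensor_name: str) -> str:
    """Return the precision the recipe table assigns to a tensor name."""
    for pattern, qtype in RULES:
        if re.search(pattern, tensor_name):
            return qtype
    # Assumption: anything the table does not mention stays at full precision.
    return "F32"


for name in ("blk.0.ffn_up_exps.weight", "blk.0.attn_q.weight",
             "blk.0.ffn_gate_inp.weight", "output.weight"):
    print(f"{name:28s} {pick_precision(name)}")
```

The per-layer part of APEX's assignment is not described in enough detail here to sketch, so this covers only the per-tensor-type split.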
The MXFP4 experts are steered by an imatrix (importance matrix). The non-expert tensors are 2880 wide (not divisible by 256), so K-quants don't apply, and Q8_0 is the precision sweet spot here. We empirically tested Q6_K (where eligible) and BF16 for the non-expert paths; both were equal or worse than Q8_0 at equal or larger size, so Q8_0 was kept.
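The divisibility argument is easy to check by hand. A minimal sketch, assuming llama.cpp's usual block sizes (256-element super-blocks for K-quants, 32-element blocks for Q8_0); the `fits` helper is a hypothetical name:

```python
QK_K = 256     # K-quant super-block size (assumed llama.cpp default)
QK8_0 = 32     # Q8_0 block size
HIDDEN = 2880  # row width of the non-expert tensors, from the model card


def fits(row_width: int, block_size: int) -> bool:
    """A row can be quantized only if it splits into whole blocks."""
    return row_width % block_size == 0


print("K-quants:", fits(HIDDEN, QK_K), f"(2880 / 256 = {HIDDEN / QK_K})")
print("Q8_0:    ", fits(HIDDEN, QK8_0), f"(2880 / 32 = {HIDDEN // QK8_0} blocks)")
```

A 2880-wide row leaves a partial super-block under K-quants but splits into exactly 90 Q8_0 blocks.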
imatrix calibration & dataset
The imatrix was generated layer by layer (memory-frugal: one decoder layer streamed to the GPU at a time).
Calibration dataset: a diverse, agentic-leaning mix (REAM domains), not Wikipedia, re-tokenized with context-1's own tokenizer:
| domain | sequences |
|---|---|
| math | 1015 |
| swe | 915 |
| terminal_agent | 514 |
| science | 514 |
| chat | 414 |
| instruction_following | 314 |
| conversational_agent | 151 |
| total | 3,837 sequences / 29.86M tokens (max_len 12288) |

Generated on a single RTX 5080 (16 GB).
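As a quick sanity check on the calibration mix, the per-domain counts can be summed and turned into shares. This is a small sketch using only the figures listed above; the variable names are illustrative.

```python
# Per-domain sequence counts, copied from the calibration table.
domains = {
    "math": 1015,
    "swe": 915,
    "terminal_agent": 514,
    "science": 514,
    "chat": 414,
    "instruction_following": 314,
    "conversational_agent": 151,
}
TOTAL_TOKENS = 29.86e6  # reported token count
MAX_LEN = 12288         # reported per-sequence cap

total = sum(domains.values())
shares = {name: count / total for name, count in domains.items()}
avg_tokens = TOTAL_TOKENS / total

print(f"total sequences: {total}")
for name, share in shares.items():
    print(f"{name:24s} {share:6.1%}")
print(f"mean tokens per sequence: {avg_tokens:.0f} (cap {MAX_LEN})")
```

The counts add up to the reported 3,837 sequences, and the mean sequence length sits well below the 12288-token cap.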
Files in this repo
| File | What |
|---|---|
| `context1-mxfp4.gguf` | the quantized model (~12 GB) |
| `imatrix_context1.dat` | the importance matrix (llama.cpp legacy format); reuse it to re-quantize |
| `calibration_context1.pt` | the exact tokenized calibration set used to build the imatrix (`{"sequences", "domains"}`), for full reproducibility |
MXFP4 vs NVFP4: why MXFP4
Both MXFP4-expert and NVFP4-expert builds were produced from the same imatrix and evaluated head-to-head (wikitext perplexity, lower = better):
| KV cache | ctx | MXFP4 | NVFP4 |
|---|---|---|---|
| F16 | 512 | 80.22 ± 0.69 | 86.21 ± 0.75 |
| q8_0 | 512 | 80.07 ± 0.69 | 86.29 ± 0.75 |
| q8_0 | 32k | 655.26 ± 6.75 | 921.32 ± 9.64 |
MXFP4 wins at every setting, and its lead widens with context: its perplexity is about 7% lower than NVFP4's at 512 and about 29% lower at 32k, so it degrades far more gracefully on long sequences. q8_0 KV vs F16 KV is essentially free (80.07 vs 80.22).
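The percentages follow directly from the table. A minimal sketch of the arithmetic, taking the gap as MXFP4 perplexity relative to NVFP4 (the `rel_gap` helper is a hypothetical name):

```python
# (KV cache, ctx) -> (MXFP4 PPL, NVFP4 PPL), copied from the table above.
results = {
    ("F16", "512"): (80.22, 86.21),
    ("q8_0", "512"): (80.07, 86.29),
    ("q8_0", "32k"): (655.26, 921.32),
}


def rel_gap(mxfp4: float, nvfp4: float) -> float:
    """Relative perplexity difference of MXFP4 versus NVFP4 (negative = better)."""
    return (mxfp4 - nvfp4) / nvfp4


gaps = {key: rel_gap(*ppl) for key, ppl in results.items()}
for (kv, ctx), gap in gaps.items():
    print(f"KV {kv:5s} ctx {ctx:4s} gap {gap:+.1%}")
```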
⚠️ The absolute PPL is high because context-1 is an agentic-search fine-tune and raw wikitext is out-of-domain (and the 32k rows are only ~8 chunks). These numbers are a relative quant-selection metric, not a statement of the model's task quality. For task quality, evaluate on retrieval/agentic data.
Long-context proof (128k, needle-in-a-haystack)
Verified end-to-end on the RTX 5080 with F16 KV cache at the full 131,072-token context:
- Document: 130,497-token haystack with a secret passphrase planted at 50% depth.
- Result: 130,564 prompt tokens ingested, 14.8 GB / 16 GB VRAM (no OOM), `finish_reason: stop`, and the model returned the exact planted passphrase.
So the full 128k window is usable with F16 KV on 16 GB. q8_0 KV (~13.4 GB) leaves more headroom at equal quality.
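For readers who want to repeat a test of this kind, the prompt construction can be sketched as follows. This is not the script used for the run above: `build_haystack`, the filler text, and the passphrase are hypothetical, and depth is measured in filler sentences rather than tokens.

```python
def build_haystack(filler_sentences: list[str], needle: str, depth: float) -> str:
    """Insert `needle` at a fractional `depth` (0.0 = start, 1.0 = end) of the filler."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    index = int(len(filler_sentences) * depth)
    parts = filler_sentences[:index] + [needle] + filler_sentences[index:]
    return " ".join(parts)


# Hypothetical filler and passphrase, for illustration only.
filler = [f"Filler sentence number {i}." for i in range(10)]
needle = "The secret passphrase is EXAMPLE-ONLY."
document = build_haystack(filler, needle, depth=0.5)
question = "What is the secret passphrase mentioned in the document?"
prompt = f"{document}\n\n{question}"
print(prompt)
```

To reach a given token count, the filler would need to be sized with the model's own tokenizer, which this sketch leaves out.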
Serving (llama.cpp)
llama-server \
--model context1-mxfp4.gguf --alias context-1 \
--n-gpu-layers 99 --ctx-size 131072 --parallel 1 \
--cache-type-k f16 --cache-type-v f16 \
--flash-attn on --jinja \
--host 0.0.0.0 --port 8080
- `--jinja` enables the embedded harmony chat template. gpt-oss puts chain-of-thought in a separate `reasoning_content` channel; allow `max_tokens` ≥ ~128 or the visible `content` can come back empty.
- For more VRAM headroom use `--cache-type-k q8_0 --cache-type-v q8_0` (≈ identical quality, ~13.4 GB @128k).
- Max context is 131,072 (native; no rope scaling needed).
OpenAI-compatible API at http://localhost:8080/v1.
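A minimal client sketch for that endpoint, using only the Python standard library. It assumes the server was started with `--alias context-1` as in the command above and that the response carries the reasoning in `message.reasoning_content`; the helper names and the default `max_tokens=512` are illustrative choices.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080/v1"  # matches --port 8080 in the serving command


def build_request(prompt: str, max_tokens: int = 512) -> dict:
    """Build a chat-completions payload with room for reasoning plus the answer."""
    return {
        "model": "context-1",  # the --alias used when starting llama-server
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def split_reply(response: dict) -> tuple[str, str]:
    """Return (reasoning, visible answer) from a chat-completions response."""
    message = response["choices"][0]["message"]
    return message.get("reasoning_content") or "", message.get("content") or ""


def chat(prompt: str, max_tokens: int = 512) -> tuple[str, str]:
    """POST the prompt to the local server and split the reply."""
    body = json.dumps(build_request(prompt, max_tokens)).encode("utf-8")
    request = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as resp:
        return split_reply(json.load(resp))
```

With the server running, `reasoning, answer = chat("What is the capital of France?")` returns both channels; an empty `answer` alongside non-empty `reasoning` usually means `max_tokens` was too small.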
Hardware
Fits a single 16 GB GPU at full 128k context: ~12 GB weights + ~1.4 GB (q8_0) or ~2.8 GB (f16) KV.
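The budget can be tallied from the figures above. A small sketch using only the numbers stated in this card; the `budget` helper is a hypothetical name, and compute buffers and other runtime overhead are not modeled.

```python
WEIGHTS_GB = 12.0                   # approximate size of context1-mxfp4.gguf
KV_GB = {"q8_0": 1.4, "f16": 2.8}   # approximate KV cache at 128k context
GPU_GB = 16.0


def budget(kv_type: str) -> tuple[float, float]:
    """Return (estimated usage, remaining headroom) in GB for a KV cache type."""
    used = round(WEIGHTS_GB + KV_GB[kv_type], 1)
    return used, round(GPU_GB - used, 1)


for kv_type in KV_GB:
    used, free = budget(kv_type)
    print(f"{kv_type:5s} KV: ~{used} GB used, ~{free} GB headroom")
```

The totals agree with the 14.8 GB measured with F16 KV and the ~13.4 GB quoted for q8_0 KV.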
Credits & license
- Base model: `chromadb/context-1` (technical report), on `openai/gpt-oss-20b`.
- Quantization: APEX MoE-aware mixed precision + imatrix, via a fork of llama.cpp with first-class MXFP4/NVFP4.
- License: Apache-2.0 (inherited from the base model).