license: apache-2.0
base_model:
- Qwen/Qwen3.5-122B-A10B
tags:
- gguf
- llama.cpp
- mixture-of-experts
- quantized
- iq3_xxs
- instinctrazor
pipeline_tag: text-generation
InstinctRazor — Qwen3.5-122B-A10B · IQ3_XXS GGUF
A sub-4-bit (≈3 bpw) quantization of Qwen3.5-122B-A10B — a 122B hybrid Gated-DeltaNet MoE (256 experts, 8 active) — packed to 48 GiB so it runs on one 80 GB GPU (or a small card + CPU offload). Quantized from the original BF16 with an importance matrix (math + code + general calibration), via llama.cpp.
Framework, recipe, and full reproduction: https://github.com/General-Instinct/InstinctRazor
Files
| file | size | notes |
|---|---|---|
| InstinctRazor-Qwen3.5-122B-A10B-IQ3_XXS.gguf | 48.0 GiB | text model: routed experts IQ3_XXS (≈3.06 bpw) |
| InstinctRazor-Qwen3.5-122B-A10B-mmproj-f16.gguf | 0.8 GiB | vision projector (mmproj) for multimodal via --mmproj |
Protected recipe: routed experts IQ3_XXS · shared-expert int8 · attention int4 · router + Gated-DeltaNet/SSM f16 · embed + lm_head q8_0.
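As a rough cross-check on the footprint, the whole-file average bits per weight can be estimated from the file size in the Files table. This is a minimal sketch, not part of the released tooling: `avg_bpw` is a hypothetical helper name, and the 122-billion parameter count is an assumption read off the model name rather than an exact tensor count.

```shell
#!/bin/sh
# Average bits per weight = file size in bits / parameter count.
#   $1: file size in GiB (48.0 from the Files table)
#   $2: nominal parameter count in billions (ASSUMPTION: 122, from the model name)
avg_bpw() {
  awk -v size_gib="$1" -v params_b="$2" 'BEGIN {
    bits = size_gib * 1024 * 1024 * 1024 * 8
    printf "%.2f\n", bits / (params_b * 1e9)
  }'
}

avg_bpw 48.0 122   # prints 3.38
```

Under that assumption the file averages somewhat above the ≈3.06 bpw of the routed experts, which is what one would expect when the shared expert, attention, router, and embeddings are kept at higher precision.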
Quality (same-harness, vs the footprint-matched Gemma-4-26B-A4B ≈52 GB)
| benchmark | this GGUF | A4B | note |
|---|---|---|---|
| MMLU-Pro (n=150) | 90.7 | 85.6 | ≥ A4B, 0 truncation |
| GPQA-Diamond (n=198) | 80.8 | 79.3 | ≥ A4B, 0 truncation |
Tracks the weight-only fake-quant capability ceiling (MMLU-Pro 88.5–90) within noise.
Speed (llama.cpp, this artifact)
- 1× H100-80GB, all layers on GPU: 115.9 tok/s decode (prefill ≈2541 tok/s).
- Small card + CPU expert-offload (--n-cpu-moe 48, peak ≈7.6 GiB VRAM): 45.7 tok/s decode; runs on an 8 GB GPU + ≈48 GiB system RAM.
Run
# full GPU
llama-cli -m InstinctRazor-Qwen3.5-122B-A10B-IQ3_XXS.gguf -ngl 999 -fa on -p "Your prompt"
# small card + CPU offload (routed experts on CPU)
llama-cli -m InstinctRazor-Qwen3.5-122B-A10B-IQ3_XXS.gguf -ngl 999 --n-cpu-moe 48 -t 52 -p "Your prompt"
# multimodal (image input)
llama-cli -m InstinctRazor-Qwen3.5-122B-A10B-IQ3_XXS.gguf --mmproj InstinctRazor-Qwen3.5-122B-A10B-mmproj-f16.gguf --image pic.png -p "Describe the image"
Requires a llama.cpp build with qwen3_5_moe support (upstream, 2026-02+).
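Before launching, it can help to confirm that both downloads are GGUF files at all, since a truncated or mis-saved download otherwise only shows up as a load error. The sketch below assumes the files sit in the current directory; `is_gguf` is a hypothetical helper name, and it checks only the 4-byte `GGUF` magic at the start of the file, not full integrity.

```shell
#!/bin/sh
# is_gguf FILE: succeed if FILE exists and starts with the 4-byte magic "GGUF".
# This is a cheap sanity check only; it does not validate the rest of the file.
is_gguf() {
  [ -f "$1" ] || return 1
  # Strip NUL bytes so arbitrary binary input does not trigger shell warnings.
  magic=$(head -c 4 "$1" | tr -d '\0')
  [ "$magic" = "GGUF" ]
}

# Check the text model and the vision projector named in the Files table.
for f in InstinctRazor-Qwen3.5-122B-A10B-IQ3_XXS.gguf \
         InstinctRazor-Qwen3.5-122B-A10B-mmproj-f16.gguf; do
  if is_gguf "$f"; then
    echo "ok: $f"
  else
    echo "missing or not a GGUF file: $f" >&2
  fi
done
```

A passing check says nothing about whether the installed llama.cpp build supports the architecture; that still depends on the build requirement stated above.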
Scope & roadmap
This GGUF matches or beats the footprint-matched A4B on knowledge, reasoning, and multimodal-MMMU. Where it still trails (code on LiveCodeBench v6, and math / multimodal-math), the loss is largely token-inefficiency introduced by quantization, and is the target of OPD (on-policy distillation), a separate framework we'll open-source later. Eval absolutes are subject to a same-harness validation gate; see results/RESULTS.md in the GitHub repository for full per-number provenance.
Attribution
- Base model: Qwen3.5-122B-A10B © Qwen — subject to its own model license.
- Quantization recipe + framework: General Instinct, released under Apache-2.0.