---
license: apache-2.0
base_model:
- Qwen/Qwen3.5-122B-A10B
tags:
- gguf
- llama.cpp
- mixture-of-experts
- quantized
- iq3_xxs
- instinctrazor
pipeline_tag: text-generation
---

# InstinctRazor: Qwen3.5-122B-A10B · IQ3_XXS GGUF

A sub-4-bit (≈3 bpw) quantization of **[Qwen3.5-122B-A10B](https://huggingface.co/Qwen/Qwen3.5-122B-A10B)**,
a 122B hybrid Gated-DeltaNet MoE (256 experts, 8 active), packed to **48 GiB** so it runs on **one 80 GB
GPU** (or a small card + CPU offload). It is quantized **from the original BF16** weights with an importance
matrix (math + code + general calibration), via [llama.cpp](https://github.com/ggml-org/llama.cpp).

Framework, recipe, and full reproduction: **https://github.com/General-Instinct/InstinctRazor**

## Files

| file | size | notes |
|------|------|-------|
| `InstinctRazor-Qwen3.5-122B-A10B-IQ3_XXS.gguf` | 48.0 GiB | text model; routed experts IQ3_XXS (≈3.06 bpw) |
| `InstinctRazor-Qwen3.5-122B-A10B-mmproj-f16.gguf` | 0.8 GiB | vision projector (mmproj) for multimodal via `--mmproj` |

**Protected recipe:** routed experts IQ3_XXS · shared-expert int8 · attention int4 · router + Gated-DeltaNet/SSM f16 · embed + lm_head q8_0.
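
The ≈3.06 bpw figure applies to the routed experts only. Because the recipe keeps the other tensor groups at higher precision, the average over the whole file should come out somewhat higher. The sketch below estimates that average from the 48.0 GiB file size in the table. It assumes a nominal parameter count of 122e9 taken from the model name (the exact count is not stated in this card), and it ignores GGUF metadata overhead.

```python
GIB = 2**30  # bytes per GiB


def average_bits_per_weight(file_size_gib: float, n_params: float) -> float:
    """Average number of bits stored per parameter for a file of the given size."""
    return file_size_gib * GIB * 8 / n_params


# 48.0 GiB is the file size listed in the table above.
# 122e9 is an ASSUMED nominal parameter count read off the model name.
avg_bpw = average_bits_per_weight(48.0, 122e9)
print(f"file-level average: {avg_bpw:.2f} bpw")  # prints 3.38 under these assumptions
```

Under these assumptions the whole file averages roughly 3.4 bpw, which is consistent with a ≈3.06 bpw expert core plus higher-precision protected tensors.
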

## Quality (same harness, vs. the footprint-matched Gemma-4-26B-A4B, ≈52 GB)

| benchmark | this GGUF | Gemma-4-26B-A4B | note |
|-----------|-----------|-----------------|------|
| MMLU-Pro (n=150) | **90.7** | 85.6 | ≥ A4B, 0 truncation |
| GPQA-Diamond (n=198) | **80.8** | 79.3 | ≥ A4B, 0 truncation |

These scores track the capability ceiling measured with weight-only fake quantization (MMLU-Pro 88.5–90) to within noise.
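
To get a rough sense of what "within noise" means at these sample sizes, the sketch below computes a binomial standard error and a 95% half-width for each score. This is a normal-approximation estimate added for illustration, not the authors' procedure, and it only covers sampling noise from the number of questions; harness or decoding variance is not modeled.

```python
import math


def accuracy_noise(accuracy_pct: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Return (standard error, z * standard error) in percentage points."""
    p = accuracy_pct / 100.0
    se = math.sqrt(p * (1.0 - p) / n)  # binomial standard error of a proportion
    return 100.0 * se, 100.0 * z * se


# Scores and sample sizes are taken from the table above.
mmlu_se, mmlu_half = accuracy_noise(90.7, 150)
gpqa_se, gpqa_half = accuracy_noise(80.8, 198)

print(f"MMLU-Pro:     SE {mmlu_se:.2f} pts, 95% half-width {mmlu_half:.2f} pts")
print(f"GPQA-Diamond: SE {gpqa_se:.2f} pts, 95% half-width {gpqa_half:.2f} pts")
```

With n=150 the standard error is a little over 2 points, so a score of 90.7 sitting next to a 88.5–90 ceiling is well inside one standard error.
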

## Speed (llama.cpp, this artifact)

- **1× H100-80GB**, all layers on GPU: **115.9 tok/s** decode (prefill ≈2541 tok/s).
- **Small card + CPU expert-offload** (`--n-cpu-moe 48`, peak ≈7.6 GiB VRAM): **45.7 tok/s** decode. This configuration runs on an 8 GB GPU plus ≈48 GiB of system RAM.
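
The throughput figures above translate into wall-clock time with simple division. The sketch below is a back-of-the-envelope estimate that assumes throughput stays constant regardless of context length, which is an idealization; the 2000-token prompt and 500-token response are hypothetical example sizes, not measurements from this card.

```python
def estimate_seconds(prompt_tokens: int, output_tokens: int,
                     prefill_tps: float, decode_tps: float) -> float:
    """Prefill time plus decode time, assuming constant tokens/second."""
    return prompt_tokens / prefill_tps + output_tokens / decode_tps


def decode_seconds(output_tokens: int, decode_tps: float) -> float:
    """Decode time only, for configurations where no prefill rate is reported."""
    return output_tokens / decode_tps


# Full-GPU figures from the list above: prefill 2541 tok/s, decode 115.9 tok/s.
full_gpu = estimate_seconds(2000, 500, prefill_tps=2541.0, decode_tps=115.9)

# Offload figure from the list above: decode 45.7 tok/s (no prefill rate is given).
offload_decode = decode_seconds(500, decode_tps=45.7)

print(f"full GPU, 2000 in / 500 out: about {full_gpu:.1f} s")
print(f"CPU offload, 500 out (decode only): about {offload_decode:.1f} s")
```
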

## Run

```bash
# full GPU
llama-cli -m InstinctRazor-Qwen3.5-122B-A10B-IQ3_XXS.gguf -ngl 999 -fa on -p "Your prompt"

# small card + CPU offload (routed experts on CPU)
llama-cli -m InstinctRazor-Qwen3.5-122B-A10B-IQ3_XXS.gguf -ngl 999 --n-cpu-moe 48 -t 52 -p "Your prompt"

# multimodal (image input)
llama-cli -m InstinctRazor-Qwen3.5-122B-A10B-IQ3_XXS.gguf --mmproj InstinctRazor-Qwen3.5-122B-A10B-mmproj-f16.gguf --image pic.png -p "Describe the image"
```

Requires a llama.cpp build with `qwen3_5_moe` support (upstream builds from 2026-02 onward).
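
If you launch the model from a script, the command lines above can be assembled programmatically. The helper below is a hypothetical convenience wrapper, not part of llama.cpp or InstinctRazor; it only emits flags that already appear in this card (`-m`, `-ngl`, `-fa on`, `--n-cpu-moe`, `-t`, `-p`) and does not run anything by itself.

```python
import shlex

MODEL = "InstinctRazor-Qwen3.5-122B-A10B-IQ3_XXS.gguf"


def build_llama_cli_args(model, prompt, n_cpu_moe=None, threads=None, flash_attn=False):
    """Build an argv list for llama-cli using only the flags shown in this card."""
    args = ["llama-cli", "-m", model, "-ngl", "999"]
    if flash_attn:
        args += ["-fa", "on"]
    if n_cpu_moe is not None:
        # keep the routed experts of this many layers on the CPU
        args += ["--n-cpu-moe", str(n_cpu_moe)]
    if threads is not None:
        args += ["-t", str(threads)]
    args += ["-p", prompt]
    return args


full_gpu_cmd = build_llama_cli_args(MODEL, "Your prompt", flash_attn=True)
offload_cmd = build_llama_cli_args(MODEL, "Your prompt", n_cpu_moe=48, threads=52)

# Print shell-quoted commands for inspection before running them.
print(shlex.join(full_gpu_cmd))
print(shlex.join(offload_cmd))
```

Passing one of these lists to `subprocess.run(cmd, check=True)` would start the process; building a list rather than a single string avoids shell-quoting problems with prompts that contain spaces or quotes.
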

## Scope & roadmap

This GGUF matches or beats the footprint-matched A4B on knowledge, reasoning, and multimodal MMMU. Where it
still trails, on **code (LiveCodeBench v6)** and **math / multimodal math**, the gap is largely
token inefficiency introduced by quantization. Closing it is the goal of **OPD (on-policy distillation)**, a
separate framework we'll open-source later. Absolute eval numbers are subject to a same-harness validation gate;
see the GitHub [`results/RESULTS.md`](https://github.com/General-Instinct/InstinctRazor/blob/main/results/RESULTS.md)
for full per-number provenance.

## Attribution

- **Base model:** Qwen3.5-122B-A10B © Qwen, subject to its own model license.
- **Quantization recipe + framework:** General Instinct, released under **Apache-2.0**.