81.5 GB
30 files
Updated 8 days ago
| Name | Size |
| --- | --- |
| docs | |
| .gitattributes | 1.68 kB |
| LICENSE | 1.51 kB |
| README.md | 3.15 kB |
| app.py | 2.68 kB |
| chat_template.jinja | 7.76 kB |
| config.json | 4.13 kB |
| configuration.json | 48 Bytes |
| configuration_deepseek.py | 12.5 kB |
| diffusion_pytorch_model.safetensors | 25.9 GB |
| generation_config.json | 244 Bytes |
| merges.txt | 3.35 MB |
| model.safetensors-00001-of-00011.safetensors | 5.26 GB |
| model.safetensors-00002-of-00011.safetensors | 5.35 GB |
| model.safetensors-00003-of-00011.safetensors | 5.35 GB |
| model.safetensors-00004-of-00011.safetensors | 5.35 GB |
| model.safetensors-00005-of-00011.safetensors | 5.35 GB |
| model.safetensors-00006-of-00011.safetensors | 5.35 GB |
| model.safetensors-00007-of-00011.safetensors | 5.35 GB |
| model.safetensors-00008-of-00011.safetensors | 5.37 GB |
| model.safetensors-00009-of-00011.safetensors | 5.35 GB |
| model.safetensors-00010-of-00011.safetensors | 5.35 GB |
| model.safetensors-00011-of-00011.safetensors | 2.15 GB |
| model.safetensors.index.json | 127 kB |
| modeling_deepseek.py | 47.5 kB |
| preprocessor_config.json | 390 Bytes |
| tokenizer.json | 12.8 MB |
| tokenizer_config.json | 16.7 kB |
| video_preprocessor_config.json | 385 Bytes |
| vocab.json | 6.72 MB |
README.md

InstinctRazor — Qwen3.5-122B-A10B · IQ3_XXS GGUF

A sub-4-bit (≈3 bpw) quantization of Qwen3.5-122B-A10B — a 122B hybrid Gated-DeltaNet MoE (256 experts, 8 active) — packed to 48 GiB so it runs on one 80 GB GPU (or a small card + CPU offload). Quantized from the original BF16 with an importance matrix (math + code + general calibration), via llama.cpp.

Framework, recipe, and full reproduction: https://github.com/General-Instinct/InstinctRazor

Files

| file | size | notes |
| --- | --- | --- |
| InstinctRazor-Qwen3.5-122B-A10B-IQ3_XXS.gguf | 48.0 GiB | text model; routed experts IQ3_XXS (≈3.06 bpw) |
| InstinctRazor-Qwen3.5-122B-A10B-mmproj-f16.gguf | 0.8 GiB | vision projector (mmproj) for multimodal via --mmproj |

Protected recipe: routed experts IQ3_XXS · shared-expert int8 · attention int4 · router + Gated-DeltaNet/SSM f16 · embed + lm_head q8_0.
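As a rough sanity check on the size figures above, the average bits per weight implied by the 48.0 GiB file can be worked out directly. This is a minimal sketch, assuming a nominal parameter count of exactly 122e9 (taken from the model name, not from a measured tensor count) and ignoring GGUF metadata overhead. Under that assumption the average comes out at roughly 3.4 bpw, which is consistent with routed experts at ≈3.06 bpw plus the higher-precision protected tensors.

```python
GIB = 2**30  # bytes per GiB


def effective_bpw(file_size_gib: float, n_params: float) -> float:
    """Average bits per weight implied by a file size and a parameter count."""
    return file_size_gib * GIB * 8 / n_params


def gib_to_gb(size_gib: float) -> float:
    """Convert binary gibibytes to decimal gigabytes."""
    return size_gib * GIB / 1e9


# Assumption: nominal 122e9 parameters, as suggested by the model name.
avg_bpw = effective_bpw(48.0, 122e9)
size_gb = gib_to_gb(48.0)
print(f"average bpw ~ {avg_bpw:.2f}, file size ~ {size_gb:.1f} GB")
```

The GiB-to-GB conversion also helps when comparing the 48 GiB file against GPU memory quoted in decimal GB.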

Quality (same-harness, vs the footprint-matched Gemma-4-26B-A4B ≈52 GB)

| benchmark | this GGUF | A4B | note |
| --- | --- | --- | --- |
| MMLU-Pro (n=150) | 90.7 | 85.6 | ≥ A4B, 0 truncation |
| GPQA-Diamond (n=198) | 80.8 | 79.3 | ≥ A4B, 0 truncation |

These scores track the weight-only fake-quant capability ceiling (MMLU-Pro 88.5–90) to within noise.
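To give a feel for what "within noise" can mean at these sample sizes, here is a small sketch of the binomial standard error of an accuracy score. It assumes each question is an independent pass/fail trial and uses the normal approximation; this is an illustrative model, not the validation procedure used in the repository. With n=150 and a score near 90%, the standard error is about 2.4 points.

```python
import math


def accuracy_standard_error(accuracy: float, n: int) -> float:
    """Binomial standard error of an accuracy measured on n independent items."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n)


# Scores and sample sizes from the table above, expressed as fractions.
for name, acc, n in [("MMLU-Pro", 0.907, 150), ("GPQA-Diamond", 0.808, 198)]:
    se = accuracy_standard_error(acc, n)
    print(f"{name}: {acc * 100:.1f} +/- {se * 100:.1f} points (1 standard error)")
```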

Speed (llama.cpp, this artifact)

  • 1× H100-80GB, all layers on GPU: 115.9 tok/s decode (prefill ≈2541 tok/s).
  • Small card + CPU expert-offload (--n-cpu-moe 48, peak ≈7.6 GiB VRAM): 45.7 tok/s decode — runs on an 8 GB GPU + ≈48 GiB system RAM.
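The throughput figures above translate into wall-clock estimates with simple division. The sketch below assumes steady-state throughput and ignores model load time and any slowdown at long context. The function name and the example token counts are illustrative only, and no prefill rate is quoted for the offload configuration, so only decode time is compared there.

```python
def generation_seconds(prompt_tokens: int, output_tokens: int,
                       prefill_tps: float, decode_tps: float) -> float:
    """Estimated time to process a prompt and then decode a response."""
    return prompt_tokens / prefill_tps + output_tokens / decode_tps


def decode_seconds(output_tokens: int, decode_tps: float) -> float:
    """Estimated decode-only time at a fixed tokens-per-second rate."""
    return output_tokens / decode_tps


# Hypothetical workload: 1000 output tokens on each configuration.
h100 = decode_seconds(1000, 115.9)     # full GPU
offload = decode_seconds(1000, 45.7)   # 8 GB card + CPU expert offload
print(f"H100: {h100:.1f} s, offload: {offload:.1f} s, ratio {offload / h100:.2f}x")
```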

Run

# full GPU
llama-cli -m InstinctRazor-Qwen3.5-122B-A10B-IQ3_XXS.gguf -ngl 999 -fa on -p "Your prompt"
# small card + CPU offload (routed experts on CPU)
llama-cli -m InstinctRazor-Qwen3.5-122B-A10B-IQ3_XXS.gguf -ngl 999 --n-cpu-moe 48 -t 52 -p "Your prompt"
# multimodal (image input)
llama-cli -m InstinctRazor-Qwen3.5-122B-A10B-IQ3_XXS.gguf --mmproj InstinctRazor-Qwen3.5-122B-A10B-mmproj-f16.gguf --image pic.png -p "Describe the image"

Requires a llama.cpp build with qwen3_5_moe support (upstream, 2026-02+).
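For scripted use, the three invocations above can be assembled from one helper. This is a sketch only: `build_llama_cli_args` is a hypothetical helper name, and it uses only the flags that appear in the commands above (`-m`, `-ngl`, `-fa`, `--n-cpu-moe`, `-t`, `--mmproj`, `--image`, `-p`). It builds the argument list without running anything.

```python
from typing import List, Optional

MODEL = "InstinctRazor-Qwen3.5-122B-A10B-IQ3_XXS.gguf"
MMPROJ = "InstinctRazor-Qwen3.5-122B-A10B-mmproj-f16.gguf"


def build_llama_cli_args(prompt: str,
                         model: str = MODEL,
                         n_gpu_layers: Optional[int] = None,
                         flash_attn: bool = False,
                         n_cpu_moe: Optional[int] = None,
                         threads: Optional[int] = None,
                         mmproj: Optional[str] = None,
                         image: Optional[str] = None) -> List[str]:
    """Hypothetical helper: assemble a llama-cli argument list from the flags shown above."""
    args = ["llama-cli", "-m", model]
    if n_gpu_layers is not None:
        args += ["-ngl", str(n_gpu_layers)]
    if flash_attn:
        args += ["-fa", "on"]
    if n_cpu_moe is not None:
        args += ["--n-cpu-moe", str(n_cpu_moe)]
    if threads is not None:
        args += ["-t", str(threads)]
    if mmproj is not None:
        args += ["--mmproj", mmproj]
    if image is not None:
        args += ["--image", image]
    args += ["-p", prompt]
    return args


full_gpu = build_llama_cli_args("Your prompt", n_gpu_layers=999, flash_attn=True)
offload = build_llama_cli_args("Your prompt", n_gpu_layers=999, n_cpu_moe=48, threads=52)
vision = build_llama_cli_args("Describe the image", mmproj=MMPROJ, image="pic.png")
print(" ".join(full_gpu))
```

The resulting list can be passed to `subprocess.run(args, check=True)` on a machine where a suitable `llama-cli` build is on the PATH.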

Scope & roadmap

This GGUF matches or beats the footprint-matched A4B on knowledge, reasoning, and multimodal-MMMU. Where it still trails — code (LiveCodeBench v6) and math / multimodal-math — the loss is largely token-inefficiency introduced by quantization, and is the target of OPD (on-policy distillation), a separate framework we'll open-source later. Eval absolutes are subject to a same-harness validation gate; see the GitHub results/RESULTS.md for full per-number provenance.

Attribution

  • Base model: Qwen3.5-122B-A10B © Qwen — subject to its own model license.
  • Quantization recipe + framework: General Instinct, released under Apache-2.0.
Last updated
Jun 15