	---
	license: apache-2.0
	base_model:
	- Qwen/Qwen3.5-122B-A10B
	tags:
	- gguf
	- llama.cpp
	- mixture-of-experts
	- quantized
	- iq3_xxs
	- instinctrazor
	pipeline_tag: text-generation
	---

	# InstinctRazor — Qwen3.5-122B-A10B · IQ3_XXS GGUF

	A sub-4-bit (≈3 bpw) quantization of [Qwen3.5-122B-A10B](https://huggingface.co/Qwen/Qwen3.5-122B-A10B),
	a 122B hybrid Gated-DeltaNet MoE (256 experts, 8 active), packed to 48 GiB so it runs on **one 80 GB
	GPU** (or a small card + CPU offload). Quantized from the **original BF16** weights with an importance matrix
	(math + code + general calibration), via [llama.cpp](https://github.com/ggml-org/llama.cpp).

	Framework, recipe, and full reproduction: https://github.com/General-Instinct/InstinctRazor

	## Files
	| file | size | notes |
	|------|------|-------|
	| `InstinctRazor-Qwen3.5-122B-A10B-IQ3_XXS.gguf` | 48.0 GiB | text model; routed experts IQ3_XXS (≈3.06 bpw) |
	| `InstinctRazor-Qwen3.5-122B-A10B-mmproj-f16.gguf` | 0.8 GiB | vision projector (mmproj) for multimodal via `--mmproj` |

	Protected recipe: routed experts IQ3_XXS · shared-expert int8 · attention int4 · router + Gated-DeltaNet/SSM f16 · embed + lm_head q8_0.
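
As a rough cross-check on the sizes above, the file-level average bits per weight can be estimated from the 48.0 GiB file size. The sketch below assumes a nominal 122 × 10^9 parameters (taken from the model name, not from an exact tensor count), so the result is only approximate; `avg_bpw` is a hypothetical helper name.

```bash
# Estimate average bits per weight (bpw) from a GGUF file size.
# Assumption: the parameter count is the nominal figure from the model name.
# Usage: avg_bpw SIZE_GIB PARAMS_IN_BILLIONS
avg_bpw() {
  local size_gib="$1" params_billion="$2"
  awk -v s="$size_gib" -v p="$params_billion" \
    'BEGIN { printf "%.2f\n", s * 1024 * 1024 * 1024 * 8 / (p * 1e9) }'
}

avg_bpw 48.0 122   # text model from the Files table
```

Under that assumption the average comes out near 3.4 bpw, somewhat above the ≈3.06 bpw of the routed experts, which is consistent with the recipe keeping the other tensor groups at higher precision.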

	## Quality (same-harness, vs the footprint-matched Gemma-4-26B-A4B ≈52 GB)
	| benchmark | this GGUF | A4B | note |
	|-----------|-----------|-----|------|
	| MMLU-Pro (n=150) | 90.7 | 85.6 | ≥ A4B, 0 truncation |
	| GPQA-Diamond (n=198) | 80.8 | 79.3 | ≥ A4B, 0 truncation |

	This GGUF tracks the weight-only fake-quant capability ceiling (MMLU-Pro 88.5–90) within noise.
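
To put "within noise" in perspective, the sampling uncertainty of an accuracy measured on n questions can be approximated with a binomial standard error. The sketch below is an illustration only: it assumes independent questions and a normal approximation, which is not necessarily how the evaluation harness treats variance, and `ci_half_width` is a hypothetical helper name.

```bash
# Approximate 95% confidence half-width (in percentage points) for an
# accuracy score, using the normal approximation to the binomial.
# Assumption: questions are independent samples.
# Usage: ci_half_width ACCURACY_PERCENT N_QUESTIONS
ci_half_width() {
  awk -v acc="$1" -v n="$2" \
    'BEGIN { p = acc / 100; printf "%.1f\n", 1.96 * sqrt(p * (1 - p) / n) * 100 }'
}

ci_half_width 90.7 150   # MMLU-Pro row
ci_half_width 80.8 198   # GPQA-Diamond row
```

With n=150, a score near 90% carries a 95% margin of roughly ±4.6 points under these assumptions, so differences of a point or two on MMLU-Pro are not distinguishable at this sample size.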

	## Speed (llama.cpp, this artifact)
	- 1× H100-80GB, all layers on GPU: 115.9 tok/s decode (prefill ≈2541 tok/s).
	- Small card + CPU expert-offload (`--n-cpu-moe 48`, peak ≈7.6 GiB VRAM): 45.7 tok/s decode — runs on an 8 GB GPU + ≈48 GiB system RAM.
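
These throughput figures translate into wall-clock time as prompt tokens divided by prefill speed, plus generated tokens divided by decode speed. A minimal sketch, assuming throughput stays constant over the whole request (in practice decode tends to slow as context grows) and using a hypothetical helper `estimate_seconds`:

```bash
# Rough request latency from throughput figures.
# Assumption: prefill and decode speeds are constant for the whole request.
# Usage: estimate_seconds PROMPT_TOKENS GEN_TOKENS PREFILL_TPS DECODE_TPS
estimate_seconds() {
  awk -v pt="$1" -v gt="$2" -v pf="$3" -v dc="$4" \
    'BEGIN { printf "%.1f\n", pt / pf + gt / dc }'
}

# 2000-token prompt, 500 generated tokens, H100 figures from the list above
estimate_seconds 2000 500 2541 115.9
```

For the CPU-offload setup only the decode speed is listed, so only the generation term can be estimated: 500 tokens at 45.7 tok/s is about 10.9 s.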

	## Run
	```bash
	# full GPU
	llama-cli -m InstinctRazor-Qwen3.5-122B-A10B-IQ3_XXS.gguf -ngl 999 -fa on -p "Your prompt"
	# small card + CPU offload (routed experts on CPU)
	llama-cli -m InstinctRazor-Qwen3.5-122B-A10B-IQ3_XXS.gguf -ngl 999 --n-cpu-moe 48 -t 52 -p "Your prompt"
	# multimodal (image input)
	llama-cli -m InstinctRazor-Qwen3.5-122B-A10B-IQ3_XXS.gguf --mmproj InstinctRazor-Qwen3.5-122B-A10B-mmproj-f16.gguf --image pic.png -p "Describe the image"
	```
	Requires a llama.cpp build with `qwen3_5_moe` support (upstream, 2026-02+).

	## Scope & roadmap
	This GGUF matches or beats the footprint-matched A4B on knowledge, reasoning, and multimodal-MMMU. Where it
	still trails — code (LiveCodeBench v6) and math / multimodal-math — the loss is largely
	token-inefficiency introduced by quantization, and is the target of OPD (on-policy distillation), a
	separate framework we'll open-source later. Eval absolutes are subject to a same-harness validation gate;
	see the GitHub [`results/RESULTS.md`](https://github.com/General-Instinct/InstinctRazor/blob/main/results/RESULTS.md)
	for full per-number provenance.

	## Attribution
	- Base model: Qwen3.5-122B-A10B © Qwen — subject to its own model license.
	- Quantization recipe + framework: General Instinct, released under Apache-2.0.
