---
license: mit
language:
- en
pipeline_tag: text-generation
library_name: gguf
tags:
- quantization
- gguf
- llama-cpp
- imatrix
- hybrid-quantization
- selective-quantization
- priority-queue
- mse
- theoretical-optimization
- qwen3.5
- gemma4
- moe
- mtp
---

# ASHQ1: Autonomous Selective Hybrid Quantization

> ⚠️ **Experimental.** ASHQ1 is a personal research project that I will be refining over time. Use at your own risk. Results may vary between architectures and fine-tunes. Feedback and contributions welcome.

**Latest update (v6):** The classifier has been overhauled: the empirical depth-weighting heuristic was removed after A/B testing confirmed it added no value. Quality improved as a result, so the same budget now goes further.

ASHQ1 is a post-training quantization method for GGUF models that uses an **imatrix-driven priority queue** to maximise theoretical quality per megabyte. Instead of uniform bit-depth or heuristic layer-blocking, it treats tied tensor groups as monolithic entities and greedily upgrades them by strict mathematical utility: the product of summed importance and theoretical MSE reduction, divided by size cost.

## Results

| Method | Model | Size | PPL (ctx 1024) | Δ vs Uniform |
|:-------|:------|:----:|:--------------:|:------------:|
| **ASHQ1** (v6) | Ornith-1.0-9B-MTP | 6012 MiB | **7.4697 ± 0.04862** | **−0.1551** |
| Uniform Q6_K | Ornith-1.0-9B-MTP | 7198 MiB | 7.6248 ± 0.05039 | baseline |

On this model, ASHQ1 beats uniform Q6_K by **0.155 PPL** while being **16.5% smaller** (−1186 MiB). The current classifier (v6) dropped the empirical depth-weighting heuristic, and the purely theoretical priority queue performs better without it.

ASHQ1 is often on par with hand-tuned SHQ quants in quality, and sometimes surpasses them. At the same time, it saves significant time and effort: just set your target size and go.

## Real-World Validation

ASHQ1's theoretical quality advantage transfers to real agentic coding.
We tested Ornith-1.0-9B ASHQ1 6500 (6.4 GB, 33% smaller than Q8_0) with [Pi](https://pi.dev/), an autonomous coding agent that uses `llama.cpp` as its LLM backend. At `temperature 0.6`, the model was tasked with building a complete personal finance dashboard as a single HTML file: Canvas charts, budget tracker, dark mode, transaction filtering, upcoming bills, responsive layout.

The agent worked autonomously: it planned the architecture, wrote the entire ~1100-line file, caught its own bugs (`date.now` → `date.getTime`), fixed dark mode logic, ran Node.js validation, and iterated until all checks passed. The final `finance-dashboard.html` was a polished, production-quality single-page app: no external dependencies, no hallucinations, no broken features.

This is not cherry-picked. It's the first test we ran. The benchmarks didn't lie: ASHQ1 preserves enough quality that a 6.4 GB quant can drive an autonomous coding agent to build complete, working applications from scratch.

## How It Works

### 1. Floor Assignment

Every tensor starts at a minimum tier determined by its class. SSM params and norms lock at F16. Embeddings start at Q5_K. Weight matrices start at Q4_K (or IQ4_XS for QAT models). MTP heads deploy at Q8_0.

With `--allow-q3-or-lower`, low-importance tensors (`ffn_down`, `attn_output`, `ssm_out`) start as low as IQ2_XXS, giving the priority queue more room to upgrade important tensors to Q8_0. Tensors missing imatrix data are kept at Q4_K to avoid garbage output at low bitrates.

### 2. Importance

The imatrix `in_sum2` values (accumulated squared input activations) are used as a measure of how much each weight contributes to the output variance. Layer position weighting was tested but showed no PPL benefit and has been removed.

### 3. Tied Group Detection

Tensors with numerically identical `in_sum2` arrays are treated as tied (shared weights). They form a single upgrade group, and all members upgrade together as one unit. Group importance is the **sum** of its members' importance, which prevents large groups from being starved of budget.

### 4. Priority Queue Drain

All possible single-tier upgrades are pushed into a max-heap:

```
utility/MiB = sum(timp[group]) × (MSE(cur) − MSE(next)) / (size(next) − size(cur))
```

MSE per tier is theoretical: `MSE = 2^(-2 × bpw)`, so each extra bit per weight cuts the modelled error by a factor of four. K-quants get +0.1 effective bpw vs IQ-quants at the same real bpw, so IQ4_NL→Q4_K is a free quality gain.

The queue pops the highest-utility upgrade, applies it, pushes the next upgrade for that group, and drains until the budget is exhausted. A final pass catches any remaining zero-cost upgrades.

## Why It Works

| Problem | ASHQ1 Solution |
|---------|---------------|
| Uniform quant wastes bits on low-importance tensors | Priority queue allocates budget where it matters |
| Heuristic hand-tuning doesn't scale | Single knob: `--size` in MiB |
| Hand-tuned SHQ hybrids need days of PPL sweeps | Queue converges in ~1 sec for any budget |
| Large tied groups starved by per-tensor logic | `sum(timp)` prevents 32× group penalty |
| IQ4_NL→Q4_K at the same bpw is a no-op | Free-upgrade pass catches zero-cost quality gains |
| No PPL-per-budget curve needed | Queue optimises for MSE directly |
| Tensors without imatrix crash at low bitrates | `has_imatrix` check falls back to Q4_K floor |

## Supported Architectures

| Arch | Detection | Features |
|------|-----------|----------|
| `qwen35` | SSM + QKV | Hybrid attention, SSM layers, GQA, **MTP support** |
| `mellum2` | MoE (`exps` tensors) | Mixture of Experts, GQA, router F16 |
| `gemma4` | Layer-scale norms | QAT support, Q4_K attention floor |

MTP (Multi-Token Prediction) heads are handled explicitly: MTP tensors deploy at Q8_0 and are excluded from the classifier's budget (their cost is subtracted from the target upfront). Tensor names with `nextn.*` or layers beyond `n_layers` are detected as MTP at runtime.

### Looking for: Qwen3.6 support

Qwen3.6 is one of the most capable local LLMs right now, but I can't handle it on my hardware.
The BF16 source is ~55 GB, and I don't have enough RAM to even load it, let alone quantize it. If you have access to a Qwen3.6 GGUF (any quantization) and can run `llama-imatrix` on it, or if you'd like to collaborate on adding architecture detection, please reach out. I can handle the integration; I just need the raw tensor names and imatrix data to map out the class system.

New architectures can be added via `ARCH_FEATURES` in `constants.py`.

## Code Structure

| File | Role |
|------|------|
| `main.py` | CLI entry point, orchestration, `--show-floors`, multiple `--imatrix` support |
| `model_reader.py` | Reads GGUF, detects architecture/prefix/n_layers/MTP at runtime |
| `imatrix_reader.py` | Parses imatrix GGUF, detects tied groups via `np.allclose(in_sum2)`, combines multiple imatrix files |
| `classifier.py` | Floor assignment → tied group building → priority queue drain → free upgrade pass |
| `config_generator.py` | Generates `--tensor-type` regex rules from classified tensors (valid ECMAScript regex with pipe-alternated ranges) |
| `quantizer.py` | Subprocess wrapper around `llama-quantize` |
| `constants.py` | TENSOR_CLASS mapping, CLASS_HARD_FLOORS, CLASS_MAX_TIER, MSE_BPW, TIER_BPW, ARCH_FEATURES |

## Usage

### Quantization

```bash
pip install -r requirements.txt

# Dry run (~1 sec)
python main.py --model model.gguf --imatrix imatrix.gguf --size 6800

# Actual quant (~10 min)
python main.py --model model.gguf --imatrix imatrix.gguf --size 6800 --run

# Show hard floors
python main.py --show-floors

# Multiple imatrix files (combined with max/mean)
python main.py --model model.gguf --imatrix i1.gguf --imatrix i2.gguf \
  --imatrix-method max --size 6800 --run

# Allow low-bit tensors (IQ2_XXS through Q8_0 spread)
python main.py --model model.gguf --imatrix imatrix.gguf --size 6000 \
  --allow-q3-or-lower --run
```

The `llama-quantize` binary path is set in `quantizer.py:6`.
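
To make the dry run less of a black box, the drain that `classifier.py` is described as performing can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the project's code: the names `drain`, `LADDER`, `mse`, and `size_mib` are hypothetical, every group starts at a Q4_K floor, only four tiers are modelled, and skipping an upgrade that no longer fits (instead of stopping outright) is a guess at the budget rule.

```python
import heapq

# Effective bits per weight for a short tier ladder (values taken from the
# Tier Reference table). For these tiers the real and effective bpw coincide.
MSE_BPW = {"Q4_K": 4.5, "Q5_K": 5.5, "Q6_K": 6.5625, "Q8_0": 8.5}
LADDER = ["Q4_K", "Q5_K", "Q6_K", "Q8_0"]


def mse(tier):
    # Theoretical error model from the text: MSE = 2^(-2 * bpw)
    return 2.0 ** (-2.0 * MSE_BPW[tier])


def size_mib(n_weights, tier):
    # bits -> bytes -> MiB
    return n_weights * MSE_BPW[tier] / 8 / 2**20


def drain(groups, budget_mib):
    """groups maps name -> (summed importance, weight count).

    Returns (name -> tier, MiB spent). Assumes every group starts at Q4_K.
    """
    tiers = {name: LADDER[0] for name in groups}
    spent = sum(size_mib(n, LADDER[0]) for _, n in groups.values())
    heap = []

    def push(name):
        imp, n = groups[name]
        idx = LADDER.index(tiers[name])
        if idx + 1 == len(LADDER):
            return  # already at the top tier
        cur, nxt = LADDER[idx], LADDER[idx + 1]
        cost = size_mib(n, nxt) - size_mib(n, cur)
        utility = imp * (mse(cur) - mse(nxt)) / cost
        # heapq is a min-heap, so negate utility to pop the largest first
        heapq.heappush(heap, (-utility, name, nxt, cost))

    for name in groups:
        push(name)
    while heap:
        _, name, nxt, cost = heapq.heappop(heap)
        if spent + cost > budget_mib:
            continue  # assumption: skip it, a cheaper upgrade may still fit
        tiers[name] = nxt
        spent += cost
        push(name)  # queue this group's next single-tier upgrade
    return tiers, spent


# Two equally sized groups (2**23 weights each) with very different importance.
groups = {"attn_k": (100.0, 2**23), "ffn_down": (1.0, 2**23)}
tiers, spent = drain(groups, budget_mib=14.5)
print(tiers, spent)
```

With these toy inputs the high-importance group climbs all the way to Q8_0 before the low-importance group receives its single upgrade to Q5_K, for 14.0 MiB in total. The real classifier additionally handles per-class floors, tied groups, and the zero-cost upgrade pass, none of which are modelled here.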
### Inference (llama-server)

Recommended server flags for serving ASHQ1 quants:

```bash
./build/bin/llama-server \
  -m model-ASHQ1.gguf \
  -c 50000 \
  --jinja \
  -fit off \
  -ngl 99 \
  --flash-attn on \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --port 8080 \
  --mmap \
  --temp 1.0 \
  --top-p 0.95 \
  --min-p 0 \
  --top-k 20 \
  --seed -1 \
  --parallel 1
```

## Tier Reference

| Tier | BPW | MSE_BPW |
|------|:---:|:-------:|
| F16 | 16.0 | 16.0 |
| Q8_0 | 8.50 | 8.50 |
| Q6_K | 6.5625 | 6.5625 |
| Q5_K | 5.50 | 5.50 |
| Q4_K | 4.50 | 4.50 |
| IQ4_NL | 4.50 | (2) |
| IQ4_XS | 4.25 | 4.25 |
| Q3_K | 3.4375 | 3.4375 |
| IQ3_M | 3.66 | n/a |
| IQ3_S | 3.44 | 3.44 |
| IQ3_XXS | 3.0625 | 3.0625 |
| IQ2_S | 2.50 | 2.50 |
| IQ2_XS | 2.3125 | 2.3125 |
| IQ2_XXS | 2.0625 | 2.0625 |
| IQ1_S | 1.5625 | 1.5625 |

> (2) IQ4_NL uses IQ4_XS MSE_BPW for the free-upgrade pass (same real bpw as Q4_K).

## Quantization Configs

Generated configs are valid `llama-quantize` arguments with ECMAScript-compatible regex patterns. Each `--tensor-type` rule matches a group of tensors that share the same target tier, with layers grouped into contiguous ranges:

- `(blk|BLK)\.(3|7|11|15|19|23|27|31)\.attn_k=Q8_0`: specific attention layers at Q8_0
- `(blk|BLK)\.((?:22|23|24|25|26))\.ffn_gate=Q6_K`: range of FFN layers at Q6_K
- `.*output_norm.*=F16`: global catch-all

Rules are sorted by specificity (specific layers, high tiers first) because `llama-quantize` uses first-match-wins.

## References

- [ASHQ1 repo](https://huggingface.co/wepiqx/ASHQ1)
- [GGUF specification](https://github.com/ggerganov/ggml/blob/master/docs/gguf.md)
- [llama.cpp](https://github.com/ggerganov/llama.cpp)
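
The first-match-wins ordering described under Quantization Configs can be checked with a small sketch. It uses Python's `re` module as a stand-in for the ECMAScript engine (these particular patterns behave the same in both), assumes search-style matching against tensor names such as `blk.3.attn_k.weight`, and adds one hypothetical broad rule, `(blk|BLK)\.\d+\.attn_k=Q5_K`, purely to show why the order matters. The `resolve` helper is illustrative and not part of `config_generator.py`.

```python
import re

# (pattern, tier) pairs in file order. The first three are the examples from
# Quantization Configs; the last is a hypothetical broad rule added here only
# to demonstrate the effect of rule ordering.
RULES = [
    (r"(blk|BLK)\.(3|7|11|15|19|23|27|31)\.attn_k", "Q8_0"),
    (r"(blk|BLK)\.((?:22|23|24|25|26))\.ffn_gate", "Q6_K"),
    (r".*output_norm.*", "F16"),
    (r"(blk|BLK)\.\d+\.attn_k", "Q5_K"),
]


def resolve(tensor_name, rules=RULES):
    """Return the tier of the first rule whose pattern occurs in the name."""
    for pattern, tier in rules:
        if re.search(pattern, tensor_name):
            return tier
    return None  # no rule matched


# Specific rule listed first: layer 3 keeps its Q8_0 assignment.
print(resolve("blk.3.attn_k.weight"))
# Layer 13 is not in the specific list, so the broad rule applies.
print(resolve("blk.13.attn_k.weight"))
# Reversing the order lets the broad rule shadow the specific one.
print(resolve("blk.3.attn_k.weight", rules=RULES[::-1]))
```

With the specific rule first, layer 3 resolves to Q8_0 and layer 13 falls through to the broad Q5_K rule; with the order reversed, layer 3 is captured by the broad rule and silently downgraded, which is the failure the specificity sort avoids.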