---
license: mit
language:
- en
pipeline_tag: text-generation
library_name: gguf
tags:
- quantization
- gguf
- llama-cpp
- imatrix
- hybrid-quantization
- selective-quantization
- priority-queue
- mse
- theoretical-optimization
- qwen3.5
- gemma4
- moe
- mtp
---

# ASHQ1 — Autonomous Selective Hybrid Quantization

> ⚠️ **Experimental.** ASHQ1 is a personal research project that I will be refining over time. Use at your own risk. Results may vary between architectures and fine-tunes. Feedback and contributions welcome.

**Latest update (v6):** The classifier has been overhauled — the empirical depth-weighting heuristic was removed after A/B testing confirmed it added zero value. Quality improved as a result. The same budget now goes further.

ASHQ1 is a post-training quantization method for GGUF models that uses an **imatrix-driven priority queue** to maximise theoretical quality per megabyte. Instead of uniform bit-depth or heuristic layer-blocking, it treats tied tensor groups as monolithic entities and greedily upgrades them by strict mathematical utility — the product of summed importance and theoretical MSE reduction, divided by size cost.

## Results

| Method | Model | Size | PPL (ctx 1024) | Δ vs Uniform |
|:-------|:------|:----:|:--------------:|:------------:|
| **ASHQ1** (v6) | Ornith-1.0-9B-MTP | 6012 MiB | **7.4697 ± 0.04862** | **−0.1551** |
| Uniform Q6_K | Ornith-1.0-9B-MTP | 7198 MiB | 7.6248 ± 0.05039 | baseline |

ASHQ1 beats uniform Q6_K by **0.155 PPL** while being **16.5% smaller** (−1186 MiB). The current classifier (v6) dropped empirical depth-weighting heuristics — the theoretical priority queue now works even better.

ASHQ1 is often on par with hand-tuned SHQ quants in quality, and sometimes surpasses them. At the same time, it saves significant time and effort — just set your target size and go.


## Real-World Validation

ASHQ1's theoretical quality advantage transfers to real agentic coding. We tested Ornith-1.0-9B ASHQ1 6500 (6.4 GB, 33% smaller than Q8_0) as the backend for [Pi](https://pi.dev/), an autonomous coding agent that uses `llama.cpp` as its LLM backend.

At `temperature 0.6`, the model was tasked with building a complete personal finance dashboard as a single HTML file — Canvas charts, budget tracker, dark mode, transaction filtering, upcoming bills, responsive layout. The agent worked autonomously: planned the architecture, wrote the entire ~1100-line file, caught its own bugs (`date.now` → `date.getTime`), fixed dark mode logic, ran Node.js validation, and iterated until all checks passed. The final `finance-dashboard.html` was a polished, production-quality single-page app — no external dependencies, no hallucinations, no broken features.

This is not cherry-picked. It's the first test we ran. The benchmarks didn't lie — ASHQ1 preserves enough quality that a 6.4 GB quant can drive an autonomous coding agent to build complete, working applications from scratch.

## How It Works

### 1. Floor Assignment

Every tensor starts at a minimum tier by class. SSM params and norms lock at F16. Embeddings start at Q5_K. Weight matrices start at Q4_K (or IQ4_XS for QAT models). MTP heads deploy at Q8_0.

With `--allow-q3-or-lower`, low-importance tensors (`ffn_down`, `attn_output`, `ssm_out`) start as low as IQ2_XXS, giving the priority queue more room to upgrade important tensors to Q8_0. Tensors missing imatrix data are kept at Q4_K to avoid garbage at low bitrates.

### 2. Importance

Imatrix `in_sum2` measures how much each weight contributes to the output variance. Layer position weighting was tested but showed no PPL benefit and has been removed.

### 3. Tied Group Detection

Tensors with numerically identical `in_sum2` arrays are tied (shared weights). They form a single upgrade group — all members upgrade together as one unit. Group importance is the **sum** of its members' importance, preventing large groups from being starved of budget.

### 4. Priority Queue Drain

All possible single-tier upgrades are pushed into a max-heap:

```
utility/MiB = sum(timp[group]) × (MSE(cur) − MSE(next)) / (size(next) − size(cur))
```

MSE per tier is theoretical: `MSE = 2^(-2 × bpw)`. K-quants get +0.1 effective bpw vs IQ-quants at the same real bpw, so IQ4_NL→Q4_K is a free quality gain.

The queue pops the highest-utility upgrade, applies it, pushes the next upgrade for that group, and drains until the budget is exhausted. A final pass catches any remaining zero-cost upgrades.

## Why It Works

| Problem | ASHQ1 Solution |
|---------|---------------|
| Uniform quant wastes bits on low-importance tensors | Priority queue allocates budget where it matters |
| Heuristic hand-tuning doesn't scale | Single knob: `--size` in MiB |
| Hand-tuned SHQ hybrids need days of PPL sweeps | Queue converges in ~1 sec for any budget |
| Large tied groups starved by per-tensor logic | `sum(timp)` prevents 32× group penalty |
| IQ4_NL→Q4_K at same bpw is a no-op | Free-upgrade pass catches zero-cost quality gains |
| No PPL-per-budget curve needed | Queue optimises for MSE directly |
| Tensors without imatrix crash at low bitrates | `has_imatrix` check falls back to Q4_K floor |

## Supported Architectures

| Arch | Detection | Features |
|------|-----------|----------|
| `qwen35` | SSM + QKV | Hybrid attention, SSM layers, GQA, **MTP support** |
| `mellum2` | MoE (`exps` tensors) | Mixture of Experts, GQA, router F16 |
| `gemma4` | Layer-scale norms | QAT support, Q4_K attention floor |

MTP (Multi-Token Prediction) heads are handled explicitly: MTP tensors deploy at Q8_0 and are excluded from the classifier's budget (their cost is subtracted from the target upfront). Tensor names with `nextn.*` or layers beyond `n_layers` are detected as MTP at runtime.

### Looking for: Qwen3.6 support

Qwen3.6 is one of the most capable local LLMs right now, but I can't handle it on my hardware. The BF16 source is ~55 GB — I don't have enough RAM to even load it, let alone quantize. If you have access to a Qwen3.6 GGUF (any quantization) and can run `llama-imatrix` on it — or if you'd like to collaborate on adding architecture detection — please reach out. I can handle the integration, I just need the raw tensor names and imatrix data to map out the class system.

New architectures can be added via `ARCH_FEATURES` in `constants.py`.

## Code Structure

| File | Role |
|------|------|
| `main.py` | CLI entry point, orchestration, `--show-floors`, multiple `--imatrix` support |
| `model_reader.py` | Reads GGUF, detects architecture/prefix/n_layers/MTP at runtime |
| `imatrix_reader.py` | Parses imatrix GGUF, detects tied groups via `np.allclose(in_sum2)`, combines multiple imatrix |
| `classifier.py` | Floor assignment → tied group building → priority queue drain → free upgrade pass |
| `config_generator.py` | Generates `--tensor-type` regex rules from classified tensors (valid ECMAScript regex with pipe-alternated ranges) |
| `quantizer.py` | Subprocess wrapper around `llama-quantize` |
| `constants.py` | TENSOR_CLASS mapping, CLASS_HARD_FLOORS, CLASS_MAX_TIER, MSE_BPW, TIER_BPW, ARCH_FEATURES |

## Usage

### Quantization

```bash
pip install -r requirements.txt

# Dry run (∼1 sec)
python main.py --model model.gguf --imatrix imatrix.gguf --size 6800

# Actual quant (∼10 min)
python main.py --model model.gguf --imatrix imatrix.gguf --size 6800 --run

# Show hard floors
python main.py --show-floors

# Multiple imatrix (combined with max/mean)
python main.py --model model.gguf --imatrix i1.gguf --imatrix i2.gguf \
  --imatrix-method max --size 6800 --run

# Allow low-bit tensors (IQ2_XXS through Q8_0 spread)
python main.py --model model.gguf --imatrix imatrix.gguf --size 6000 \
  --allow-q3-or-lower --run
```

The `llama-quantize` binary path is set in `quantizer.py:6`.

### Inference (llama-server)

Recommended server flags for serving ASHQ1 quants:

```bash
./build/bin/llama-server \
  -m model-ASHQ1.gguf \
  -c 50000 \
  --jinja \
  -fit off \
  -ngl 99 \
  --flash-attn on \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --port 8080 \
  --mmap \
  --temp 1.0 \
  --top-p 0.95 \
  --min-p 0 \
  --top-k 20 \
  --seed -1 \
  --parallel 1
```

## Tier Reference

| Tier | BPW | MSE_BPW |
|------|:---:|:-------:|
| F16 | 16.0 | 16.0 |
| Q8_0 | 8.50 | 8.50 |
| Q6_K | 6.5625 | 6.5625 |
| Q5_K | 5.50 | 5.50 |
| Q4_K | 4.50 | 4.50 |
| IQ4_NL | 4.50 | (2) |
| IQ4_XS | 4.25 | 4.25 |
| Q3_K | 3.4375 | 3.4375 |
| IQ3_M | 3.66 | — |
| IQ3_S | 3.44 | 3.44 |
| IQ3_XXS | 3.0625 | 3.0625 |
| IQ2_S | 2.50 | 2.50 |
| IQ2_XS | 2.3125 | 2.3125 |
| IQ2_XXS | 2.0625 | 2.0625 |
| IQ1_S | 1.5625 | 1.5625 |

> (2) IQ4_NL uses IQ4_XS MSE_BPW for the free-upgrade pass (same real bpw as Q4_K).

## Quantization Configs

Generated configs are valid `llama-quantize` arguments with ECMAScript-compatible regex patterns. Each `--tensor-type` rule matches a group of tensors that share the same target tier, with layers grouped into contiguous ranges:

- `(blk|BLK)\.(3|7|11|15|19|23|27|31)\.attn_k=Q8_0` — specific attention layers at Q8_0
- `(blk|BLK)\.((?:22|23|24|25|26))\.ffn_gate=Q6_K` — range of FFN layers at Q6_K
- `.*output_norm.*=F16` — global catch-all

Rules are sorted by specificity (specific layers, high tiers first) because `llama-quantize` uses first-match-wins.

## References

- [ASHQ1 repo](https://huggingface.co/wepiqx/ASHQ1)
- [GGUF specification](https://github.com/ggerganov/ggml/blob/master/docs/gguf.md)
- [llama.cpp](https://github.com/ggerganov/llama.cpp)