ASHQ1: Autonomous Selective Hybrid Quantization
ASHQ1 is a post-training quantization method for GGUF models that uses an imatrix-driven priority queue to maximize the theoretical signal-to-noise ratio per megabyte. Instead of allocating a uniform bit depth or blocking layers by heuristic, ASHQ1 treats tied tensor groups as single units and greedily upgrades them according to a closed-form utility metric.
The Breakthrough
By replacing empirical quality heuristics with a theoretical MSE-reduction estimate, ASHQ1 reaches slightly lower perplexity than uniform Q6_K in the measurement below while producing a noticeably smaller file.
| Quant | Method | Size | PPL (ctx=1024) | vs Q6_K |
|---|---|---|---|---|
| Q6_K (baseline) | Uniform | 7,008 MiB | 7.5876 ± 0.0495 | -- |
| ASHQ1 v6 | Priority queue + MSE | 5,713 MiB | 7.5411 ± 0.0487 | -0.047 |
ASHQ1, built to a 5,700 MiB target (5,713 MiB actual), scores 0.047 PPL lower than uniform Q6_K at roughly 18.5% smaller file size.
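The headline numbers can be rechecked from the table with a few lines of arithmetic. This is only a convenience sketch; the variable names are illustrative and the figures are copied from the table above.

```python
# Figures copied from the comparison table above.
q6k_size_mib, ashq1_size_mib = 7008, 5713
q6k_ppl, ashq1_ppl = 7.5876, 7.5411

# Fraction of the Q6_K file size that ASHQ1 saves.
size_saving = 1 - ashq1_size_mib / q6k_size_mib
# Negative means ASHQ1 has the lower (better) perplexity.
ppl_delta = ashq1_ppl - q6k_ppl

print(f"size saving: {size_saving:.1%}")
print(f"PPL delta:   {ppl_delta:+.4f}")
```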
Note: The model file is being uploaded now. My internet connection is very slow (~100 KB/s), so a full 5.6 GB upload takes around 20 hours. If the GGUF file is not yet available in the repo, it is still being uploaded. Please be patient.
I have put a lot of effort into developing this quantization method. ASHQ1 may not be released as open-source on GitHub due to a shadowban on my account and the difficulty of maintaining the project. This HF repo is the primary distribution channel.
How It Works
The classifier runs a single pass over a max-heap of candidate upgrades, spending a strictly defined size budget until it is exhausted.
1. Initialization (Strict Floors)
All upgradeable tensors start at a Q4_K floor. Specific architectural classes are hardcoded to prevent degradation:
- norms / ssm_params → F16
- token_embd → Q5_K
- MTP head (blk.32) → Q5_K
- Everything else → Q4_K
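The floor rules above could be sketched roughly as follows. This is not the ASHQ1 classifier itself: the name fragments in `SSM_PARAMS`, the `floor_tier` helper, and the choice to let the MTP block take precedence over the norm rule are assumptions, the last one inferred from the reproducibility command where blk.32 norms are listed as Q5_K.

```python
import re

# Hypothetical name fragments; the real matching rules are not shown in this card.
SSM_PARAMS = ("ssm_a", "ssm_dt", "ssm_conv1d")
MTP_BLOCK = 32  # block index of the MTP head, as stated above


def floor_tier(name: str) -> str:
    """Return the starting tier for a GGUF tensor name such as 'blk.3.attn_q.weight'."""
    block = re.match(r"blk\.(\d+)\.", name)
    # Assumption: the MTP head rule wins over the norm rule.
    if block and int(block.group(1)) == MTP_BLOCK:
        return "Q5_K"
    parts = name.split(".")
    # Norms and small SSM parameters stay at F16.
    if any(p.endswith("norm") or p in SSM_PARAMS for p in parts):
        return "F16"
    if parts[0] == "token_embd":
        return "Q5_K"
    # Everything else starts at the Q4_K floor and may be upgraded later.
    return "Q4_K"


for tensor in ("blk.3.attn_norm.weight", "token_embd.weight",
               "blk.32.attn_q.weight", "blk.10.ffn_down.weight"):
    print(tensor, "->", floor_tier(tensor))
```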
Note: The Q4_K floor is critical. Earlier iterations starting at IQ4_XS suffered PPL stagnation because non-linear 4-bit blocks cause disproportionately high activation noise in deep layers. The strict Q4_K floor eliminates this entirely.
2. Importance Weighting
Tensor importance is derived from the imatrix, scaled by architectural depth:
timp[t] = imp[t] × depth_factor(layer)
- First layer (0): 2.0x
- Last 5 layers: 1.5x
- Middle layers: 1.0x
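A minimal sketch of the depth weighting, assuming the layer count is passed in explicitly. The `n_layers` parameter and the `weighted_importance` helper are illustrative; whether the MTP block counts towards the "last 5 layers" is not stated here, so the example uses 32 regular blocks.

```python
def depth_factor(layer: int, n_layers: int) -> float:
    """Multiplier applied to imatrix importance, by layer position."""
    if layer == 0:
        return 2.0          # first layer
    if layer >= n_layers - 5:
        return 1.5          # last 5 layers
    return 1.0              # middle layers


def weighted_importance(imp: float, layer: int, n_layers: int) -> float:
    """timp[t] = imp[t] * depth_factor(layer)."""
    return imp * depth_factor(layer, n_layers)


# Assumed example: 32 blocks numbered 0..31.
print([depth_factor(i, 32) for i in (0, 1, 26, 27, 31)])
```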
3. Tied Group Aggregation
Numerically identical tensors (e.g., ffn_gate = ffn_up) are detected and treated as single monolithic entities in the queue. Their importance is summed (sum(timp)); since the size cost of upgrading the group is summed as well, utility per megabyte stays comparable across groups of different sizes.
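One way to detect numerically identical tensors is to hash their raw bytes and group names that share a digest. The sketch below does that with the standard library only; `tied_groups` and `group_importance` are hypothetical helpers, and how ASHQ1 actually compares tensors is not described in this card.

```python
import hashlib
from collections import defaultdict


def tied_groups(tensors: dict[str, bytes]) -> list[list[str]]:
    """Group tensor names whose raw data is byte-for-byte identical."""
    by_digest = defaultdict(list)
    for name, raw in tensors.items():
        by_digest[hashlib.sha256(raw).hexdigest()].append(name)
    return [sorted(names) for names in by_digest.values()]


def group_importance(group: list[str], timp: dict[str, float]) -> float:
    """Summed depth-weighted importance of every tensor in the group."""
    return sum(timp[name] for name in group)


# Toy data: gate and up share the same bytes, down does not.
tensors = {"blk.0.ffn_gate": b"\x01\x02", "blk.0.ffn_up": b"\x01\x02",
           "blk.0.ffn_down": b"\x03"}
timp = {"blk.0.ffn_gate": 1.5, "blk.0.ffn_up": 2.5, "blk.0.ffn_down": 1.0}
groups = tied_groups(tensors)
print(groups, [group_importance(g, timp) for g in groups])
```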
4. The Priority Queue
All possible single-step upgrades are pushed into a max-heap. The utility metric is defined as:
utility_per_mb = sum(timp[group]) × ΔMSE / cost_delta
Where the theoretical MSE reduction is:
ΔMSE = 2^(-2 × bpw_cur) - 2^(-2 × bpw_next)
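The 2^(-2 × bpw) term matches the textbook model of a uniform quantizer, where each extra bit halves the step size and therefore quarters the noise power. As a worked example, using the MSE_BPW values from the calibration table in this card (Q4_K = 4.50, Q5_K = 5.50, Q6_K = 6.5625):

```latex
\begin{aligned}
\Delta\mathrm{MSE}(\mathrm{Q4\_K} \to \mathrm{Q5\_K})
  &= 2^{-2 \times 4.50} - 2^{-2 \times 5.50}
   = 2^{-9} - 2^{-11}
   = \tfrac{3}{2048} \approx 1.46 \times 10^{-3} \\
\Delta\mathrm{MSE}(\mathrm{Q5\_K} \to \mathrm{Q6\_K})
  &= 2^{-11} - 2^{-13.125}
   \approx 3.76 \times 10^{-4}
\end{aligned}
```

Under this model the first step above the floor removes roughly four times as much modelled noise as the next one, so at comparable cost and importance the queue would be expected to lift many groups by one step before lifting any of them by two.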
The queue drains by popping the highest utility-per-MB upgrade, applying it, and pushing the next possible upgrade for that group until the target size budget is exhausted. Zero-cost upgrades are assigned inf priority to ensure they always apply.
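The drain loop can be sketched with Python's `heapq`, which is a min-heap, so utilities are pushed negated. This is an illustration under stated assumptions, not the ASHQ1 source: the `LADDER` upgrade order, the use of MSE_BPW as the storage bits-per-weight, the group dictionaries, and the decision to skip an upgrade that does not fit (rather than stop) are all choices made for the example.

```python
import heapq
import math

# Subset of the MSE_BPW table; assumed here to equal storage bits-per-weight.
MSE_BPW = {"Q4_K": 4.50, "Q5_K": 5.50, "Q6_K": 6.5625, "Q8_0": 8.50}
LADDER = ["Q4_K", "Q5_K", "Q6_K", "Q8_0"]  # assumed single-step upgrade order


def size_mib(n_weights: int, tier: str) -> float:
    return n_weights * MSE_BPW[tier] / 8 / 2**20


def utility(group: dict, cur: str, nxt: str) -> tuple[float, float]:
    """Return (utility_per_mb, cost_delta) for one upgrade step."""
    d_mse = 2 ** (-2 * MSE_BPW[cur]) - 2 ** (-2 * MSE_BPW[nxt])
    cost = size_mib(group["n_weights"], nxt) - size_mib(group["n_weights"], cur)
    # Zero-cost upgrades get infinite priority so they always apply.
    u = math.inf if cost <= 0 else group["timp"] * d_mse / cost
    return u, cost


def drain(groups: dict[str, dict], budget_mib: float):
    tiers = {name: LADDER[0] for name in groups}
    used = sum(size_mib(g["n_weights"], LADDER[0]) for g in groups.values())
    heap = []

    def push_next(name: str) -> None:
        i = LADDER.index(tiers[name])
        if i + 1 < len(LADDER):
            u, cost = utility(groups[name], LADDER[i], LADDER[i + 1])
            heapq.heappush(heap, (-u, name, LADDER[i + 1], cost))

    for name in groups:
        push_next(name)
    while heap:
        _, name, nxt, cost = heapq.heappop(heap)
        if used + cost > budget_mib:
            continue  # assumption: skip what does not fit, keep trying others
        tiers[name], used = nxt, used + cost
        push_next(name)  # queue the following step for this group
    return tiers, used


# Two equally sized groups; "a" is ten times more important than "b".
groups = {"a": {"timp": 10.0, "n_weights": 8 * 2**20},
          "b": {"timp": 1.0, "n_weights": 8 * 2**20}}
tiers, used = drain(groups, budget_mib=12.0)
print(tiers, used)
```

With a 12 MiB budget the important group is upgraded twice while the other stays at the floor, because a second step for "a" still has higher utility per MiB than a first step for "b".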
MSE_BPW Calibration
This table lists the effective bits-per-weight used in the MSE calculation. Note that IQ4_XS is empirically lowered to 4.00 from its theoretical 4.25 to reflect its actual noise profile in deep transformers.
| Tier | MSE_BPW |
|---|---|
| F16 | 16.0 |
| Q8_0 | 8.50 |
| Q6_K | 6.5625 |
| Q5_K | 5.50 |
| Q4_K | 4.50 |
| IQ4_NL | 4.40 |
| IQ4_XS | 4.00 (empirically corrected) |
| Q3_K | 3.4375 |
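To see what the IQ4_XS correction does inside the utility formula, the table can be written as a lookup and fed through the 2^(-2 × bpw) noise model. The `modelled_mse` helper is an illustrative name; the values are copied from the table above.

```python
# Effective bits-per-weight, copied from the table above.
MSE_BPW = {
    "F16": 16.0, "Q8_0": 8.50, "Q6_K": 6.5625, "Q5_K": 5.50,
    "Q4_K": 4.50, "IQ4_NL": 4.40, "IQ4_XS": 4.00, "Q3_K": 3.4375,
}


def modelled_mse(bpw: float) -> float:
    """Noise term used by the utility metric: 2^(-2 * bpw)."""
    return 2.0 ** (-2.0 * bpw)


# Ratio of modelled noise at the corrected 4.00 versus the theoretical 4.25.
penalty = modelled_mse(MSE_BPW["IQ4_XS"]) / modelled_mse(4.25)
print(f"IQ4_XS modelled noise is scaled by {penalty:.3f}")

# Modelled noise removed by moving from IQ4_XS to Q4_K under this table.
gain = modelled_mse(MSE_BPW["IQ4_XS"]) - modelled_mse(MSE_BPW["Q4_K"])
print(f"IQ4_XS -> Q4_K delta MSE: {gain:.6f}")
```

Lowering the value by 0.25 bpw multiplies the modelled noise by the square root of 2, which makes any upgrade away from IQ4_XS look more valuable to the queue.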
This Quant
| Property | Value |
|---|---|
| File | Qwythos-9B-Claude-Mythos-5-1M-MTP-ASHQ1-Q5_K_M.gguf |
| Size | 5,713 MiB (5.6 GiB) |
| Target | 5,700 MiB |
| Overshoot vs target | +13 MiB (GGUF overhead) |
| Base type | Q5_K_M |
| PPL | 7.5411 ± 0.04865 |
| MTP head tier | Q5_K |
| Tier distribution | Q5_K=68, Q6_K=97, Q4_K=100, F16=177 |
Speed (GTX 1070 + MTP Speculation)
| Mode | Tokens/sec |
|---|---|
| MTP speculation | ~34 t/s |
Note: At 5700 MiB, the budget is too tight to allocate Q8_0 to attention tensors. The MSE queue trades inference speed for the lowest achievable PPL at this compression level. At larger budgets (6800 MiB+), the queue upgrades attention to Q8_0 on its own, which improves decoding speed without sacrificing PPL.
Usage
MTP Speculative Decoding
llama-cli \
-m Qwythos-9B-Claude-Mythos-5-1M-MTP-ASHQ1-Q5_K_M.gguf \
--spec-type draft-mtp --spec-draft-n-max 2 \
-p "Your prompt" \
-ngl 99 --flash-attn on \
-c 4096
Recommended Sampling
temperature 0.6, top_k 20, top_p 0.95, min_p 0. If the output starts looping, set repeat_penalty to 1.05.
Reproducibility
Full llama-quantize command generated by the ASHQ1 classifier:
/home/maxyag27/llm-tools/llama.cpp/build/bin/llama-quantize \
--imatrix /mnt/everything/qwen/source/Qwable-9B-Claude-Fable-5.imatrix.gguf \
--output-tensor-type Q5_K \
--token-embedding-type Q5_K \
--tensor-type "(blk|BLK)\.(32)\.nextn=Q5_K" \
--tensor-type "(blk|BLK)\.(31)\.attn_output=Q6_K" \
--tensor-type "(blk|BLK)\.(0|(?:9|10)|(?:12|13|14)|(?:16|17|18)|(?:20|21|22)|(?:24|25|26)|(?:28|29|30))\.ssm_beta=Q6_K" \
--tensor-type "(blk|BLK)\.(0|(?:9|10)|(?:12|13|14)|(?:16|17|18)|(?:20|21|22)|(?:24|25|26)|(?:28|29|30))\.ssm_alpha=Q6_K" \
--tensor-type "(blk|BLK)\.(0|(?:9|10)|(?:12|13|14)|(?:16|17|18)|(?:20|21|22)|(?:24|25|26)|(?:28|29|30))\.attn_qkv=Q6_K" \
--tensor-type "(blk|BLK)\.(0|(?:9|10)|(?:12|13|14)|(?:16|17|18)|(?:20|21|22)|(?:24|25|26)|(?:28|29|30))\.attn_gate=Q6_K" \
--tensor-type "(blk|BLK)\.(3|7|11|15|19|23|27|31)\.attn_k=Q6_K" \
--tensor-type "(blk|BLK)\.(3|7|11|15|19|23|27|31)\.attn_q=Q6_K" \
--tensor-type "(blk|BLK)\.(3|7|11|15|19|23|27|31)\.attn_v=Q6_K" \
--tensor-type "(blk|BLK)\.(32)\.attn_k=Q5_K" \
--tensor-type "(blk|BLK)\.(32)\.post_attention_norm=Q5_K" \
--tensor-type "(blk|BLK)\.(32)\.attn_v=Q5_K" \
--tensor-type "(blk|BLK)\.(32)\.attn_k_norm=Q5_K" \
--tensor-type "(blk|BLK)\.(32)\.attn_q_norm=Q5_K" \
--tensor-type "(blk|BLK)\.(32)\.attn_norm=Q5_K" \
--tensor-type "(blk|BLK)\.(32)\.attn_q=Q5_K" \
--tensor-type "(blk|BLK)\.((?:31|32))\.ffn_down=Q5_K" \
--tensor-type "(blk|BLK)\.((?:0|1|2|3|4|5|6|7|8|9|10|11|12|13|14|15|16|17|18|19|20|21|22|23|24|25|26|27|28|29|30))\.ffn_down=Q4_K" \
--tensor-type "(blk|BLK)\.((?:0|1|2|3|4|5|6|7|8|9|10|11|12|13|14|15|16|17|18|19|20|21|22|23|24|25|26|27|28|29|30|31))\.post_attention_norm=F16" \
--tensor-type "(blk|BLK)\.([0-2]|[4-6]|(?:8|9|10)|(?:12|13|14)|(?:16|17|18)|(?:20|21|22)|(?:24|25|26)|(?:28|29|30))\.ssm_norm=F16" \
--tensor-type "(blk|BLK)\.([0-2]|[4-6]|(?:8|9|10)|(?:12|13|14)|(?:16|17|18)|(?:20|21|22)|(?:24|25|26)|(?:28|29|30))\.ssm_a=F16" \
--tensor-type "(blk|BLK)\.((?:0|1|2|3|4|5|6|7|8|9|10|11|12|13|14|15|16|17|18|19|20|21|22|23|24|25|26|27|28|29|30|31))\.attn_norm=F16" \
--tensor-type "(blk|BLK)\.([0-2]|[4-6]|(?:8|9|10)|(?:12|13|14)|(?:16|17|18)|(?:20|21|22)|(?:24|25|26)|(?:28|29|30))\.ssm_dt=F16" \
--tensor-type "(blk|BLK)\.([0-2]|[4-6]|(?:8|9|10)|(?:12|13|14)|(?:16|17|18)|(?:20|21|22)|(?:24|25|26)|(?:28|29|30))\.ssm_conv1d=F16" \
--tensor-type "(blk|BLK)\.(3|7|11|15|19|23|27|31)\.attn_k_norm=F16" \
--tensor-type "(blk|BLK)\.(3|7|11|15|19|23|27|31)\.attn_q_norm=F16" \
--tensor-type "(blk|BLK)\.((?:21|22|23|24|25|26|27|28|29|30|31|32))\.ffn_up=Q5_K" \
--tensor-type "(blk|BLK)\.(27|32)\.attn_output=Q5_K" \
--tensor-type "(blk|BLK)\.((?:21|22|23|24|25|26|27|28|29|30|31|32))\.ffn_gate=Q5_K" \
--tensor-type "(blk|BLK)\.([1-2]|[4-6]|8)\.attn_gate=Q5_K" \
--tensor-type "(blk|BLK)\.([1-2]|[4-6]|8)\.ssm_alpha=Q5_K" \
--tensor-type "(blk|BLK)\.((?:28|29|30))\.ssm_out=Q5_K" \
--tensor-type "(blk|BLK)\.([1-2]|[4-6]|8)\.attn_qkv=Q5_K" \
--tensor-type "(blk|BLK)\.([1-2]|[4-6]|8)\.ssm_beta=Q5_K" \
--tensor-type "(blk|BLK)\.((?:0|1|2|3|4|5|6|7|8|9|10|11|12|13|14|15|16|17|18|19|20))\.ffn_up=Q4_K" \
--tensor-type "(blk|BLK)\.((?:0|1|2|3|4|5|6|7|8|9|10|11|12|13|14|15|16|17|18|19|20))\.ffn_gate=Q4_K" \
--tensor-type "(blk|BLK)\.([0-2]|[4-6]|(?:8|9|10)|(?:12|13|14)|(?:16|17|18)|(?:20|21|22)|(?:24|25|26))\.ssm_out=Q4_K" \
--tensor-type "(blk|BLK)\.(3|7|11|15|19|23)\.attn_output=Q4_K" \
--tensor-type ".*output_norm.*=F16" \
/mnt/everything/qwen/source/Qwythos-9B-Claude-Mythos-5-1M-MTP-BF16.gguf \
Qwythos-9B-Claude-Mythos-5-1M-MTP-ASHQ1-Q5_K_M.gguf
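Because several of the patterns above share a prefix (for example attn_k and attn_k_norm), it can help to check which rules hit a given tensor name before running the full quantization. The sketch below uses Python's `re.search` on three patterns copied from the command. It assumes the patterns behave as unanchored regular-expression searches against tensor names, and it makes no claim about which rule llama-quantize applies when more than one matches. The `matching_rules` helper is hypothetical.

```python
import re

# Three rules copied from the command above, in the same order.
RULES = [
    (r"(blk|BLK)\.(3|7|11|15|19|23|27|31)\.attn_k", "Q6_K"),
    (r"(blk|BLK)\.(32)\.attn_k", "Q5_K"),
    (r"(blk|BLK)\.(3|7|11|15|19|23|27|31)\.attn_k_norm", "F16"),
]


def matching_rules(name: str) -> list[str]:
    """Return the tier of every rule whose pattern is found in the tensor name."""
    return [tier for pattern, tier in RULES if re.search(pattern, name)]


for tensor in ("blk.3.attn_k.weight", "blk.3.attn_k_norm.weight",
               "blk.32.attn_k.weight", "blk.13.attn_k.weight"):
    print(tensor, "->", matching_rules(tensor))
```

The attn_k_norm tensor is hit by both the attn_k rule and the attn_k_norm rule, so the resulting tier depends on how the tool resolves overlapping matches.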
All Results (Qwen3.5-9B fine-tunes)
| Target (MiB) | Model | MTP | Actual size | PPL |
|---|---|---|---|---|
| 5500 | Qwable | -- | 5,503 MiB | 7.4334 |
| 5700 | Qwythos | yes | 5,713 MiB | 7.5411 |
References
- Original Qwythos MTP: https://huggingface.co/empero-ai/Qwythos-9B-Claude-Mythos-5-1M
- ASHQ1 source: https://huggingface.co/wepiqx/ASHQ1