Mixed Precision GGUF layer quantization of Qwen3.5-9B by Qwen
Original model: https://huggingface.co/Qwen/Qwen3.5-9B
The mixed precision quant employs different quantization levels on a per-layer basis to achieve both high performance and a small file size. All the quants employed are K-quants, avoiding the slow CPU and older-GPU processing of IQ quants. For this file the layer quants are as follows:
Q4_K_L : Q4_K_M + attn_o = q6_k
Q5_K_L : attn_v = q8_0 attn_o = q6_k ffn_d = q6_k
Q6_K_S : Q6_K
LAYER_TYPES='[
[0 ,"Q5_K_M"], [1 ,"Q5_K_S"], [2 ,"Q4_K_L"], [3 ,"Q4_K_M"], [4 ,"Q4_K_S"], [5 ,"Q4_K_M"], [6 ,"Q4_K_S"], [7 ,"Q4_K_M"],
[8 ,"Q4_K_S"], [9 ,"Q4_K_S"], [10,"Q4_K_S"], [11,"Q4_K_S"], [12,"Q4_K_M"], [13,"Q4_K_S"], [14,"Q4_K_M"], [15,"Q4_K_S"],
[16,"Q4_K_M"], [17,"Q4_K_S"], [18,"Q4_K_M"], [19,"Q4_K_M"], [20,"Q4_K_M"], [21,"Q4_K_M"], [22,"Q4_K_M"], [23,"Q4_K_M"],
[24,"Q4_K_M"], [25,"Q4_K_M"], [26,"Q4_K_M"], [27,"Q4_K_L"], [28,"Q5_K_S"], [29,"Q5_K_M"], [30,"Q5_K_L"], [31,"Q6_K_S"]
]'
FLAGS="--token-embedding-type Q6_K --output-tensor-type Q6_K --layer-types-high"
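As a sketch, the recipe above might be applied with a command along these lines; note that the `LAYER_TYPES` environment variable and the `--layer-types-high` flag are assumed to come from a llama-quantize build patched for per-layer quant overrides (they are not mainline llama.cpp options), and the file names are placeholders:

```shell
# Hypothetical invocation of a patched llama-quantize that honors the
# LAYER_TYPES environment variable and the --layer-types-high flag
# (neither is in mainline llama.cpp). Model file names are placeholders.
export LAYER_TYPES      # the per-layer recipe defined above
./llama-quantize --token-embedding-type Q6_K --output-tensor-type Q6_K \
    --layer-types-high \
    Qwen3.5-9B.BF16.gguf Qwen3.5-9B.Q4_K_H.gguf Q4_K_M
```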
The layer quants were optimized for very strong performance across a set of curated reasoning prompts. The final quant is about 0.6e9 B larger than Q4_K_M.
Comparison:
| Quant | Size/e9 B | PPL | Comment |
|---|---|---|---|
| Q4_K_M | 5.6 | 7.7 | Q4_K_M with default embedding and output |
| Q4_K_H | 6.1 | 7.8 | Mixed precision quant with Q6_K embedding and Q6_K output |
Usage:
Qwen3.5-9B is a vision-capable dense RL model. Used together with its multimodal projector (mmproj) layers, it can process image and text inputs and generate text outputs. The mmproj file is made available in this repository.
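As a hedged sketch of vision usage, recent llama.cpp builds ship a multimodal CLI that takes the model and mmproj files together (the image path and prompt here are placeholders):

```shell
# Image + text inference with llama-mtmd-cli from recent llama.cpp builds;
# model and mmproj file names match this repo, the image is a placeholder.
./llama-mtmd-cli -m Qwen3.5-9B.Q4_K_H.gguf \
    --mmproj Qwen3.5-9B.mmproj.gguf \
    --image bird.jpg \
    -p "Identify the bird species in this image."
```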
Speculative decoding does not work with this model due to the attention scheme it uses. On a 4070 with all layers and context in VRAM and no vision tower, approximate performance is:
| Quant | KV type | NKV | gen tps |
|---|---|---|---|
| Q4_K_H | F16 | 166k+ | 63 |
| Q4_K_H | Q8_0 | 280k+ | 64 |
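The Q8_0 row above corresponds to serving with a quantized KV cache; a minimal launch sketch, assuming a recent llama-server build (context size here is an example, and flag spellings vary by version):

```shell
# Hypothetical serving sketch matching the Q8_0 KV row: all layers
# offloaded (-ngl 99), quantized KV cache (-ctk/-ctv). Quantized KV
# requires flash attention; the flag spelling (-fa / --flash-attn)
# varies across llama.cpp versions.
./llama-server -m Qwen3.5-9B.Q4_K_H.gguf -ngl 99 -c 262144 \
    -ctk q8_0 -ctv q8_0 -fa
```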
A long context (needle in a haystack) test was run and passed with fast prompt processing, making large context actually usable with the model.
The model appears to be trained to decide for itself whether to emit a think block. When it does think, it falls into very heavy overthinking but does come up with accurate answers. Over a small set of eval prompts the model did extremely well. To avoid the overthinking, inject the think start and think stop tokens immediately after the assistant prompt:
THINK_START="<think>\n"
THINK_STOP="\n</think>\n\n"
If the model decides not to think on a given prompt it emits this pattern automatically. To force the model into a think block, inject a bootstrap think block following the assistant prompt:
"<think>\nHere's a thinking process to solve the problem:"
The model was found to be highly capable on reasoning tasks when skipping the think block: zero overthinking, just accurate direct deductions to final solutions. When thinking with greedy sampling, the model went into an infinite repetition loop on one of the test prompts (a somewhat tricky question it had trouble resolving) but did well on all the remaining prompts. This is similar behaviour to other Qwen3 thinkers, which are prone to infinite repetition under greedy sampling, particularly at smaller sizes (<10B params) and quant levels.
The model was tested in vision mode on a couple of fairly tough bird ID images and did well, with very detailed think blocks and accurate final conclusions.
The model was tested across a small set of code gen prompts and was intermittent in its ability to generate working code; it also went into infinite repetition on two of the code prompts (ones where it decided to use a think block) when using greedy sampling.
The minimum llama.cpp version to run Qwen3.5-9B is b8148, which corrects a graph error that causes crashes in both RPC and multi-GPU local setups. If the model is split over multiple GPUs it will probably still crash due to an unresolved problem with running Qwen3 next based models over multiple GPUs: https://github.com/ggml-org/llama.cpp/issues/19892
Benchmarks:
A full set of both math and vision benchmarks for the model will eventually be given here: https://huggingface.co/spaces/steampunque/benchlm
Download the files from the links below:
| Link | Type | Size/e9 B | Notes |
|---|---|---|---|
| Qwen3.5-9B.Q4_K_H.gguf | Q4_K_H | 6.1 | 0.6e9 B larger than Q4_K_M |
| Qwen3.5-9B.mmproj.gguf | F16 | 0.92 | multimodal projector |
A discussion thread about the hybrid layer quant approach can be found here on the llama.cpp git repository: