| --- |
| base_model: Qwen/Qwen3.6-27B |
| license: other |
| license_name: qwen |
| pipeline_tag: text-generation |
| tags: |
| - rfp458 |
| - 4-bit |
| - quantized |
| - vllm |
| - rocm |
| - rdna4 |
| - qwen3.6 |
| --- |
| |
| # Qwen3.6-27B-RFP458 (4.5 bpw) |
|
|
| A 4.5-bit-per-weight RFP458 quantization of [Qwen/Qwen3.6-27B](https://huggingface.co/Qwen/Qwen3.6-27B), a hybrid vision-language model (Gated-DeltaNet linear-attention + full-attention layers, with a vision tower and MTP head). |
|
|
| ## Summary |
|
|
| - **Format:** RFP458 (`rfp458-pack-quantized`). Weights are encoded with the iq4_nl non-uniform 4-bit codebook in groups of 16, scaled by a signed-int8 block mantissa plus a per-channel int8 exponent, with hadamard16 weight rotation applied. |
| - **Size:** ~20.5 GB (vs ~27 GB for the FP8 build); ~9.4 GiB per card on a 2x 32 GB setup. |
| - **What is quantized:** the MLP linears, the GDN `in_proj_qkv` / `in_proj_z` / `out_proj`, and the full self-attention q/k/v/o projections. Embeddings, lm_head, the GDN gating projections (`in_proj_a` / `in_proj_b`), conv1d, norms, and the entire vision tower are kept in bf16. |
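
The 4.5 bpw figure in the title can be reproduced from the layout in the Format bullet. The sketch below is only an accounting exercise, assuming one int8 block mantissa is stored per group of 16 weights and one int8 exponent per output channel; the row width of 4096 is an illustrative value, not a dimension taken from this model.

```python
# Storage accounting for the RFP458 layout described in the Summary.
# Assumptions (not confirmed by the model card):
#   - one signed-int8 block mantissa is stored per group of 16 weights
#   - one int8 exponent is stored per output channel (row)
GROUP_SIZE = 16        # weights per group (from the Summary)
CODE_BITS = 4          # one iq4_nl codebook index per weight
BLOCK_SCALE_BITS = 8   # assumed: one int8 block mantissa per group


def bits_per_weight(group_size=GROUP_SIZE, code_bits=CODE_BITS,
                    scale_bits=BLOCK_SCALE_BITS):
    """Codebook index bits plus the group scale amortized over the group."""
    return code_bits + scale_bits / group_size


def channel_overhead_bits(in_features, exponent_bits=8):
    """Per-weight cost of one exponent shared by a whole row of weights."""
    return exponent_bits / in_features


bpw = bits_per_weight()
overhead = channel_overhead_bits(4096)  # 4096 is an illustrative row width
print(f"group-level bpw: {bpw}")
print(f"per-channel exponent overhead: {overhead:.6f} bits/weight")
```

Under these assumptions the per-channel exponent adds well under 0.01 bits per weight, so the group-level term dominates. The on-disk size is larger than 4.5 bpw over all parameters would suggest because the layers listed above stay in bf16.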
|
|
| ## Quality |
|
|
| WikiText-2 perplexity (llama.cpp-compatible, n_ctx 2048, full test set): |
| |
| | Build | Size | PPL | |
| |---|---|---| |
| | **RFP458 (this model)** | 20.5 GB | **6.936** | |
| | FP8 (RedHatAI/Qwen3.6-27B-FP8) | ~27 GB | 7.071 | |
| |
| RFP458 scores slightly lower (better) perplexity than the FP8 build, 6.936 vs 7.071, with a weight footprint roughly 24 percent smaller (20.5 GB vs ~27 GB). |
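
For reference, the relative differences can be computed directly from the table. This is a small arithmetic sketch using only the sizes and perplexities reported above; the FP8 size is treated as exactly 27 GB even though the table gives it as approximate.

```python
# Relative size and perplexity differences, using the table values above.
rfp458_gb, fp8_gb = 20.5, 27.0      # FP8 size is approximate in the table
rfp458_ppl, fp8_ppl = 6.936, 7.071  # WikiText-2, n_ctx 2048

size_reduction = 1 - rfp458_gb / fp8_gb        # fraction of footprint saved
ppl_change = (rfp458_ppl - fp8_ppl) / fp8_ppl  # negative means lower perplexity

print(f"size reduction: {size_reduction:.1%}")
print(f"perplexity change: {ppl_change:+.2%}")
```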
| |
| ## Serving |
| |
| Built for and validated on a vLLM build with native RFP458 dequant kernels on AMD RDNA4 (gfx1201, Radeon AI PRO R9700) at tensor-parallel size 2. The smaller weights free enough VRAM to serve the full 262K context with a large KV pool. 4-bit-class formats carry a dequant cost, so single-stream decode runs at roughly half the speed of the FP8 build on the same hardware. Choose this build for capacity, footprint, and quality rather than for speed. |
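
A minimal sketch of loading the model through vLLM's Python API with the settings described above. It assumes a vLLM build that already includes the RFP458 kernels (a stock build may not recognize the format). `MODEL_PATH` is a hypothetical placeholder, and reading "262K" as 262144 tokens is an assumption.

```python
# Hypothetical placeholder: point this at a local copy or repo id of the model.
MODEL_PATH = "path/to/Qwen3.6-27B-RFP458"


def engine_kwargs(model=MODEL_PATH):
    """Engine settings mirroring the validated setup described above."""
    return {
        "model": model,
        "tensor_parallel_size": 2,  # two GPUs, as in the validated setup
        "max_model_len": 262144,    # assumed reading of "262K context"
    }


def load_engine():
    """Create the engine; needs an RFP458-capable vLLM build and two GPUs."""
    from vllm import LLM  # imported lazily so the settings can be inspected without vLLM
    return LLM(**engine_kwargs())
```

Calling `load_engine()` returns an `LLM` object whose `generate` method accepts a list of prompts. Lowering `max_model_len` leaves more room for concurrent sequences if the full context is not needed.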
| |
| ## License |
| |
| Inherits the license of the base model, Qwen/Qwen3.6-27B. See the base model card for terms. |
| |
| This is a community quantization and is not affiliated with the original model authors. |
| |