Vishva007/Qwen3.5-9B-W4A16-AutoRound

This is a W4A16 (4-bit weight, 16-bit activation) quantized version of Qwen/Qwen3.5-9B, produced with AutoRound, Intel's sign-gradient-descent-based quantization method designed for production-grade accuracy retention. The base model's Multi-Token Prediction (MTP) capability is preserved in the quantized model.

Quantization Details

| Parameter           | Value             |
|---------------------|-------------------|
| Method              | AutoRound (W4A16) |
| Group Size          | 128               |
| Symmetric           | Yes               |
| Iterations          | 800               |
| Calibration Samples | 512               |
| Sequence Length     | 2048              |
| Torch Compile       | Enabled           |
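For reproducibility, the settings above map onto the auto-round Python API roughly as follows. This is a sketch, not the author's exact script: parameter names follow the auto-round package, the `enable_torch_compile` flag and the output directory name are assumptions made for illustration.

```python
# Hypothetical recipe approximating the settings in the table above.
QUANT_CONFIG = dict(
    bits=4,           # W4: 4-bit weights
    group_size=128,
    sym=True,         # symmetric quantization
    iters=800,
    nsamples=512,     # calibration samples
    seqlen=2048,
)

def quantize(model_name="Qwen/Qwen3.5-9B",
             output_dir="Qwen3.5-9B-W4A16-AutoRound"):
    # Heavy imports kept local so the recipe can be read without
    # auto-round / transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from auto_round import AutoRound

    model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    # enable_torch_compile=True mirrors "Torch Compile: Enabled" above
    # (assumed flag name; check your auto-round version).
    ar = AutoRound(model, tokenizer, enable_torch_compile=True, **QUANT_CONFIG)
    ar.quantize()
    ar.save_quantized(output_dir, format="auto_gptq")
```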

Key Notes

  • High accuracy configuration — 800 iterations with 512 calibration samples targets production-grade quality with minimal degradation from the base model.
  • W4A16 — Weights are quantized to 4-bit integers; activations remain in FP16 for inference stability.
  • ~4× smaller weights than the FP16 base model (roughly 70–75% less weight memory, before KV cache and activation overhead), enabling deployment on consumer and mid-range GPUs.
  • MTP (Multi-Token Prediction) Enabled — Supports speculative decoding for faster inference.
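The memory savings can be sanity-checked with back-of-envelope arithmetic. The sketch below estimates weight memory only; it ignores KV cache, activations, zero points, and the fact that embeddings and the LM head are often kept in 16-bit, so real checkpoints run somewhat larger.

```python
# Back-of-envelope weight-memory estimate for a 9B-parameter model.
PARAMS = 9e9
GROUP_SIZE = 128  # one FP16 scale per 128 weights, as in the table above

fp16_gb = PARAMS * 2 / 1e9                                # 2 bytes per weight
w4_gb = (PARAMS * 0.5 + (PARAMS / GROUP_SIZE) * 2) / 1e9  # 4-bit weights + scales

reduction = 1 - w4_gb / fp16_gb
print(f"FP16 weights:  {fp16_gb:.1f} GB")
print(f"W4A16 weights: {w4_gb:.1f} GB ({reduction:.0%} smaller)")
# → FP16 weights: 18.0 GB; W4A16 weights: 4.6 GB (74% smaller)
```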

MTP / Speculative Decoding

This model supports Multi-Token Prediction (MTP) for improved inference throughput using speculative decoding.

When serving with compatible backends (e.g., vLLM), enable MTP using:

```
--speculative_config '{"method":"mtp","num_speculative_tokens":1}'
```

Notes

  • num_speculative_tokens=1 is a stable default for balancing speed and accuracy.
  • You can experiment with higher values for better throughput, depending on your hardware and latency requirements.
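As a concrete starting point, a full vLLM launch might look like the following. This is a sketch: the `vllm serve` subcommand and `--speculative-config` flag match recent vLLM releases but may differ in yours, and `--max-model-len` is an illustrative choice, not a requirement.

```shell
# Hypothetical vLLM launch with MTP speculative decoding enabled.
# Check `vllm serve --help` on your installed version for exact flag names.
vllm serve Vishva007/Qwen3.5-9B-W4A16-AutoRound \
  --speculative-config '{"method":"mtp","num_speculative_tokens":1}' \
  --max-model-len 8192
```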

Usage

This model is compatible with transformers and backends that support AutoRound GPTQ-format weights (e.g., vLLM, SGLang, AutoGPTQ). For full model details, architecture, and capabilities, refer to the base model page.
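For a quick smoke test with transformers, something like the sketch below should work. Assumptions: a CUDA GPU with enough VRAM for the quantized weights, a transformers build with GPTQ-format support, and the accelerate package for `device_map="auto"`.

```python
# Minimal sketch: loading the quantized checkpoint with transformers.
MODEL_ID = "Vishva007/Qwen3.5-9B-W4A16-AutoRound"

def generate(prompt: str, max_new_tokens: int = 128) -> str:
    # Imports kept local: they pull in torch and download the weights.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens, not the prompt.
    new_tokens = out[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```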
