Qwen-3.5 Collection
Quantized Qwen3.5 models for efficient image-text understanding (AutoRound W4A16).
This is a W4A16 (4-bit weight, 16-bit activation) quantized version of Qwen/Qwen3.5-9B, produced using AutoRound, Intel's sign-gradient-descent-based quantization method designed for production-grade accuracy retention. Quantization was performed with Multi-Token Prediction (MTP) enabled.
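As a toy illustration of what the W4A16 layout means, the sketch below quantizes a weight vector to signed 4-bit integers with one 16-bit scale per group of 128, while activations (and dequantized weights) stay in higher precision. This shows only the storage format, not AutoRound's sign-gradient tuning of the rounding; all names here are illustrative, not part of the actual model code.

```python
# Toy sketch of symmetric per-group 4-bit weight quantization (W4A16).
# Illustrative only; AutoRound additionally tunes rounding via sign gradient descent.
import numpy as np

def quantize_w4_symmetric(w: np.ndarray, group_size: int = 128):
    """Quantize a 1-D weight vector to signed int4 values, one scale per group."""
    w = w.reshape(-1, group_size)
    # Symmetric quantization: zero-point is 0, representable range is [-8, 7].
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale.astype(np.float16)

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Reconstruct approximate weights from int4 values and per-group scales."""
    return (q.astype(np.float32) * scale.astype(np.float32)).reshape(-1)

rng = np.random.default_rng(0)
w = rng.standard_normal(1024).astype(np.float32)
q, scale = quantize_w4_symmetric(w)
max_err = float(np.abs(dequantize(q, scale) - w).max())
print(q.min(), q.max(), max_err)
```

With group size 128, each group stores 128 four-bit codes plus one 16-bit scale, roughly 4.125 bits per weight.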
| Parameter | Value |
|---|---|
| Method | AutoRound (W4A16) |
| Group Size | 128 |
| Symmetric | Yes |
| Iterations | 800 |
| Calibration Samples | 512 |
| Sequence Length | 2048 |
| Torch Compile | Enabled |
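The settings in the table above could in principle be reproduced with the auto-round CLI. This is a sketch, not the exact command used to produce this checkpoint: the flag names are assumptions based on the auto-round documentation and should be verified against your installed version, and the output directory is a placeholder.

```shell
# Hypothetical reproduction of the quantization config above using Intel's
# auto-round CLI. Flag names are assumptions; verify with `auto-round --help`.
# (torch.compile was enabled in the original run; its flag is omitted here.)
auto-round \
  --model Qwen/Qwen3.5-9B \
  --bits 4 \
  --group_size 128 \
  --sym \
  --iters 800 \
  --nsamples 512 \
  --seqlen 2048 \
  --output_dir ./qwen3.5-9b-w4a16
```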
This model supports Multi-Token Prediction (MTP) for improved inference throughput using speculative decoding.
When serving with compatible backends (e.g., vLLM), enable MTP using:

```shell
--speculative_config '{"method":"mtp","num_speculative_tokens":1}'
```
`num_speculative_tokens=1` is a stable default that balances speed and accuracy. This model is compatible with transformers and with backends that support AutoRound GPTQ-format weights (e.g., vLLM, SGLang, AutoGPTQ). For full model details, architecture, and capabilities, refer to the base model page.
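A full serving command might look like the following sketch. The model ID is a placeholder for this repository's path, and the exact flag spelling should be checked against your installed vLLM version:

```shell
# Hypothetical vLLM invocation; <quantized-model-id> is a placeholder.
vllm serve <quantized-model-id> \
  --speculative_config '{"method":"mtp","num_speculative_tokens":1}'
```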