Qwen3.5-4B-W8A8-Dynamic

W8A8 (INT8) quantization of Qwen/Qwen3.5-4B in the compressed-tensors format, ready to serve with vLLM.

  • Weights: INT8, per-channel, symmetric (static)
  • Activations: INT8, per-token, dynamic
  • Not quantized: lm_head and the vision tower (kept in BF16)
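
To make the scheme above concrete, here is a minimal pure-Python sketch of symmetric INT8 quantization with one scale per weight output channel (computed once, offline) and one scale per activation token (computed at run time). The function names, the [-127, 127] clamp range, and the rounding mode are assumptions chosen for illustration; the actual vLLM kernels may differ in these details.

```python
def quantize_symmetric_int8(values):
    """Quantize one vector to INT8 using a single symmetric scale (no zero point)."""
    max_abs = max((abs(v) for v in values), default=0.0)
    # Assumed convention: map the largest magnitude to 127.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def quantize_weights_per_channel(weight):
    """Static step: one scale per output channel (one row of the weight matrix)."""
    return [quantize_symmetric_int8(row) for row in weight]


def quantize_activations_per_token(activations):
    """Dynamic step: one scale per token, recomputed for every input."""
    return [quantize_symmetric_int8(token) for token in activations]


def int8_linear(activations, weight_q):
    """Integer matmul followed by rescaling with both scales."""
    outputs = []
    for act_q, act_scale in quantize_activations_per_token(activations):
        row = []
        for w_q, w_scale in weight_q:
            acc = sum(a * w for a, w in zip(act_q, w_q))  # integer accumulation
            row.append(acc * act_scale * w_scale)  # dequantize the result
        outputs.append(row)
    return outputs


# Toy example: 2 output channels, 2 input features, 1 token.
weight = [[1.0, -2.0], [0.5, 0.25]]
weight_q = quantize_weights_per_channel(weight)
result = int8_linear([[1.0, 2.0]], weight_q)
print(result)  # close to the float result [[-3.0, 1.0]], up to rounding error
```

Because the weight scales are fixed ahead of time while the activation scales depend on each incoming token, no activation calibration ranges need to be stored in the checkpoint for this kind of scheme.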

vLLM auto-detects the compressed-tensors quantization_config and serves it through its CompressedTensorsW8A8Int8 scheme.
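
The detection relies on the quantization_config entry in the checkpoint's config.json. The sketch below shows one way to inspect such an entry. The helper summarize_quantization is hypothetical, and the example JSON is an illustrative guess at the layout based on the general compressed-tensors format, not a copy of this repository's file; the module names for the vision tower are not known here, so the ignore list shows only lm_head.

```python
import json


def summarize_quantization(config):
    """Return a compact summary of a compressed-tensors quantization_config."""
    qc = config.get("quantization_config", {})
    summary = {
        "quant_method": qc.get("quant_method"),
        "ignore": qc.get("ignore", []),
        "groups": {},
    }
    for name, group in qc.get("config_groups", {}).items():
        weights = group.get("weights") or {}
        acts = group.get("input_activations") or {}
        summary["groups"][name] = {
            "weight_bits": weights.get("num_bits"),
            "weight_strategy": weights.get("strategy"),
            "activation_bits": acts.get("num_bits"),
            "activation_strategy": acts.get("strategy"),
            "activation_dynamic": acts.get("dynamic"),
        }
    return summary


# Illustrative config fragment (assumed layout, not the actual file contents).
example = json.loads("""
{
  "quantization_config": {
    "quant_method": "compressed-tensors",
    "ignore": ["lm_head"],
    "config_groups": {
      "group_0": {
        "targets": ["Linear"],
        "weights": {"num_bits": 8, "type": "int", "symmetric": true,
                    "strategy": "channel", "dynamic": false},
        "input_activations": {"num_bits": 8, "type": "int", "symmetric": true,
                              "strategy": "token", "dynamic": true}
      }
    }
  }
}
""")

summary = summarize_quantization(example)
print(json.dumps(summary, indent=2))
```

With a local copy of the repository, the same helper could be applied to the dictionary loaded from config.json to check that the fields match the scheme described above.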

Serving with vLLM

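from vllm import LLM, SamplingParams

# The quantization scheme is read from the checkpoint's config,
# so no explicit quantization argument is passed here.
llm = LLM(
    model="NotaMG/Qwen3.5-4B-W8A8-Dynamic",
    trust_remote_code=True,
    dtype="bfloat16",  # compute dtype for the non-quantized parts
)

# Greedy decoding (temperature 0.0), at most 32 new tokens.
params = SamplingParams(temperature=0.0, max_tokens=32)
out = llm.generate(["The capital of France is"], params)
print(out[0].outputs[0].text)  # text of the first completion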

License

This model is derived from Qwen/Qwen3.5-4B; refer to the base model for license terms.
