Qwen3.5-4B-W8A8-Dynamic

W8A8 (INT8) quantization of Qwen/Qwen3.5-4B in the compressed-tensors format, ready to serve with vLLM.

Weights: INT8, per-channel, symmetric (static)
Activations: INT8, per-token, dynamic
lm_head and the vision tower are kept in BF16.
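
The scheme above can be illustrated with a small NumPy sketch. It is not the kernel vLLM runs; it only shows the arithmetic, assuming symmetric quantization for activations as well (the card states this only for weights) and a clipping range of [-127, 127]. The function names are hypothetical.

```python
import numpy as np


def quantize_weights_per_channel(w):
    """Symmetric INT8 quantization with one static scale per output channel (row)."""
    max_abs = np.abs(w).max(axis=1, keepdims=True)
    # Guard against all-zero rows so the division below is always defined.
    scale = np.where(max_abs > 0, max_abs / 127.0, 1.0).astype(np.float32)
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale


def quantize_activations_per_token(x):
    """Symmetric INT8 quantization with one scale per token (row), computed at run time."""
    max_abs = np.abs(x).max(axis=1, keepdims=True)
    scale = np.where(max_abs > 0, max_abs / 127.0, 1.0).astype(np.float32)
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale


def w8a8_linear(x, w_q, w_scale):
    """Linear layer y = x @ W^T using INT8 operands and an INT32 accumulator."""
    x_q, x_scale = quantize_activations_per_token(x)  # dynamic: depends on the input
    acc = x_q.astype(np.int32) @ w_q.astype(np.int32).T
    # Rescale: one factor per token (rows) times one factor per output channel (columns).
    return acc.astype(np.float32) * x_scale * w_scale.T


rng = np.random.default_rng(0)
w = rng.standard_normal((8, 16)).astype(np.float32)   # (out_features, in_features)
x = rng.standard_normal((4, 16)).astype(np.float32)   # (tokens, in_features)

w_q, w_scale = quantize_weights_per_channel(w)        # done once, offline
y = w8a8_linear(x, w_q, w_scale)
print("max abs error vs. float matmul:", float(np.abs(y - x @ w.T).max()))
```

The weight scales are fixed when the checkpoint is produced, while the activation scales are recomputed for every token at inference time, which is what "static" and "dynamic" refer to above.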

vLLM auto-detects the compressed-tensors quantization_config and serves it through its CompressedTensorsW8A8Int8 scheme.
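
As a rough sketch of what that detection relies on, the snippet below summarizes a quantization_config dictionary. The field names (config_groups, weights, input_activations, num_bits, strategy, dynamic, ignore) follow the compressed-tensors layout as I understand it, and example_cfg is an illustrative stand-in, not a copy of this repository's config.json; check the actual file before depending on any of these keys. The helper summarize_w8a8 is hypothetical.

```python
def summarize_w8a8(quant_cfg):
    """Return (num_bits, strategy, dynamic) for weights and activations per group."""
    summary = {}
    for name, group in quant_cfg.get("config_groups", {}).items():
        w = group.get("weights") or {}
        a = group.get("input_activations") or {}
        summary[name] = {
            "weights": (w.get("num_bits"), w.get("strategy"), w.get("dynamic")),
            "activations": (a.get("num_bits"), a.get("strategy"), a.get("dynamic")),
        }
    return summary


# Illustrative values only (assumed layout); the real ignore list would also
# need to cover the vision tower modules, whose names are not given here.
example_cfg = {
    "quant_method": "compressed-tensors",
    "config_groups": {
        "group_0": {
            "targets": ["Linear"],
            "weights": {"num_bits": 8, "type": "int", "symmetric": True,
                        "strategy": "channel", "dynamic": False},
            "input_activations": {"num_bits": 8, "type": "int",
                                  "strategy": "token", "dynamic": True},
        }
    },
    "ignore": ["lm_head"],
}

print(summarize_w8a8(example_cfg))
```

In this layout, a static per-channel weight entry paired with a dynamic per-token activation entry corresponds to the W8A8 description given earlier.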

Serving with vLLM

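from vllm import LLM, SamplingParams

# vLLM reads the quantization settings from the checkpoint's config,
# so no explicit quantization argument is passed here.
llm = LLM(
    model="NotaMG/Qwen3.5-4B-W8A8-Dynamic",
    trust_remote_code=True,
    dtype="bfloat16",  # matches the BF16 layers that are left unquantized
)

# temperature=0.0 selects greedy decoding.
params = SamplingParams(temperature=0.0, max_tokens=32)
out = llm.generate(["The capital of France is"], params)
print(out[0].outputs[0].text)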

License

This model is derived from Qwen/Qwen3.5-4B; refer to the base model for license terms.
