Qwen3.5-9B MXFP4

MXFP4 quantized version of Qwen3.5-9B (9B parameters, dense, hybrid Gated DeltaNet + Gated Attention).

Only the MLP weights (gate/up/down projections) are quantized to MXFP4 (4-bit microscaling with e8m0 shared exponents, block size 32). Self-attention, linear attention (Gated DeltaNet), the visual encoder, MTP layers, embeddings, and normalization layers all remain in BF16.

|                                  | Original (BF16) | This model (MXFP4) |
|----------------------------------|-----------------|--------------------|
| Size on disk                     | 19 GB           | 12 GB              |
| Perplexity (wikitext, 2048 ctx)  | 8.55            | 8.30               |

Model Details

  • Architecture: Qwen3.5 dense — hybrid Gated DeltaNet + Gated Attention with 32 layers
  • Parameters: 9B
  • Context length: 262,144 tokens
  • Vocabulary: 248,320 tokens

What's quantized

| Component                                 | Precision                          | Notes                                  |
|-------------------------------------------|------------------------------------|----------------------------------------|
| MLP gate_proj, up_proj, down_proj         | MXFP4 (uint8 packed + e8m0 scales) | 2D standard linear weights             |
| Self-attention (Q/K/V/O projections)      | BF16                               | Excluded — preserves attention quality |
| Linear attention (Gated DeltaNet layers)  | BF16                               | Excluded                               |
| Visual encoder                            | BF16                               | Excluded                               |
| MTP layers                                | BF16                               | Excluded                               |
| Embeddings, LM head                       | BF16                               | Excluded                               |
| LayerNorm weights                         | BF16                               | 1D, not quantizable                    |
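The "uint8 packed" precision above means two 4-bit E2M1 codes share one byte. A minimal sketch of such packing (the nibble order shown — low nibble holds the even-indexed element — is an assumption, not confirmed by this card):

```python
def pack_fp4(codes):
    """Pack 4-bit codes (0..15) into bytes, two per uint8.

    Assumed layout: even index in the low nibble, odd in the high nibble.
    """
    assert len(codes) % 2 == 0
    return bytes((codes[i] & 0xF) | ((codes[i + 1] & 0xF) << 4)
                 for i in range(0, len(codes), 2))


def unpack_fp4(packed):
    """Inverse of pack_fp4: recover the original 4-bit codes."""
    out = []
    for b in packed:
        out.append(b & 0xF)
        out.append(b >> 4)
    return out
```

Each 32-element block additionally stores one uint8 e8m0 scale, so the effective width is 4.25 bits per weight rather than 4.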

Quantization method

  • Format: MXFP4 — 4-bit float (E2M1) with shared e8m0 block exponent per 32 elements
  • Scale selection: MSE-optimal over 3 candidate exponents per block (not simple rounding)
  • Output format: compressed-tensors with mxfp4-pack-quantized — loads through vLLM's stock compressed-tensors path (see the vLLM version note under Usage)
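As a rough illustration of the method above, here is a minimal sketch of per-block MXFP4 quantization with a 3-candidate MSE search. The E2M1 magnitude grid {0, 0.5, 1, 1.5, 2, 3, 4, 6} is standard, but the exact candidate exponents qstream tries are an assumption:

```python
import math

# E2M1 representable magnitudes (sign is stored separately in the 4th bit).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def nearest_e2m1(x):
    """Snap |x| to the closest E2M1 magnitude, keeping the sign."""
    mag = min(E2M1_GRID, key=lambda g: abs(abs(x) - g))
    return math.copysign(mag, x) if mag else 0.0


def quantize_block(block):
    """Quantize one 32-element block; return (exponent, dequantized values).

    Tries 3 candidate power-of-two (e8m0) scales around the exponent
    implied by the block maximum and keeps the one with lowest MSE.
    The candidate set (base, base+1, base+2) is an assumption.
    """
    amax = max(abs(v) for v in block) or 1.0
    base = math.floor(math.log2(amax / 6.0))  # maps amax near the top code
    best = None
    for e in (base, base + 1, base + 2):
        scale = 2.0 ** e
        deq = [nearest_e2m1(v / scale) * scale for v in block]
        err = sum((a - b) ** 2 for a, b in zip(block, deq))
        if best is None or err < best[0]:
            best = (err, e, deq)
    return best[1], best[2]
```

Because the scale is a pure power of two (e8m0), dequantization is an exponent shift with no multiplier, which is what keeps the format kernel-friendly.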

Usage

vLLM

pip install vllm

vllm serve olka-fi/Qwen3.5-9B-MXFP4 \
    --quantization compressed-tensors \
    --gpu-memory-utilization 0.95 \
    --max-model-len 4096

Note: Requires vLLM with Qwen3.5 architecture support (not yet in stock vLLM 0.16.0).

Python

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

response = client.chat.completions.create(
    model="Qwen3.5-9B-MXFP4",
    messages=[{"role": "user", "content": "Who are you?"}],
)
print(response.choices[0].message.content)

Quantization Details

  • Quantized with qstream — custom MXFP4 quantization tool
  • MSE-optimal 3-candidate scale selection per block (32 elements)
  • Per-block shared exponent in e8m0 format
  • Exclude patterns: *self_attn*, *linear_attn*, *lm_head*, *embed_tokens*, *visual*, *mtp*
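The exclude patterns above can be applied with ordinary glob matching. This hypothetical helper (not qstream's actual API) shows the selection logic, including the 1D rule from the table:

```python
from fnmatch import fnmatch

# Exclude patterns as listed on this card.
EXCLUDE = ["*self_attn*", "*linear_attn*", "*lm_head*",
           "*embed_tokens*", "*visual*", "*mtp*"]


def should_quantize(name, ndim=2):
    """Quantize only 2D weights whose name matches no exclude pattern."""
    if ndim != 2:  # 1D norm weights stay in BF16
        return False
    return not any(fnmatch(name, p) for p in EXCLUDE)
```

Under these patterns, an MLP projection like `model.layers.0.mlp.gate_proj.weight` is quantized, while anything under `self_attn`, `linear_attn`, the embeddings, the LM head, the visual tower, or MTP layers is skipped.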

Acknowledgments

Based on Qwen3.5-9B by Tongyi Lab (Alibaba).
