Qwen3.5-9B MXFP4

MXFP4 quantized version of Qwen3.5-9B (9B parameters, dense, hybrid Gated DeltaNet + Gated Attention).

Only the MLP weights (gate/up/down projections) are quantized to MXFP4 (4-bit microscaling with e8m0 shared exponents, block size 32). Self-attention, linear attention (Gated DeltaNet), the visual encoder, MTP layers, embeddings, and normalization layers all remain in BF16.

|                                  | Original (BF16) | This model (MXFP4) |
|----------------------------------|-----------------|--------------------|
| Size on disk                     | 19 GB           | 12 GB              |
| Perplexity (wikitext, 2048 ctx)  | 8.55            | 8.30               |

Model Details

  • Architecture: Qwen3.5 dense — hybrid Gated DeltaNet + Gated Attention with 32 layers
  • Parameters: 9B
  • Context length: 262,144 tokens
  • Vocabulary: 248,320 tokens

What's quantized

| Component                                 | Precision                          | Notes                                  |
|-------------------------------------------|------------------------------------|----------------------------------------|
| MLP gate_proj, up_proj, down_proj         | MXFP4 (uint8 packed + e8m0 scales) | 2D standard linear weights             |
| Self-attention (Q/K/V/O projections)      | BF16                               | Excluded — preserves attention quality |
| Linear attention (Gated DeltaNet layers)  | BF16                               | Excluded                               |
| Visual encoder                            | BF16                               | Excluded                               |
| MTP layers                                | BF16                               | Excluded                               |
| Embeddings, LM head                       | BF16                               | Excluded                               |
| LayerNorm weights                         | BF16                               | 1D, not quantizable                    |
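The "uint8 packed" precision above means two 4-bit E2M1 codes share one byte. A minimal sketch of such packing (the nibble order shown — low nibble holds the even-indexed element — is an assumption, not confirmed by this card):

```python
def pack_fp4(codes):
    """Pack 4-bit codes (0..15) into bytes, two per uint8.

    Assumed layout: even index in the low nibble, odd in the high nibble.
    """
    assert len(codes) % 2 == 0
    return bytes((codes[i] & 0xF) | ((codes[i + 1] & 0xF) << 4)
                 for i in range(0, len(codes), 2))


def unpack_fp4(packed):
    """Inverse of pack_fp4: recover the original 4-bit codes."""
    out = []
    for b in packed:
        out.append(b & 0xF)
        out.append(b >> 4)
    return out
```

Each 32-element block additionally stores one uint8 e8m0 scale, so the effective width is 4.25 bits per weight rather than 4.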

Quantization method

  • Format: MXFP4 — 4-bit float (E2M1) with shared e8m0 block exponent per 32 elements
  • Scale selection: MSE-optimal over 3 candidate exponents per block (not simple rounding)
  • Output format: compressed-tensors with mxfp4-pack-quantized — loads through vLLM's stock compressed-tensors path (see the vLLM version note under Usage)
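As a rough illustration of the method above, here is a minimal sketch of per-block MXFP4 quantization with a 3-candidate MSE search. The E2M1 magnitude grid {0, 0.5, 1, 1.5, 2, 3, 4, 6} is standard, but the exact candidate exponents qstream tries are an assumption:

```python
import math

# E2M1 representable magnitudes (sign is stored separately in the 4th bit).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def nearest_e2m1(x):
    """Snap |x| to the closest E2M1 magnitude, keeping the sign."""
    mag = min(E2M1_GRID, key=lambda g: abs(abs(x) - g))
    return math.copysign(mag, x) if mag else 0.0


def quantize_block(block):
    """Quantize one 32-element block; return (exponent, dequantized values).

    Tries 3 candidate power-of-two (e8m0) scales around the exponent
    implied by the block maximum and keeps the one with lowest MSE.
    The candidate set (base, base+1, base+2) is an assumption.
    """
    amax = max(abs(v) for v in block) or 1.0
    base = math.floor(math.log2(amax / 6.0))  # maps amax near the top code
    best = None
    for e in (base, base + 1, base + 2):
        scale = 2.0 ** e
        deq = [nearest_e2m1(v / scale) * scale for v in block]
        err = sum((a - b) ** 2 for a, b in zip(block, deq))
        if best is None or err < best[0]:
            best = (err, e, deq)
    return best[1], best[2]
```

Because the scale is a pure power of two (e8m0), dequantization is an exponent shift with no multiplier, which is what keeps the format kernel-friendly.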

Usage

vLLM

pip install vllm

vllm serve olka-fi/Qwen3.5-9B-MXFP4 \
    --quantization compressed-tensors \
    --gpu-memory-utilization 0.95 \
    --max-model-len 4096

Note: Requires vLLM with Qwen3.5 architecture support (not yet in stock vLLM 0.16.0).

Python

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

response = client.chat.completions.create(
    model="Qwen3.5-9B-MXFP4",
    messages=[{"role": "user", "content": "Who are you?"}],
)
print(response.choices[0].message.content)

Quantization Details

  • Quantized with qstream — custom MXFP4 quantization tool
  • MSE-optimal 3-candidate scale selection per block (32 elements)
  • Per-block shared exponent in e8m0 format
  • Exclude patterns: *self_attn*, *linear_attn*, *lm_head*, *embed_tokens*, *visual*, *mtp*
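The exclude patterns above can be applied with ordinary glob matching. This hypothetical helper (not qstream's actual API) shows the selection logic, including the 1D rule from the table:

```python
from fnmatch import fnmatch

# Exclude patterns as listed on this card.
EXCLUDE = ["*self_attn*", "*linear_attn*", "*lm_head*",
           "*embed_tokens*", "*visual*", "*mtp*"]


def should_quantize(name, ndim=2):
    """Quantize only 2D weights whose name matches no exclude pattern."""
    if ndim != 2:  # 1D norm weights stay in BF16
        return False
    return not any(fnmatch(name, p) for p in EXCLUDE)
```

Under these patterns, an MLP projection like `model.layers.0.mlp.gate_proj.weight` is quantized, while anything under `self_attn`, `linear_attn`, the embeddings, the LM head, the visual tower, or MTP layers is skipped.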

Acknowledgments

Based on Qwen3.5-9B by Tongyi Lab (Alibaba).
