Qwopus3.5-9B-v3-NVFP4

NVFP4 (W4A4 FP4) quantization of Jackrong/Qwopus3.5-9B-v3, a Qwen 3.5 9B reasoning and tool-calling model.

              BF16        NVFP4 (this)
Size          18 GB       9.6 GB
Format        bfloat16    compressed-tensors NVFP4
Serving       Any         vLLM v0.19+

Quickstart (vLLM)

vllm serve mtecnic/Qwopus3.5-9B-v3-NVFP4 \
    --tensor-parallel-size 1 \
    --gpu-memory-utilization 0.90 \
    --max-model-len 8192 \
    --trust-remote-code \
    --kv-cache-dtype fp8_e5m2 \
    --enable-auto-tool-choice \
    --tool-call-parser qwen35_coder

Requirements: vLLM v0.19+ with transformers==4.57.6 (the version shipped with the v0.19 Docker image). Do NOT upgrade transformers to 5.x inside the container — vLLM v0.19 uses its own internal Qwen 3.5 config which conflicts with transformers 5.x classes.
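If you want to verify the pinned version inside the container before serving, here is a minimal sketch; the helper name and the "4.57.x series" check are illustrative assumptions, not part of vLLM:

```python
def is_supported_transformers(version: str) -> bool:
    """True for the transformers 4.57.x series that the vLLM v0.19 image ships."""
    major, minor = version.split(".")[:2]
    return (int(major), int(minor)) == (4, 57)

# Inside the container, guard before launching:
# import transformers
# assert is_supported_transformers(transformers.__version__), transformers.__version__
assert is_supported_transformers("4.57.6")
assert not is_supported_transformers("5.0.0")
```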

Usage (OpenAI-compatible API)

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

response = client.chat.completions.create(
    model="mtecnic/Qwopus3.5-9B-v3-NVFP4",
    messages=[{"role": "user", "content": "Write a Python function to check if a number is prime."}],
    max_tokens=512,
)
print(response.choices[0].message.content)
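Because the server runs with --enable-auto-tool-choice and the qwen35_coder parser, tool calling goes through the standard OpenAI tools parameter. A minimal sketch of a tool definition; get_weather, its description, and its parameters are hypothetical placeholders:

```python
# Hypothetical tool definition in the OpenAI function-calling schema.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",  # placeholder name
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name"},
                },
                "required": ["city"],
            },
        },
    }
]

# Passed alongside messages, e.g.:
# response = client.chat.completions.create(
#     model="mtecnic/Qwopus3.5-9B-v3-NVFP4",
#     messages=[{"role": "user", "content": "Weather in Paris?"}],
#     tools=tools,
# )
# Parsed calls then appear in response.choices[0].message.tool_calls.
```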

Limitations

  • No vision/image capability — Vision encoder weights are not included (see below)
  • vLLM only — Requires vLLM v0.19+ with transformers 4.57.6; not compatible with transformers 5.x in the serving container
  • Partial quantization — 75% of attention layers (Gated DeltaNet) are kept at full precision, so the compression ratio is lower than fully-quantized models
  • Tokenizer regex warning — A harmless Mistral-inherited regex pattern warning may appear; does not affect tokenization quality
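The size effect of partial quantization is easy to check from the comparison table at the top; the ~4x figure for a hypothetical fully 4-bit checkpoint below is a back-of-envelope bound (16-bit vs 4-bit weights, ignoring scale overhead):

```python
bf16_gb = 18.0   # BF16 size from the comparison table
nvfp4_gb = 9.6   # NVFP4 size

ratio = bf16_gb / nvfp4_gb
print(f"observed compression: {ratio:.2f}x")   # 1.88x

# A fully quantized 16-bit -> 4-bit model would approach 16 / 4 = 4x;
# keeping 24 of 32 attention blocks in full precision accounts for the gap.
```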

Important: Text-Only Model

This quantization contains text weights only. The base model (Jackrong/Qwopus3.5-9B-v3) is built on Qwen 3.5 9B, which uses a multimodal architecture (Qwen3_5ForConditionalGeneration), but this checkpoint was quantized via AutoModelForCausalLM, so the vision encoder weights were never loaded and are not included.

The config.json retains Qwen3_5ForConditionalGeneration architecture and vision_config solely for vLLM v0.19 compatibility (vLLM has no registered handler for Qwen3_5ForCausalLM). Image and video inputs will not work.

Quantization Details

Quantized using llm-compressor with QuantizationModifier(scheme="NVFP4").

Layers kept at full precision (not quantized):

  • lm_head — Output head (248K vocab), precision-critical for token probabilities
  • All linear_attn.* layers (24 of 32 decoder layers) — Gated DeltaNet linear attention layers use delta-rule memory updates and gating projections that are sensitive to quantization, per official llm-compressor guidance

Layers quantized to NVFP4:

  • self_attn.* (q/k/v/o projections) — Full softmax attention on layers 3, 7, 11, 15, 19, 23, 27, 31
  • mlp.* (gate/up/down projections) — SwiGLU MLP on all 32 layers

Calibration: 256 samples from allenai/tulu-3-sft-mixture, max_seq_length=512.
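The precision split above is expressed through ignore patterns in the quantization config. Here is a sketch of how such patterns select weight keys, using Python's fnmatch as a stand-in for llm-compressor's own matching; the specific projection names in the sample keys are illustrative:

```python
import fnmatch

# Illustrative ignore list mirroring the layers kept at full precision.
ignore = [
    "lm_head",
    "model.language_model.layers.*.linear_attn.*",
]

def is_ignored(key: str) -> bool:
    """True if a checkpoint weight key matches any ignore glob."""
    return any(fnmatch.fnmatch(key, pat) for pat in ignore)

assert is_ignored("lm_head")
assert is_ignored("model.language_model.layers.0.linear_attn.in_proj.weight")   # kept BF16
assert not is_ignored("model.language_model.layers.3.self_attn.q_proj.weight")  # quantized
assert not is_ignored("model.language_model.layers.5.mlp.gate_proj.weight")     # quantized
```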

Config Modifications for vLLM

The following config changes were made post-quantization for vLLM v0.19 compatibility:

Field                        Original                 Modified                             Reason
model_type                   qwen3_5_text             qwen3_5                              vLLM only recognizes qwen3_5
architectures                Qwen3_5ForCausalLM       Qwen3_5ForConditionalGeneration      vLLM only registers ConditionalGeneration
tokenizer_class              TokenizersBackend        Qwen2TokenizerFast                   transformers 4.57.6 compat
quantization_config.ignore   model.layers.* paths     model.language_model.layers.* paths  Match weight key naming
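These edits can be scripted. A sketch of the patch logic on an in-memory config dict; the helper and the sample ignore entry are illustrative, and tokenizer_class lives in tokenizer_config.json, so it is edited separately:

```python
def patch_config_for_vllm(cfg: dict) -> dict:
    """Apply the vLLM v0.19 compatibility edits described above (illustrative)."""
    cfg = dict(cfg)
    cfg["model_type"] = "qwen3_5"
    cfg["architectures"] = ["Qwen3_5ForConditionalGeneration"]
    qc = dict(cfg.get("quantization_config", {}))
    qc["ignore"] = [
        p.replace("model.layers.", "model.language_model.layers.")
        for p in qc.get("ignore", [])
    ]
    cfg["quantization_config"] = qc
    return cfg

# Trimmed original config with an illustrative ignore entry:
original = {
    "model_type": "qwen3_5_text",
    "architectures": ["Qwen3_5ForCausalLM"],
    "quantization_config": {"ignore": ["model.layers.0.linear_attn.in_proj"]},
}
patched = patch_config_for_vllm(original)
```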

Architecture

Qwen 3.5 9B is a dense transformer with hybrid attention:

  • 32 decoder layers, hidden_size=4096, vocab=248,320
  • 75% Gated DeltaNet (linear attention), 25% full softmax attention
  • GQA: 16 query heads, 4 KV heads
  • SwiGLU MLP, RMSNorm, RoPE (theta=10M)
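The 75/25 split lines up with the full-attention layer indices given in the quantization section (3, 7, ..., 31), i.e. full softmax attention on every fourth layer. A quick consistency check:

```python
NUM_LAYERS = 32

# Full softmax attention on every fourth layer; the rest are Gated DeltaNet.
full_attn = [i for i in range(NUM_LAYERS) if i % 4 == 3]

assert full_attn == [3, 7, 11, 15, 19, 23, 27, 31]
assert 1 - len(full_attn) / NUM_LAYERS == 0.75   # 75% linear attention
```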

License

Apache-2.0, same as the base model.

Acknowledgments
