# Qwopus3.5-9B-v3-NVFP4

NVFP4 (W4A4 FP4) quantization of Jackrong/Qwopus3.5-9B-v3, a Qwen 3.5 9B reasoning and tool-calling model.

| | BF16 | NVFP4 (this) |
|---|---|---|
| Size | 18 GB | 9.6 GB |
| Format | bfloat16 | compressed-tensors NVFP4 |
| Serving | Any | vLLM v0.19+ |
## Quickstart (vLLM)

```shell
vllm serve mtecnic/Qwopus3.5-9B-v3-NVFP4 \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 8192 \
  --trust-remote-code \
  --kv-cache-dtype fp8_e5m2 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen35_coder
```

Requirements: vLLM v0.19+ with transformers==4.57.6 (the version shipped with the v0.19 Docker image). Do NOT upgrade transformers to 5.x inside the container — vLLM v0.19 uses its own internal Qwen 3.5 config which conflicts with transformers 5.x classes.
## Usage (OpenAI-compatible API)

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

response = client.chat.completions.create(
    model="mtecnic/Qwopus3.5-9B-v3-NVFP4",
    messages=[
        {"role": "user", "content": "Write a Python function to check if a number is prime."}
    ],
    max_tokens=512,
)
print(response.choices[0].message.content)
```

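Since the server is launched with `--enable-auto-tool-choice`, tool calling works through the standard OpenAI `tools` parameter. Below is a minimal sketch of a request payload; the `get_weather` function and its parameters are made-up illustrations, not something shipped with the model:

```python
# Hypothetical tool schema for illustration only; the function name and
# parameter shape are assumptions, not part of this checkpoint.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

# With a live server, pass the schema to the same client as above:
# response = client.chat.completions.create(
#     model="mtecnic/Qwopus3.5-9B-v3-NVFP4",
#     messages=[{"role": "user", "content": "Weather in Oslo?"}],
#     tools=tools,
#     tool_choice="auto",
# )
# Any emitted calls then appear in response.choices[0].message.tool_calls.
```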
## Limitations
- No vision/image capability — Vision encoder weights are not included (see below)
- vLLM only — Requires vLLM v0.19+ with transformers 4.57.6; not compatible with transformers 5.x in the serving container
- Partial quantization — 75% of attention layers (Gated DeltaNet) are kept at full precision, so the compression ratio is lower than fully-quantized models
- Tokenizer regex warning — A harmless Mistral-inherited regex pattern warning may appear; does not affect tokenization quality
## Important: Text-Only Model
This quantization contains text weights only. The base model (Jackrong/Qwopus3.5-9B-v3) is built on Qwen 3.5 9B, which has a multimodal architecture (`Qwen3_5ForConditionalGeneration`), but this checkpoint was quantized via `AutoModelForCausalLM`, so the vision encoder weights are not included.

The `config.json` retains the `Qwen3_5ForConditionalGeneration` architecture and `vision_config` solely for vLLM v0.19 compatibility (vLLM has no registered handler for `Qwen3_5ForCausalLM`). Image and video inputs will not work.
## Quantization Details

Quantized using llm-compressor with `QuantizationModifier(scheme="NVFP4")`.
Layers kept at full precision (not quantized):

- `lm_head` — Output head (248K vocab), precision-critical for token probabilities
- All `linear_attn.*` layers (24 of 32 decoder layers) — Gated DeltaNet linear attention layers use delta-rule memory updates and gating projections that are sensitive to quantization, per official llm-compressor guidance

Layers quantized to NVFP4:

- `self_attn.*` (q/k/v/o projections) — Full softmax attention on layers 3, 7, 11, 15, 19, 23, 27, 31
- `mlp.*` (gate/up/down projections) — SwiGLU MLP on all 32 layers
Calibration: 256 samples from allenai/tulu-3-sft-mixture, max_seq_length=512.
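NVFP4 stores weights as FP4 (E2M1) values with a scale per 16-element block (FP8 E4M3 scales plus a per-tensor FP32 scale in the real format). The rounding behavior can be sketched with a plain-float fake-quantizer; this is an illustration of the numerics, not the actual llm-compressor kernel:

```python
import numpy as np

# The 8 non-negative magnitudes representable in FP4 (E2M1); with a sign
# bit this gives the format's 16 code points. 6.0 is the largest value.
FP4_LEVELS = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fake_quant_nvfp4_block(x):
    """Fake-quantize one 16-element block: scale so the block's amax maps
    onto 6.0, round each magnitude to the nearest FP4 level, rescale back."""
    scale = max(float(np.abs(x).max()) / 6.0, 1e-12)
    mags = np.abs(x) / scale
    nearest = FP4_LEVELS[np.abs(mags[:, None] - FP4_LEVELS[None, :]).argmin(axis=1)]
    return np.sign(x) * nearest * scale, scale

block = np.linspace(-1.0, 1.0, 16)        # stand-in 16-element weight block
deq, scale = fake_quant_nvfp4_block(block)
```

This sketch uses a plain float scale to keep the rounding visible; the real format additionally quantizes the per-block scale itself to FP8.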
## Config Modifications for vLLM
The following config changes were made post-quantization for vLLM v0.19 compatibility:
| Field | Original | Modified | Reason |
|---|---|---|---|
| `model_type` | `qwen3_5_text` | `qwen3_5` | vLLM only recognizes `qwen3_5` |
| `architectures` | `Qwen3_5ForCausalLM` | `Qwen3_5ForConditionalGeneration` | vLLM only registers ConditionalGeneration |
| `tokenizer_class` | `TokenizersBackend` | `Qwen2TokenizerFast` | transformers 4.57.6 compat |
| `quantization_config.ignore` | `model.layers.*` paths | `model.language_model.layers.*` paths | Match weight key naming |
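These edits amount to a small post-processing step over `config.json`. The stand-in dict below is a minimal illustration (the real file has many more fields), and the example ignore entries are assumptions:

```python
import json

# Minimal stand-in for the checkpoint's config.json; fields beyond those
# discussed above are omitted, and the ignore entries are examples.
config = {
    "model_type": "qwen3_5_text",
    "architectures": ["Qwen3_5ForCausalLM"],
    "quantization_config": {
        "ignore": ["lm_head", "model.layers.0.linear_attn.in_proj"],
    },
}

# Apply the vLLM v0.19 compatibility edits.
config["model_type"] = "qwen3_5"
config["architectures"] = ["Qwen3_5ForConditionalGeneration"]
config["tokenizer_class"] = "Qwen2TokenizerFast"
config["quantization_config"]["ignore"] = [
    p.replace("model.layers.", "model.language_model.layers.")
    for p in config["quantization_config"]["ignore"]
]

patched = json.dumps(config, indent=2)  # ready to write back to disk
```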
## Architecture
Qwen 3.5 9B is a dense transformer with hybrid attention:
- 32 decoder layers, hidden_size=4096, vocab=248,320
- 75% Gated DeltaNet (linear attention), 25% full softmax attention
- GQA: 16 query heads, 4 KV heads
- SwiGLU MLP, RMSNorm, RoPE (theta=10M)
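Because only the 8 full-softmax layers keep a conventional KV cache (the Gated DeltaNet layers hold a fixed-size recurrent state instead), the fp8 cache at the quickstart's 8192-token context is small. A back-of-envelope sketch, assuming head_dim = hidden_size / query heads = 256 (an assumption; the checkpoint may define head_dim explicitly):

```python
# KV-cache size for the softmax-attention layers at fp8 (1 byte/value).
layers_with_kv = 8          # 25% of 32 decoder layers use full attention
kv_heads = 4                # GQA KV heads
head_dim = 4096 // 16       # assumed: hidden_size / query heads = 256
seq_len = 8192              # --max-model-len from the quickstart
bytes_per_value = 1         # fp8_e5m2 KV cache

# Factor of 2 covers both K and V.
kv_bytes = 2 * layers_with_kv * kv_heads * head_dim * seq_len * bytes_per_value
print(f"{kv_bytes / 2**20:.0f} MiB")   # prints "128 MiB"
```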
## License
Apache-2.0, same as the base model.
## Acknowledgments
- Jackrong/Qwopus3.5-9B-v3 — Base model
- Qwen/Qwen3.5-9B — Foundation model
- vllm-project/llm-compressor — Quantization toolkit