Qwen3.5-9B-Instruct-NVFP4-GGUF

NVFP4 (NVIDIA Blackwell FP4) GGUF quantization of the Qwen3.5-9B-Instruct multimodal language model with thinking disabled by default.

About NVFP4

NVFP4 is NVIDIA's native 4-bit floating-point format designed for Blackwell GPU architectures. Each element is stored as E2M1 (1 sign, 2 exponent, 1 mantissa bit), and every block of 16 elements shares an FP8 (E4M3) scale factor. Key characteristics:

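| Aspect | NVFP4 | INT4 (e.g. Q4_K_M) |
|---|---|---|
| Format | FP4 (E2M1 elements, FP8 E4M3 block scales) | Integer |
| Block size | 16 elements per scale (plus a per-tensor FP32 scale) | Block 32 |
| Dynamic range | Element magnitudes 0.5 to 6, extended by the block scales | Fixed |
| Zero representation | Exact | Exact |
| Hardware acceleration | Blackwell tensor cores | CPU / any GPU |
| Dequantization overhead | None (native) | Required |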
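
To make the format concrete, here is a minimal pure-Python sketch of block quantization with E2M1 elements. It is an illustration, not the llama.cpp kernel: the scale is kept as an ordinary float (real NVFP4 stores it in FP8 E4M3), and the function names are hypothetical.

```python
# Magnitudes representable by a 4-bit E2M1 value (1 sign, 2 exponent, 1 mantissa bit).
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def decode_e2m1(code: int) -> float:
    """Decode a 4-bit code: bit 3 is the sign, bits 0-2 index the magnitude."""
    sign = -1.0 if code & 0b1000 else 1.0
    return sign * E2M1_VALUES[code & 0b0111]


def quantize_block(block, block_size=16):
    """Quantize one block to (scale, codes) so the largest magnitude maps to 6.0."""
    assert len(block) == block_size
    amax = max(abs(x) for x in block)
    # Simplification: a plain float scale; NVFP4 would round this to FP8 (E4M3).
    scale = amax / 6.0 if amax > 0 else 1.0
    codes = []
    for x in block:
        mag = abs(x) / scale
        # Pick the nearest representable magnitude.
        idx = min(range(8), key=lambda i: abs(E2M1_VALUES[i] - mag))
        codes.append((0b1000 if x < 0 else 0) | idx)
    return scale, codes


def dequantize_block(scale, codes):
    """Reconstruct approximate values from a scale and 4-bit codes."""
    return [scale * decode_e2m1(c) for c in codes]
```

A block whose values already lie on the E2M1 grid round-trips exactly; other values are rounded to the nearest of the eight magnitudes after scaling.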

When to use NVFP4: You are running on an NVIDIA Blackwell GPU (RTX 5060, 5070, 5080, 5090, B100, B200, etc.) and want maximum performance with native 4-bit tensor core acceleration.

When to use a traditional format (Q4_K_M, Q5_K_M, etc.): You are running on pre-Blackwell hardware (Ampere, Ada Lovelace, Hopper), AMD GPUs, or CPU inference.
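
The choice above can be sketched as a small check on the GPU's CUDA compute capability. This assumes that a major version of 10 or higher identifies Blackwell (data-center Blackwell reports 10.x, RTX 50 series 12.x) and that the installed driver's `nvidia-smi` supports the `compute_cap` query field; the function names are hypothetical.

```python
import subprocess


def is_blackwell(compute_cap: str) -> bool:
    """Return True if a compute capability string such as '12.0' has major version >= 10."""
    major = int(compute_cap.strip().split(".")[0])
    return major >= 10


def detect_compute_caps() -> list[str]:
    """Ask nvidia-smi for the compute capability of each visible GPU."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=compute_cap", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [line.strip() for line in out.splitlines() if line.strip()]


def recommend_format(compute_caps: list[str]) -> str:
    """Suggest NVFP4 only when every detected GPU passes the Blackwell check."""
    if compute_caps and all(is_blackwell(c) for c in compute_caps):
        return "NVFP4"
    return "Q4_K_M"
```

Calling `recommend_format(detect_compute_caps())` gives a suggestion on a machine with NVIDIA drivers installed; on AMD or CPU-only systems `nvidia-smi` is absent, so the call raises `FileNotFoundError` and a traditional format applies.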

Files

| File | Type | Size | Description |
|---|---|---|---|
| qwen35-9b-instruct-nvfp4.gguf | Text model | 5.31 GB | Qwen3.5-9B-Instruct text model, NVFP4 quantized |
| mmproj-qwen35-9b-f16.gguf | Vision encoder | 0.92 GB | Multimodal projector (SigLIP ViT), F16 |

Quantization Details

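| Parameter | Value |
|---|---|
| Quantization format | NVFP4 (E2M1 elements, FP8 E4M3 block scales) |
| Block size | 16 elements per scale |
| Bits per weight | ~4.74 |
| Hardware target | NVIDIA Blackwell (RTX 5000 series, B-series) |
| VRAM requirement | At least ~5.3 GB (text weights) + ~0.9 GB (vision weights), based on the file sizes above, plus KV cache |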
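
The bits-per-weight figure can be sanity-checked with a little arithmetic. The sketch below assumes 16-element blocks with one 8-bit scale each, exactly 9e9 parameters, and decimal gigabytes; all three are approximations, so treat the result as a rough check rather than a measurement.

```python
def nominal_bits_per_weight(element_bits: int, scale_bits: int, block_size: int) -> float:
    """Bits per weight for a block format: element bits plus the amortized scale."""
    return element_bits + scale_bits / block_size


def effective_bits_per_weight(file_bytes: float, n_params: float) -> float:
    """Average bits per weight implied by a file size and a parameter count."""
    return file_bytes * 8 / n_params


# 4-bit elements with an 8-bit scale shared by 16 elements.
nominal = nominal_bits_per_weight(4, 8, 16)

# 5.31 GB text model file, assuming 9e9 parameters.
effective = effective_bits_per_weight(5.31e9, 9e9)

print(f"nominal: {nominal:.2f} bpw, effective: {effective:.2f} bpw")
```

The effective value sits above the nominal one, which would be consistent with some tensors (for example embeddings or norms) being stored at higher precision, though that is an assumption about this particular file.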

Model Description

Qwen3.5-9B-Instruct is a 9-billion-parameter multimodal language model from the Qwen team at Alibaba. It supports:

  • Text generation with instruction following
  • Image understanding (multimodal via SigLIP vision encoder)
  • Code generation and reasoning
  • Multilingual support (English, Chinese, and more)
  • 32 transformer layers, 4096 hidden dimension, GQA + MLA hybrid attention with SSM (Mamba-2) interleaving

Usage

Thinking Control

Thinking is disabled by default. To enable reasoning, send a system message containing enable_thinking=true:

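# llama.cpp CLI (thinking disabled by default).
# --jinja applies the chat template embedded in the GGUF. --chat-template
# expects a built-in template name or a template string, not a metadata key.
./llama-cli -m qwen35-9b-instruct-nvfp4.gguf \
  --mmproj mmproj-qwen35-9b-f16.gguf \
  --jinja \
  -p "What is 2+2?" \
  -n 256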

# Enable thinking:
./llama-cli -m qwen35-9b-instruct-nvfp4.gguf \
  --mmproj mmproj-qwen35-9b-f16.gguf \
  -p "<|im_start|>system\nenable_thinking=true<|im_end|>\n<|im_start|>user\nWhat is 2+2?<|im_end|>\n<|im_start|>assistant\n" \
  -n 512
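
The raw prompt in the second command follows the ChatML layout. A minimal sketch of building that string programmatically, assuming the system-message switch shown above is how this GGUF's patched template enables thinking (`build_prompt` is a hypothetical helper):

```python
def build_prompt(user_message: str, enable_thinking: bool = False) -> str:
    """Assemble a ChatML prompt, optionally prefixed with the thinking switch."""
    parts = []
    if enable_thinking:
        # System turn carrying the opt-in flag used in the CLI example.
        parts.append("<|im_start|>system\nenable_thinking=true<|im_end|>\n")
    parts.append(f"<|im_start|>user\n{user_message}<|im_end|>\n")
    # Leave the assistant turn open so the model continues from here.
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)


print(build_prompt("What is 2+2?", enable_thinking=True))
```

Building the string in code avoids shell-escaping mistakes with the `\n` sequences when the prompt is passed on the command line.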

llama-cpp-python

from llama_cpp import Llama

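llm = Llama(
    model_path="qwen35-9b-instruct-nvfp4.gguf",
    n_ctx=32768,
    n_gpu_layers=-1,  # offload all layers to the GPU
    # chat_format is omitted so the chat template embedded in the GGUF is used.
    # Image input additionally needs a multimodal chat_handler that loads
    # mmproj-qwen35-9b-f16.gguf; Llama() itself has no mmproj argument.
)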

# Without thinking (default):
response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Hello!"}],
)

# With thinking enabled:
response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "enable_thinking=true"},
        {"role": "user", "content": "Solve this step by step."},
    ],
)
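
When thinking is enabled, the reply may contain the reasoning as well as the final answer. The sketch below separates the two, assuming the reasoning is wrapped in `<think>...</think>` tags as in earlier Qwen3 releases; that is an assumption for this model, and `split_thinking` is a hypothetical helper.

```python
import re

# Non-greedy match so only the first reasoning block is captured.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thinking(text: str) -> tuple[str, str]:
    """Return (reasoning, answer); reasoning is empty if no <think> block is present."""
    match = THINK_RE.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer


# Usage with a chat completion result:
#   content = response["choices"][0]["message"]["content"]
#   reasoning, answer = split_thinking(content)
```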

Download

huggingface-cli download FreedomAISVR/Qwen3.5-9B-Instruct-NVFP4-GGUF \
  --include "*.gguf" \
  --local-dir .
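
The same download can be done from Python with `huggingface_hub.snapshot_download`, followed by a check that both files arrived. This is a sketch: it assumes the `huggingface_hub` package is installed, and `download` and `missing_files` are hypothetical helper names.

```python
from pathlib import Path

REPO_ID = "FreedomAISVR/Qwen3.5-9B-Instruct-NVFP4-GGUF"
EXPECTED_FILES = ["qwen35-9b-instruct-nvfp4.gguf", "mmproj-qwen35-9b-f16.gguf"]


def download(local_dir: str = ".") -> str:
    """Fetch every .gguf file in the repository into local_dir."""
    # Imported lazily so the check below also works without huggingface_hub.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=REPO_ID,
        allow_patterns=["*.gguf"],
        local_dir=local_dir,
    )


def missing_files(local_dir: str = ".") -> list[str]:
    """List the expected files that are not present in local_dir."""
    root = Path(local_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

After `download()`, an empty list from `missing_files()` means both the text model and the vision projector are in place.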

Conversion Pipeline

1. Download original model: Qwen/Qwen3.5-9B-Instruct
2. Convert to F16 GGUF (text):   convert_hf_to_gguf.py --outtype f16 --no-mtp
3. Extract mmproj (vision):      convert_hf_to_gguf.py --mmproj --outtype f16
4. Quantize to NVFP4:            llama-quantize.exe text-f16.gguf output-nvfp4.gguf NVFP4
5. Patch chat template:          Disable thinking by default (enable_thinking opt-in)
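
Steps 2 to 4 above can be assembled as command lines from a short script. This sketch only builds and prints the commands; the model directory and output names are hypothetical, and the `--no-mtp` flag and the `NVFP4` type name are copied from the pipeline as written rather than verified against a particular llama.cpp version.

```python
import shlex


def conversion_commands(model_dir: str, out_prefix: str) -> list[list[str]]:
    """Build the convert, mmproj-extract, and quantize commands for one model."""
    f16 = f"{out_prefix}-f16.gguf"
    return [
        # Step 2: text model to F16 GGUF.
        ["python", "convert_hf_to_gguf.py", model_dir,
         "--outtype", "f16", "--no-mtp", "--outfile", f16],
        # Step 3: vision projector to F16 GGUF.
        ["python", "convert_hf_to_gguf.py", model_dir,
         "--mmproj", "--outtype", "f16",
         "--outfile", f"mmproj-{out_prefix}-f16.gguf"],
        # Step 4: quantize the text model to NVFP4.
        ["llama-quantize", f16, f"{out_prefix}-nvfp4.gguf", "NVFP4"],
    ]


# Print the commands for review instead of executing them.
for cmd in conversion_commands("Qwen3.5-9B-Instruct", "qwen35-9b"):
    print(shlex.join(cmd))
```

To execute the steps, each list can be passed to `subprocess.run(cmd, check=True)`. Steps 1 and 5 (download and chat template patch) are not covered here.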

Hardware

| Component | Specification |
|---|---|
| GPU | NVIDIA RTX 5060 Ti (Blackwell) |
| VRAM | 16 GB GDDR7 |
| System RAM | 32 GB |
| Quantization time | ~1 min (9B) |

License

Apache 2.0 (same as the original Qwen3.5-9B-Instruct model).
