# Qwopus3.5-9B-v3-NVFP4

NVFP4 (W4A4 FP4) quantization of Jackrong/Qwopus3.5-9B-v3, a Qwen 3.5 9B reasoning and tool-calling model.

| | BF16 | NVFP4 (this) |
|---|---|---|
| Size | 18 GB | 9.6 GB |
| Format | bfloat16 | compressed-tensors NVFP4 |
| Serving | Any | vLLM v0.19+ |
## Quickstart (vLLM)

```shell
vllm serve mtecnic/Qwopus3.5-9B-v3-NVFP4 \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 8192 \
  --trust-remote-code \
  --kv-cache-dtype fp8_e5m2 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen35_coder
```

Requirements: vLLM v0.19+ with transformers==4.57.6 (the version shipped with the v0.19 Docker image). Do NOT upgrade transformers to 5.x inside the container — vLLM v0.19 uses its own internal Qwen 3.5 config which conflicts with transformers 5.x classes.
## Usage (OpenAI-compatible API)

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

response = client.chat.completions.create(
    model="mtecnic/Qwopus3.5-9B-v3-NVFP4",
    messages=[
        {"role": "user", "content": "Write a Python function to check if a number is prime."}
    ],
    max_tokens=512,
)
print(response.choices[0].message.content)
```

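Since the server is launched with `--enable-auto-tool-choice`, tool calling works through the standard OpenAI `tools` parameter. Below is a minimal sketch of a request payload; the `get_weather` function and its parameters are made-up illustrations, not something shipped with the model:

```python
# Hypothetical tool schema for illustration only; the function name and
# parameter shape are assumptions, not part of this checkpoint.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

# With a live server, pass the schema to the same client as above:
# response = client.chat.completions.create(
#     model="mtecnic/Qwopus3.5-9B-v3-NVFP4",
#     messages=[{"role": "user", "content": "Weather in Oslo?"}],
#     tools=tools,
#     tool_choice="auto",
# )
# Any emitted calls then appear in response.choices[0].message.tool_calls.
```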
## Limitations
- No vision/image capability — Vision encoder weights are not included (see below)
- vLLM only — Requires vLLM v0.19+ with transformers 4.57.6; not compatible with transformers 5.x in the serving container
- Partial quantization — 75% of attention layers (Gated DeltaNet) are kept at full precision, so the compression ratio is lower than fully-quantized models
- Tokenizer regex warning — A harmless Mistral-inherited regex pattern warning may appear; does not affect tokenization quality
## Important: Text-Only Model
This quantization contains text weights only. The base model (Jackrong/Qwopus3.5-9B-v3) is built on Qwen 3.5 9B, which has a multimodal architecture (`Qwen3_5ForConditionalGeneration`), but this checkpoint was quantized via `AutoModelForCausalLM`, so the vision encoder weights are not included.

The `config.json` retains the `Qwen3_5ForConditionalGeneration` architecture and `vision_config` solely for vLLM v0.19 compatibility (vLLM has no registered handler for `Qwen3_5ForCausalLM`). Image and video inputs will not work.
## Quantization Details

Quantized using llm-compressor with `QuantizationModifier(scheme="NVFP4")`.
Layers kept at full precision (not quantized):

- `lm_head` — Output head (248K vocab), precision-critical for token probabilities
- All `linear_attn.*` layers (24 of 32 decoder layers) — Gated DeltaNet linear attention layers use delta-rule memory updates and gating projections that are sensitive to quantization, per official llm-compressor guidance

Layers quantized to NVFP4:

- `self_attn.*` (q/k/v/o projections) — Full softmax attention on layers 3, 7, 11, 15, 19, 23, 27, 31
- `mlp.*` (gate/up/down projections) — SwiGLU MLP on all 32 layers
Calibration: 256 samples from allenai/tulu-3-sft-mixture, max_seq_length=512.
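NVFP4 stores weights as FP4 (E2M1) values with a scale per 16-element block (FP8 E4M3 scales plus a per-tensor FP32 scale in the real format). The rounding behavior can be sketched with a plain-float fake-quantizer; this is an illustration of the numerics, not the actual llm-compressor kernel:

```python
import numpy as np

# The 8 non-negative magnitudes representable in FP4 (E2M1); with a sign
# bit this gives the format's 16 code points. 6.0 is the largest value.
FP4_LEVELS = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fake_quant_nvfp4_block(x):
    """Fake-quantize one 16-element block: scale so the block's amax maps
    onto 6.0, round each magnitude to the nearest FP4 level, rescale back."""
    scale = max(float(np.abs(x).max()) / 6.0, 1e-12)
    mags = np.abs(x) / scale
    nearest = FP4_LEVELS[np.abs(mags[:, None] - FP4_LEVELS[None, :]).argmin(axis=1)]
    return np.sign(x) * nearest * scale, scale

block = np.linspace(-1.0, 1.0, 16)        # stand-in 16-element weight block
deq, scale = fake_quant_nvfp4_block(block)
```

This sketch uses a plain float scale to keep the rounding visible; the real format additionally quantizes the per-block scale itself to FP8.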
## Config Modifications for vLLM
The following config changes were made post-quantization for vLLM v0.19 compatibility:
| Field | Original | Modified | Reason |
|---|---|---|---|
| `model_type` | `qwen3_5_text` | `qwen3_5` | vLLM only recognizes `qwen3_5` |
| `architectures` | `Qwen3_5ForCausalLM` | `Qwen3_5ForConditionalGeneration` | vLLM only registers ConditionalGeneration |
| `tokenizer_class` | `TokenizersBackend` | `Qwen2TokenizerFast` | transformers 4.57.6 compat |
| `quantization_config.ignore` | `model.layers.*` paths | `model.language_model.layers.*` paths | Match weight key naming |
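These edits amount to a small post-processing step over `config.json`. The stand-in dict below is a minimal illustration (the real file has many more fields), and the example ignore entries are assumptions:

```python
import json

# Minimal stand-in for the checkpoint's config.json; fields beyond those
# discussed above are omitted, and the ignore entries are examples.
config = {
    "model_type": "qwen3_5_text",
    "architectures": ["Qwen3_5ForCausalLM"],
    "quantization_config": {
        "ignore": ["lm_head", "model.layers.0.linear_attn.in_proj"],
    },
}

# Apply the vLLM v0.19 compatibility edits.
config["model_type"] = "qwen3_5"
config["architectures"] = ["Qwen3_5ForConditionalGeneration"]
config["tokenizer_class"] = "Qwen2TokenizerFast"
config["quantization_config"]["ignore"] = [
    p.replace("model.layers.", "model.language_model.layers.")
    for p in config["quantization_config"]["ignore"]
]

patched = json.dumps(config, indent=2)  # ready to write back to disk
```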
## Architecture
Qwen 3.5 9B is a dense transformer with hybrid attention:
- 32 decoder layers, hidden_size=4096, vocab=248,320
- 75% Gated DeltaNet (linear attention), 25% full softmax attention
- GQA: 16 query heads, 4 KV heads
- SwiGLU MLP, RMSNorm, RoPE (theta=10M)
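Because only the 8 full-softmax layers keep a conventional KV cache (the Gated DeltaNet layers hold a fixed-size recurrent state instead), the fp8 cache at the quickstart's 8192-token context is small. A back-of-envelope sketch, assuming head_dim = hidden_size / query heads = 256 (an assumption; the checkpoint may define head_dim explicitly):

```python
# KV-cache size for the softmax-attention layers at fp8 (1 byte/value).
layers_with_kv = 8          # 25% of 32 decoder layers use full attention
kv_heads = 4                # GQA KV heads
head_dim = 4096 // 16       # assumed: hidden_size / query heads = 256
seq_len = 8192              # --max-model-len from the quickstart
bytes_per_value = 1         # fp8_e5m2 KV cache

# Factor of 2 covers both K and V.
kv_bytes = 2 * layers_with_kv * kv_heads * head_dim * seq_len * bytes_per_value
print(f"{kv_bytes / 2**20:.0f} MiB")   # prints "128 MiB"
```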
## License
Apache-2.0, same as the base model.
## Acknowledgments
- Jackrong/Qwopus3.5-9B-v3 — Base model
- Qwen/Qwen3.5-9B — Foundation model
- vllm-project/llm-compressor — Quantization toolkit