# Qwopus3.5-9B-v3 – W3A16 AutoRound (3-bit)
3-bit weight quantization of Jackrong/Qwopus3.5-9B-v3 using Intel AutoRound.
> **Note:** `lm_head` is not quantized (kept at bfloat16) due to a known vLLM incompatibility with a quantized `lm_head` for the `qwen3_5` architecture. This adds ~1.9 GB but ensures correct loading in vLLM without patching.
## Model details
| Property | Value |
|---|---|
| Base model | Jackrong/Qwopus3.5-9B-v3 |
| Architecture | Qwen3.5-9B hybrid (DeltaNet + GatedAttention) |
| Quantization | W3A16 – 3-bit weights, 16-bit activations |
| Group size | 128 |
| Symmetric | Yes |
| lm_head | Not quantized (bfloat16) |
| Format | auto_round (auto_gptq packing) |
| Tool calling | ✅ hermes parser (`<tool_call>` format) |
| Reasoning | ✅ Qwen3 thinking mode |
## Quantization parameters
```python
from auto_round import AutoRound

AutoRound(
    scheme="W3A16",
    sym=True,
    group_size=128,
    iters=100,
    nsamples=22,
    seqlen=128,
    quant_lm_head=False,         # lm_head stays bfloat16
    quant_nontext_module=False,  # vision tower stays bfloat16
    layer_config={
        "mtp": {"data_type": "bfloat16"},
        "mtp.fc": {"data_type": "bfloat16"},
    },
)
```
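To make the scheme concrete, here is a minimal sketch of what symmetric 3-bit group quantization (the "W3" in W3A16 with `sym=True`) does to a single weight group. This is illustrative only, not AutoRound's actual algorithm, which additionally tunes the rounding via sign-gradient descent; the tiny 4-element group stands in for the real `group_size=128`.

```python
# Illustrative: symmetric 3-bit quantization of one weight group.
# Not AutoRound itself, which also learns rounding offsets per weight.
def quantize_group_sym(w, bits=3):
    qmax = 2 ** (bits - 1) - 1          # symmetric 3-bit range: -3..3
    scale = max(abs(x) for x in w) / qmax
    q = [max(-qmax, min(qmax, round(x / scale))) for x in w]
    deq = [qi * scale for qi in q]      # activations stay 16-bit ("A16")
    return q, scale, deq

group = [0.31, -0.07, 0.88, -0.52]      # toy group; real groups hold 128 weights
q, scale, deq = quantize_group_sym(group)
print(q)  # each weight becomes an integer in [-3, 3] plus one shared scale
```

One bf16 scale is stored per 128-weight group, which is why the group size trades accuracy against storage overhead.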
Calibration: Python, PHP, SQL, Bash, Docker, API patterns – domain-specific for coding agents.
## Memory requirements
| Requirement | Size |
|---|---|
| Model on disk | ~11 GB |
| VRAM (weights only) | ~6.5 GB |
| KV cache (12 GB GPU, util=0.93) | ~4.7 GB remaining |
Fits on a single 12 GB VRAM GPU.
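The 12 GB claim can be sanity-checked with the figures above. A rough budget, ignoring activation workspace:

```python
# Rough KV-cache budget on a 12 GB GPU, using the numbers from the
# tables above (illustrative arithmetic only).
gpu_gb = 12.0
util = 0.93        # matches --gpu-memory-utilization below
weights_gb = 6.5   # quantized weights resident in VRAM
kv_budget = gpu_gb * util - weights_gb
print(round(kv_budget, 2))  # ~4.66 GB left for KV cache
```

Lowering `--max-model-len` or `--max-num-seqs` shrinks the KV cache vLLM tries to reserve if this budget turns out too tight on your card.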
## Run with vLLM
```shell
vllm serve YOUR_USERNAME/Qwopus3.5-9B-W3A16-AutoRound \
  --served-model-name qwopus-9b \
  --port 8000 \
  --host 0.0.0.0 \
  --reasoning-parser qwen3 \
  --language-model-only \
  --max-model-len 65536 \
  --gpu-memory-utilization 0.93 \
  --max-num-seqs 16 \
  --max-num-batched-tokens 8192 \
  --enable-prefix-caching \
  --dtype half \
  --enable-auto-tool-choice \
  --tool-call-parser hermes
```
> Use `--tool-call-parser hermes` – the model outputs `<tool_call>` tags (Hermes format), not the `qwen3_coder` format.
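For readers unfamiliar with the Hermes format, the model's raw output wraps a JSON object in `<tool_call>` tags. A simplified sketch of what the hermes parser extracts (the sample output string here is illustrative, not a captured model response):

```python
# Simplified view of Hermes-style tool-call extraction: find each
# <tool_call> block and parse its JSON payload.
import json
import re

raw = (
    "I'll run that for you.\n"
    "<tool_call>\n"
    '{"name": "execute_code", '
    '"arguments": {"code": "print(1+1)", "language": "python"}}\n'
    "</tool_call>"
)

calls = [
    json.loads(m)
    for m in re.findall(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", raw, re.DOTALL)
]
print(calls[0]["name"])  # execute_code
```

vLLM performs this extraction server-side, so clients receive structured `tool_calls` in the response instead of the raw tags.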
## Usage
```python
from openai import OpenAI
import json

client = OpenAI(base_url="http://localhost:8000/v1", api_key="empty")

# Basic chat
response = client.chat.completions.create(
    model="qwopus-9b",
    messages=[{"role": "user", "content": "Write a Python async REST client."}],
    max_tokens=1024,
)
print(response.choices[0].message.content)

# Tool calling
tools = [{
    "type": "function",
    "function": {
        "name": "execute_code",
        "description": "Execute Python code and return output",
        "parameters": {
            "type": "object",
            "properties": {
                "code": {"type": "string"},
                "language": {"type": "string", "enum": ["python", "bash"]}
            },
            "required": ["code"]
        }
    }
}]

response = client.chat.completions.create(
    model="qwopus-9b",
    messages=[{"role": "user", "content": "Calculate fibonacci up to 10 terms."}],
    tools=tools,
    tool_choice="auto",
)

msg = response.choices[0].message
if msg.tool_calls:
    print(f"Tool: {msg.tool_calls[0].function.name}")
    print(f"Args: {json.loads(msg.tool_calls[0].function.arguments)}")
## Known limitations
- `lm_head` not quantized – vLLM's `Qwen3_5ForCausalLM` currently does not support a quantized `lm_head`, so it is kept at bfloat16
- Add `--enforce-eager` if you encounter CUDA graph issues