Qwopus3.5-9B-v3 — W3A16 AutoRound (3-bit)

3-bit weight quantization of Jackrong/Qwopus3.5-9B-v3 using Intel AutoRound.

Note: lm_head is not quantized (kept at bfloat16) due to a known vLLM incompatibility with quantized lm_head for the qwen3_5 architecture. This adds ~1.9 GB but ensures correct loading in vLLM without patching.

Model details

Property	Value
Base model	Jackrong/Qwopus3.5-9B-v3
Architecture	Qwen3.5-9B hybrid (DeltaNet + GatedAttention)
Quantization	W3A16 — 3-bit weights, 16-bit activations
Group size	128
Symmetric	Yes
lm_head	Not quantized (bfloat16)
Format	auto_round (auto_gptq packing)
Tool calling	✅ hermes parser (`<tool_call>` format)
Reasoning	✅ Qwen3 thinking mode

Quantization parameters

AutoRound(
    scheme="W3A16",
    sym=True,
    group_size=128,
    iters=100,
    nsamples=22,
    seqlen=128,
    quant_lm_head=False,         # lm_head stays bfloat16
    quant_nontext_module=False,  # vision tower stays bfloat16
    layer_config={
        "mtp":    {"data_type": "bfloat16"},
        "mtp.fc": {"data_type": "bfloat16"},
    },
)

Calibration data: Python, PHP, SQL, Bash, Docker, and API-pattern samples, chosen to be domain-specific for coding agents.
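To make the W3A16, sym=True, group_size=128 settings concrete, here is a minimal sketch of symmetric group-wise quantization using plain round-to-nearest. It is not AutoRound's algorithm: AutoRound additionally tunes rounding and clipping during calibration. The scale convention (max-abs divided by the largest positive level), the 16-bit scale size, and all helper names are assumptions made for illustration.

```python
def quantize_group_sym(weights, bits=3):
    """Round-to-nearest symmetric quantization of one weight group.

    Returns (integer codes, scale). Illustrative only: AutoRound tunes
    its rounding during calibration instead of using plain round().
    """
    qmax = 2 ** (bits - 1) - 1      # 3 for 3-bit signed codes
    qmin = -(2 ** (bits - 1))       # -4 for 3-bit signed codes
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0  # assumed scale convention
    codes = [min(qmax, max(qmin, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_group(codes, scale):
    """Map integer codes back to approximate float weights."""
    return [c * scale for c in codes]


def effective_bits_per_weight(bits=3, group_size=128, scale_bits=16):
    """Payload bits plus one shared scale per group, amortized per weight."""
    return bits + scale_bits / group_size


group = [3.0, -1.4, 0.0, 0.8]   # toy group; real groups hold 128 weights
codes, scale = quantize_group_sym(group)
restored = dequantize_group(codes, scale)
bits_per_weight = effective_bits_per_weight()
```

Under these assumptions each quantized weight costs slightly more than 3 bits once the per-group scale is amortized. The estimate ignores zero-points, packing padding, and the layers that stay in bfloat16.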

Memory requirements


Model on disk	~11 GB
VRAM (weights only)	~6.5 GB
KV cache (12 GB GPU, util=0.93)	~3.9 GB (~32k tokens)

Fits on a single 12 GB VRAM GPU.
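The figures above can be cross-checked with simple arithmetic. The sketch below assumes the GPU exposes a full 12 GB and that whatever remains after weights and KV cache is headroom for activations and runtime overhead. The function name is hypothetical and that reading of the remainder is an assumption, not a measurement.

```python
def vram_budget(gpu_gb=12.0, utilization=0.93, weights_gb=6.5, kv_cache_gb=3.9):
    """Split the memory vLLM may use into weights, KV cache, and leftover headroom."""
    usable = gpu_gb * utilization          # memory vLLM is allowed to claim
    after_weights = usable - weights_gb    # what is left once weights are loaded
    headroom = after_weights - kv_cache_gb # assumed: activations and runtime overhead
    return {
        "usable_gb": round(usable, 2),
        "after_weights_gb": round(after_weights, 2),
        "headroom_gb": round(headroom, 2),
    }


budget = vram_budget()
print(budget)
```

If the headroom value comes out negative for your GPU or settings, lowering --max-model-len or --max-num-seqs is the usual first thing to try.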

Run with vLLM

vllm serve YOUR_USERNAME/Qwopus3.5-9B-W3A16-AutoRound \
    --served-model-name qwopus-9b \
    --port 8000 \
    --host 0.0.0.0 \
    --reasoning-parser qwen3 \
    --language-model-only \
    --max-model-len 65536 \
    --gpu-memory-utilization 0.93 \
    --max-num-seqs 16 \
    --max-num-batched-tokens 8192 \
    --enable-prefix-caching \
    --dtype half \
    --enable-auto-tool-choice \
    --tool-call-parser hermes

Use --tool-call-parser hermes — the model outputs <tool_call> tags (Hermes format), not the qwen3_coder format.
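For intuition about what the hermes parser extracts, here is a simplified sketch that pulls tool calls out of raw model text. It assumes the common Hermes convention of a JSON object with "name" and "arguments" keys inside each <tool_call> block. It is not vLLM's implementation, and the function name is hypothetical.

```python
import json
import re

# Capture everything between a <tool_call> tag and its closing tag.
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def parse_hermes_tool_calls(text):
    """Return a list of {"name", "arguments"} dicts found in the model output."""
    calls = []
    for raw in TOOL_CALL_RE.findall(text):
        try:
            payload = json.loads(raw.strip())
        except json.JSONDecodeError:
            continue  # skip malformed blocks instead of failing the whole parse
        if not isinstance(payload, dict):
            continue  # assumed format requires a JSON object
        calls.append({
            "name": payload.get("name"),
            "arguments": payload.get("arguments", {}),
        })
    return calls


sample = (
    "I will run the code.\n"
    '<tool_call>\n{"name": "execute_code", "arguments": {"code": "print(1)"}}\n</tool_call>'
)
parsed = parse_hermes_tool_calls(sample)
```

When the server runs with --enable-auto-tool-choice and --tool-call-parser hermes, this extraction happens server-side and the client receives structured tool_calls, so a parser like this is only useful for debugging raw completions.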

Usage

from openai import OpenAI
import json

client = OpenAI(base_url="http://localhost:8000/v1", api_key="empty")

# Basic chat
response = client.chat.completions.create(
    model="qwopus-9b",
    messages=[{"role": "user", "content": "Write a Python async REST client."}],
    max_tokens=1024,
)
print(response.choices[0].message.content)

# Tool calling
tools = [{
    "type": "function",
    "function": {
        "name": "execute_code",
        "description": "Execute Python code and return output",
        "parameters": {
            "type": "object",
            "properties": {
                "code": {"type": "string"},
                "language": {"type": "string", "enum": ["python", "bash"]}
            },
            "required": ["code"]
        }
    }
}]

response = client.chat.completions.create(
    model="qwopus-9b",
    messages=[{"role": "user", "content": "Calculate fibonacci up to 10 terms."}],
    tools=tools,
    tool_choice="auto",
)

msg = response.choices[0].message
if msg.tool_calls:
    call = msg.tool_calls[0]
    # Arguments arrive as a JSON string; decode them before use
    args = json.loads(call.function.arguments)
    print(f"Tool: {call.function.name}")
    print(f"Args: {args}")
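After the model requests a tool, the usual next step is to run the tool and send its result back in a follow-up request. The sketch below only builds that follow-up message list in the OpenAI chat format. The run_tool dispatcher is a hypothetical stand-in that does not execute anything, and the helper names are assumptions for illustration.

```python
import json


def run_tool(name, arguments):
    """Hypothetical dispatcher; returns a placeholder instead of executing code."""
    if name == "execute_code":
        language = arguments.get("language", "python")
        return f"(placeholder output for {language} code)"
    raise ValueError(f"unknown tool: {name}")


def build_followup_messages(messages, tool_call_id, name, arguments_json):
    """Append the assistant tool call and its tool result to the conversation."""
    arguments = json.loads(arguments_json)  # tool arguments arrive as a JSON string
    result = run_tool(name, arguments)
    return messages + [
        {
            "role": "assistant",
            "content": "",
            "tool_calls": [{
                "id": tool_call_id,
                "type": "function",
                "function": {"name": name, "arguments": arguments_json},
            }],
        },
        {"role": "tool", "tool_call_id": tool_call_id, "content": result},
    ]


history = [{"role": "user", "content": "Calculate fibonacci up to 10 terms."}]
followup = build_followup_messages(
    history, "call_0", "execute_code", '{"code": "print(1)"}'
)
```

In a real loop, tool_call_id, name, and arguments_json would come from msg.tool_calls[0], and followup would be passed as messages to another client.chat.completions.create call with the same tools list.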

Known limitations

lm_head is not quantized: vLLM's Qwen3_5ForCausalLM does not currently support a quantized lm_head, so it is kept at bfloat16.
If you encounter CUDA graph issues, add --enforce-eager to the vllm serve command.