# Qwopus3.5-9B-v3: W3A16 AutoRound (3-bit)

3-bit weight quantization of Jackrong/Qwopus3.5-9B-v3 using Intel AutoRound.

**Note:** `lm_head` is not quantized (kept at bfloat16) due to a known vLLM incompatibility with a quantized `lm_head` in the qwen3_5 architecture. This adds ~1.9 GB but ensures correct loading in vLLM without patching.

## Model details

| Property | Value |
|---|---|
| Base model | Jackrong/Qwopus3.5-9B-v3 |
| Architecture | Qwen3.5-9B hybrid (DeltaNet + GatedAttention) |
| Quantization | W3A16 (3-bit weights, 16-bit activations) |
| Group size | 128 |
| Symmetric | Yes |
| lm_head | Not quantized (bfloat16) |
| Format | auto_round (auto_gptq packing) |
| Tool calling | ✅ hermes parser (`<tool_call>` format) |
| Reasoning | ✅ Qwen3 thinking mode |

## Quantization parameters

```python
AutoRound(
    scheme="W3A16",
    sym=True,
    group_size=128,
    iters=100,
    nsamples=22,
    seqlen=128,
    quant_lm_head=False,         # lm_head stays bfloat16
    quant_nontext_module=False,  # vision tower stays bfloat16
    layer_config={
        "mtp":    {"data_type": "bfloat16"},
        "mtp.fc": {"data_type": "bfloat16"},
    },
)
```

Calibration data: Python, PHP, SQL, Bash, Docker, and API patterns; domain-specific for coding agents.

## Memory requirements

| Component | Size |
|---|---|
| Model on disk | ~11 GB |
| VRAM (weights only) | ~6.5 GB |
| KV cache (12 GB GPU, util=0.93) | 3.9 GB (32k tokens) |

Fits on a single 12 GB VRAM GPU.
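As a rough cross-check of the table above, here is a back-of-the-envelope estimate of the weight footprint. It assumes ~9B quantized parameters and one 16-bit scale per 128-weight group (both assumptions, not values read from the model config), and takes the ~1.9 GB bf16 `lm_head` figure from the note above:

```python
# Back-of-the-envelope weight-memory estimate for W3A16, group size 128.
# Assumptions: ~9e9 quantized params, one fp16 scale per 128-weight group,
# bf16 lm_head at ~1.9 GB (per the note above).
params = 9e9
bits_per_weight = 3 + 16 / 128        # 3-bit weight + amortized fp16 scale
quantized_gb = params * bits_per_weight / 8 / 1e9
lm_head_gb = 1.9
total_gb = quantized_gb + lm_head_gb
print(f"quantized weights ≈ {quantized_gb:.1f} GB, total ≈ {total_gb:.1f} GB")
# → quantized weights ≈ 3.5 GB, total ≈ 5.4 GB
```

The gap between this ~5.4 GB estimate and the ~6.5 GB figure above is packing metadata and runtime overhead, which this sketch deliberately ignores.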

## Run with vLLM

```shell
vllm serve YOUR_USERNAME/Qwopus3.5-9B-W3A16-AutoRound \
    --served-model-name qwopus-9b \
    --port 8000 \
    --host 0.0.0.0 \
    --reasoning-parser qwen3 \
    --language-model-only \
    --max-model-len 65536 \
    --gpu-memory-utilization 0.93 \
    --max-num-seqs 16 \
    --max-num-batched-tokens 8192 \
    --enable-prefix-caching \
    --dtype half \
    --enable-auto-tool-choice \
    --tool-call-parser hermes
```

Use `--tool-call-parser hermes`: the model emits `<tool_call>` tags (Hermes format), not the qwen3_coder format.
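For reference, the Hermes wire format is plain text with one JSON object per `<tool_call>` block. vLLM's parser extracts these for you when the flag is set; a minimal sketch of what that extraction looks like (the `raw` string is an illustrative example, not captured model output):

```python
import json
import re

# Hermes-style output: JSON objects wrapped in <tool_call> ... </tool_call>.
raw = (
    "Let me run that.\n"
    "<tool_call>\n"
    '{"name": "execute_code", "arguments": {"code": "print(42)", "language": "python"}}\n'
    "</tool_call>"
)

# Non-greedy match per block; backtracks past nested braces until the
# closing </tool_call> tag lines up.
calls = [
    json.loads(m.group(1))
    for m in re.finditer(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", raw, re.DOTALL)
]
print(calls[0]["name"])  # execute_code
```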

## Usage

```python
from openai import OpenAI
import json

client = OpenAI(base_url="http://localhost:8000/v1", api_key="empty")

# Basic chat
response = client.chat.completions.create(
    model="qwopus-9b",
    messages=[{"role": "user", "content": "Write a Python async REST client."}],
    max_tokens=1024,
)
print(response.choices[0].message.content)

# Tool calling
tools = [{
    "type": "function",
    "function": {
        "name": "execute_code",
        "description": "Execute Python code and return output",
        "parameters": {
            "type": "object",
            "properties": {
                "code": {"type": "string"},
                "language": {"type": "string", "enum": ["python", "bash"]}
            },
            "required": ["code"]
        }
    }
}]

response = client.chat.completions.create(
    model="qwopus-9b",
    messages=[{"role": "user", "content": "Calculate fibonacci up to 10 terms."}],
    tools=tools,
    tool_choice="auto",
)

msg = response.choices[0].message
if msg.tool_calls:
    print(f"Tool: {msg.tool_calls[0].function.name}")
    print(f"Args: {msg.tool_calls[0].function.arguments}")
```

## Known limitations

- `lm_head` is not quantized: vLLM's `Qwen3_5ForCausalLM` currently does not support a quantized `lm_head`, so it is kept at bfloat16
- Add `--enforce-eager` if you encounter CUDA graph issues