Qwopus3.6-27B-Coder-FP8 INT4 AutoRound

W4A16 INT4 AutoRound quantization of Jackrong/Qwopus3.6-27B-Coder-FP8.

  • Quantization: AutoRound INT4, group size 128, symmetric, auto_round:auto_gptq.
  • Source checkpoint: Jackrong/Qwopus3.6-27B-Coder-FP8 at the time of quantization.
  • Non-text multimodal modules are kept in their original precision.
  • Native Qwen3.5/Qwen3.6 MTP is preserved. mtp.fc is stored as BF16 mtp.fc.weight, not packed mtp.fc.qweight, so vLLM can load the MTP drafter.
  • Produced on one RunPod H200 SXM with AutoRound nightly.

vLLM

vllm serve WaveCut/Qwopus3.6-27B-Coder-FP8-int4-AutoRound \
  --dtype bfloat16 \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.85 \
  --trust-remote-code \
  --speculative-config '{"method":"mtp","num_speculative_tokens":1}'

For long-context serving, raise --max-model-len according to your KV-cache budget.

vLLM CUDA 13 Smoke and Benchmarks

Smoke and throughput checks were run on 2026-06-14 with vllm 0.23.0, torch 2.11.0+cu130, Python 3.12.3, one NVIDIA B200, and NVIDIA driver 580.105.08. CUDA Toolkit release notes document per-release minimum driver requirements; in this run, a B200 host with driver 570.* failed CUDA 13 initialization, while driver 580.105.08 worked.

The working RunPod image was runpod/pytorch:1.0.3-cu1300-torch291-ubuntu2404 (cu13-pytorch2.9, template 0uy1f6v18r). After vLLM install, nvidia-cutlass-dsl-libs-cu13 was force-reinstalled once to fix a CUTLASS RECORD mismatch; after that vLLM used the FlashInfer GDN prefill kernel.

vLLM resolved this model as Qwen3_5ForConditionalGeneration, loaded the AutoRound/AutoGPTQ path with MarlinLinearKernel for AutoGPTQLinearMethod, and completed generation. MTP speculative decoding resolved Qwen3_5MTP, loaded without missing-parameter warnings, shared embedding/lm_head with the draft model, and completed generation.

Benchmarks used vllm bench throughput, fixed random prompts, max_model_len=8192, tensor parallel size 1, and local model files on overlay disk. TPS values are vLLM timed-section values; wall time includes model load, compile, CUDA graph capture, and warmup.

case input -> output prompts gpu util mode total tok/s prompt tok/s est output tok/s est peak VRAM GiB max W
balanced_graph_u65 1024 -> 128 64 0.65 graph 6369.6 5661.9 707.7 117.6 850.4
prefill_graph_u65 4096 -> 16 32 0.65 graph 7416.7 7387.8 28.9 117.6 857.4
decode_graph_u65 128 -> 256 64 0.65 graph 4221.6 1407.2 2814.4 116.6 819.7
balanced_eager_u65 1024 -> 128 32 0.65 eager 2453.9 2181.3 272.7 118.6 823.9
balanced_graph_u85 1024 -> 128 64 0.85 graph 6614.3 5879.4 734.9 153.9 851.3
balanced_mtp_u65 1024 -> 128 32 0.65 graph + MTP 4796.2 4263.3 532.9 118.1 846.5

First graph runs had cold costs around 77-80 seconds for torch.compile plus CUDA graph capture/profile. Repeated same-layout graph runs loaded the compile cache much faster. Eager mode was substantially slower than graph mode on this workload.

24GB RTX 3090 vLLM Smoke

A small fit smoke was run on 2026-06-14 on one RTX 3090 24GB RunPod host with NVIDIA driver 580.159.03 (nvidia-smi CUDA 13.0), vllm 0.23.0, torch 2.11.0+cu128, and runpod/pytorch:1.0.2-cu1281-torch280-ubuntu2404.

The smoke used max_model_len=32768, kv_cache_dtype=fp8, dtype=bfloat16, max_num_seqs=1, max_num_batched_tokens=2048, chunked prefill enabled, prefix caching disabled, and one 128 -> 16 random request. The vLLM Qwen3.5/Qwen3.6 recipe recommends MTP-1 speculative decoding with prefix caching disabled for latency-sensitive low-concurrency serving.

mode load format result peak VRAM KV cache 32k concurrency smoke throughput
no MTP fastsafetensors pass 22174 MiB 64170 tokens 1.96x 50.33 total tok/s, 5.59 output tok/s
MTP-1 safetensors pass 24110 MiB 60681 tokens 1.85x 28.94 total tok/s, 3.22 output tok/s
MTP-1 fastsafetensors fail 23778 MiB n/a n/a CUDA OOM while allocating a 3.00 GiB staging buffer

Recommended 24GB command shape:

vllm serve WaveCut/Qwopus3.6-27B-Coder-FP8-int4-AutoRound \
  --dtype bfloat16 \
  --max-model-len 32768 \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.95 \
  --max-num-seqs 1 \
  --max-num-batched-tokens 2048 \
  --enable-chunked-prefill \
  --no-enable-prefix-caching \
  --load-format safetensors

For MTP-1 on 24GB, keep --load-format safetensors and add:

--speculative-config '{"method":"mtp","num_speculative_tokens":1}'

Provenance

This repo was generated from the public Apache-2.0 source checkpoint. It keeps the upstream tokenizer, processor, chat template, vision config, and Qwen3.5 MTP config intact.

Downloads last month
35
Safetensors
Model size
6B params
Tensor type
I32
BF16
F16
Inference Providers NEW
This model isn't deployed by any Inference Provider. 馃檵 Ask for provider support

Model tree for WaveCut/Qwopus3.6-27B-Coder-FP8-int4-AutoRound

Quantized
(2)
this model