Qwopus3.6-27B-Coder-FP8 INT4 AutoRound
W4A16 INT4 AutoRound quantization of Jackrong/Qwopus3.6-27B-Coder-FP8.
- Quantization: AutoRound INT4, group size 128, symmetric, auto_round:auto_gptq.
- Source checkpoint: Jackrong/Qwopus3.6-27B-Coder-FP8 at the time of quantization.
- Non-text multimodal modules are kept in their original precision.
- Native Qwen3.5/Qwen3.6 MTP is preserved. mtp.fc is stored as BF16 mtp.fc.weight, not packed mtp.fc.qweight, so vLLM can load the MTP drafter.
- Produced on one RunPod H200 SXM with AutoRound nightly.
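One way to confirm the MTP projection layout described above is to inspect the tensor names in the checkpoint. The sketch below is illustrative only: it assumes the repo ships a standard sharded-checkpoint index (a JSON file with a weight_map key), and the helper names check_mtp_fc and tensor_names_from_index are hypothetical. The example list is synthetic and is not read from the real repo.

```python
import json
from pathlib import Path


def check_mtp_fc(tensor_names):
    """Return (has_bf16_weight, has_packed_qweight) for the mtp.fc projection."""
    names = list(tensor_names)
    has_weight = any(name.endswith("mtp.fc.weight") for name in names)
    has_qweight = any(name.endswith("mtp.fc.qweight") for name in names)
    return has_weight, has_qweight


def tensor_names_from_index(index_path):
    """Read tensor names from a sharded-checkpoint index file.

    Assumes the usual layout: {"weight_map": {tensor_name: shard_file, ...}}.
    """
    index = json.loads(Path(index_path).read_text())
    return index["weight_map"].keys()


# Synthetic names for illustration only.
example_names = [
    "model.layers.0.mlp.down_proj.qweight",
    "mtp.fc.weight",
]
has_weight, has_qweight = check_mtp_fc(example_names)
print(f"mtp.fc.weight present: {has_weight}, mtp.fc.qweight present: {has_qweight}")
```

With a local copy of the repo, passing the result of tensor_names_from_index to check_mtp_fc should report the unpacked weight and no packed one if the layout matches the description.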
vLLM
vllm serve WaveCut/Qwopus3.6-27B-Coder-FP8-int4-AutoRound \
--dtype bfloat16 \
--max-model-len 4096 \
--gpu-memory-utilization 0.85 \
--trust-remote-code \
--speculative-config '{"method":"mtp","num_speculative_tokens":1}'
For long-context serving, raise --max-model-len according to your KV-cache budget.
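Once the server is up, it can be queried through vLLM's OpenAI-compatible chat endpoint. The following is a minimal client sketch using only the Python standard library. It assumes the server listens on vLLM's default http://localhost:8000; the helper names build_chat_request and chat, the prompt, and the max_tokens value are illustrative choices, not part of this repo.

```python
import json
import urllib.request

MODEL = "WaveCut/Qwopus3.6-27B-Coder-FP8-int4-AutoRound"


def build_chat_request(prompt, base_url="http://localhost:8000", max_tokens=128):
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the request and return the reply text; needs a running server."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Building the request does not contact the server.
request = build_chat_request("Write a Python function that reverses a string.")
print(request.full_url)
```

With --max-model-len 4096, the prompt tokens plus max_tokens have to fit inside 4096 tokens, so long prompts need a larger context setting.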
vLLM CUDA 13 Smoke and Benchmarks
Smoke and throughput checks were run on 2026-06-14 with vllm 0.23.0, torch 2.11.0+cu130, Python 3.12.3, one NVIDIA B200, and NVIDIA driver 580.105.08. CUDA Toolkit release notes document per-release minimum driver requirements; in this run, a B200 host with driver 570.* failed CUDA 13 initialization, while driver 580.105.08 worked.
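A quick pre-flight check of the driver version can save a failed launch. The sketch below compares the driver's major version against 580, which is only the value observed to work in this run and not an official minimum; consult the CUDA Toolkit release notes for the actual requirement. The helper names are hypothetical, and "570.0.0" is a stand-in for the unspecified 570.* driver.

```python
import subprocess

# Assumption: taken from this run (580.105.08 worked, 570.* failed).
OBSERVED_WORKING_MAJOR = 580


def driver_major(version):
    """Extract the major component from a driver version like '580.105.08'."""
    return int(version.strip().split(".")[0])


def driver_looks_ok(version, minimum_major=OBSERVED_WORKING_MAJOR):
    """Return True when the driver major version meets the observed threshold."""
    return driver_major(version) >= minimum_major


def query_driver_version():
    """Ask nvidia-smi for the driver version of the first GPU."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines()[0].strip()


# Pure checks that run without a GPU.
print(driver_looks_ok("580.105.08"))
print(driver_looks_ok("570.0.0"))
```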
The working RunPod image was runpod/pytorch:1.0.3-cu1300-torch291-ubuntu2404 (cu13-pytorch2.9, template 0uy1f6v18r). After vLLM install, nvidia-cutlass-dsl-libs-cu13 was force-reinstalled once to fix a CUTLASS RECORD mismatch; after that vLLM used the FlashInfer GDN prefill kernel.
vLLM resolved this model as Qwen3_5ForConditionalGeneration, loaded the AutoRound/AutoGPTQ path with MarlinLinearKernel for AutoGPTQLinearMethod, and completed generation. MTP speculative decoding resolved Qwen3_5MTP, loaded without missing-parameter warnings, shared embedding/lm_head with the draft model, and completed generation.
Benchmarks used vllm bench throughput, fixed random prompts, max_model_len=8192, tensor parallel size 1, and local model files on overlay disk. TPS values are vLLM timed-section values; wall time includes model load, compile, CUDA graph capture, and warmup.
| case | input -> output tokens | prompts | gpu memory utilization | mode | total tok/s | prompt tok/s (est.) | output tok/s (est.) | peak VRAM (GiB) | max power (W) |
|---|---|---|---|---|---|---|---|---|---|
| balanced_graph_u65 | 1024 -> 128 | 64 | 0.65 | graph | 6369.6 | 5661.9 | 707.7 | 117.6 | 850.4 |
| prefill_graph_u65 | 4096 -> 16 | 32 | 0.65 | graph | 7416.7 | 7387.8 | 28.9 | 117.6 | 857.4 |
| decode_graph_u65 | 128 -> 256 | 64 | 0.65 | graph | 4221.6 | 1407.2 | 2814.4 | 116.6 | 819.7 |
| balanced_eager_u65 | 1024 -> 128 | 32 | 0.65 | eager | 2453.9 | 2181.3 | 272.7 | 118.6 | 823.9 |
| balanced_graph_u85 | 1024 -> 128 | 64 | 0.85 | graph | 6614.3 | 5879.4 | 734.9 | 153.9 | 851.3 |
| balanced_mtp_u65 | 1024 -> 128 | 32 | 0.65 | graph + MTP | 4796.2 | 4263.3 | 532.9 | 118.1 | 846.5 |
First graph runs had cold costs around 77-80 seconds for torch.compile plus CUDA graph capture/profile. Repeated same-layout graph runs loaded the compile cache much faster. Eager mode was substantially slower than graph mode on this workload.
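The estimated prompt and output rates in the benchmark table appear consistent with splitting total tok/s in proportion to the input and output lengths. This is an inference from the numbers, not a documented vLLM formula, and the function name split_throughput is hypothetical. A minimal sketch of that arithmetic:

```python
def split_throughput(total_tok_s, input_len, output_len):
    """Split a total token rate proportionally by input and output length."""
    total_len = input_len + output_len
    prompt_rate = total_tok_s * input_len / total_len
    output_rate = total_tok_s * output_len / total_len
    return prompt_rate, output_rate


# (total tok/s, input tokens, output tokens) copied from the table.
cases = {
    "balanced_graph_u65": (6369.6, 1024, 128),
    "prefill_graph_u65": (7416.7, 4096, 16),
    "decode_graph_u65": (4221.6, 128, 256),
}

for name, (total, n_in, n_out) in cases.items():
    prompt_rate, output_rate = split_throughput(total, n_in, n_out)
    print(f"{name}: prompt {prompt_rate:.1f} tok/s, output {output_rate:.1f} tok/s")
```

The results agree with the table to within rounding of the published totals.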
24GB RTX 3090 vLLM Smoke
A small fit smoke was run on 2026-06-14 on one RTX 3090 24GB RunPod host with NVIDIA driver 580.159.03 (nvidia-smi CUDA 13.0), vllm 0.23.0, torch 2.11.0+cu128, and runpod/pytorch:1.0.2-cu1281-torch280-ubuntu2404.
The smoke used max_model_len=32768, kv_cache_dtype=fp8, dtype=bfloat16, max_num_seqs=1, max_num_batched_tokens=2048, chunked prefill enabled, prefix caching disabled, and one 128 -> 16 random request. The vLLM Qwen3.5/Qwen3.6 recipe recommends MTP-1 speculative decoding with prefix caching disabled for latency-sensitive low-concurrency serving.
| mode | load format | result | peak VRAM | KV cache | 32k concurrency | smoke throughput |
|---|---|---|---|---|---|---|
| no MTP | fastsafetensors | pass | 22174 MiB | 64170 tokens | 1.96x | 50.33 total tok/s, 5.59 output tok/s |
| MTP-1 | safetensors | pass | 24110 MiB | 60681 tokens | 1.85x | 28.94 total tok/s, 3.22 output tok/s |
| MTP-1 | fastsafetensors | fail | 23778 MiB | n/a | n/a | CUDA OOM while allocating a 3.00 GiB staging buffer |
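The 32k concurrency column matches the KV-cache capacity divided by the 32768-token context length. The sketch below reproduces that ratio from the table values. It is a simplification: vLLM's own calculation may account for more than this ratio, and the function name context_concurrency is hypothetical.

```python
def context_concurrency(kv_cache_tokens, max_model_len=32768):
    """How many full-length requests fit in the KV cache at once."""
    return kv_cache_tokens / max_model_len


# KV-cache sizes (in tokens) copied from the two passing rows.
for mode, kv_tokens in [("no MTP", 64170), ("MTP-1", 60681)]:
    print(f"{mode}: {context_concurrency(kv_tokens):.2f}x")
```

Read this way, both passing configurations hold a single full 32k request with some headroom, which fits the max_num_seqs=1 setting.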
Recommended 24GB command shape:
vllm serve WaveCut/Qwopus3.6-27B-Coder-FP8-int4-AutoRound \
--dtype bfloat16 \
--max-model-len 32768 \
--kv-cache-dtype fp8 \
--gpu-memory-utilization 0.95 \
--max-num-seqs 1 \
--max-num-batched-tokens 2048 \
--enable-chunked-prefill \
--no-enable-prefix-caching \
--load-format safetensors
For MTP-1 on 24GB, keep --load-format safetensors and add:
--speculative-config '{"method":"mtp","num_speculative_tokens":1}'
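For scripted launches, the same command can be assembled programmatically so the MTP flag is toggled in one place. This is a sketch only: the function name build_serve_command is hypothetical, and every flag and value is copied from the 24GB command shape above.

```python
import json
import shlex

MODEL = "WaveCut/Qwopus3.6-27B-Coder-FP8-int4-AutoRound"


def build_serve_command(mtp=False):
    """Return the vllm serve argument list for the 24GB configuration."""
    args = [
        "vllm", "serve", MODEL,
        "--dtype", "bfloat16",
        "--max-model-len", "32768",
        "--kv-cache-dtype", "fp8",
        "--gpu-memory-utilization", "0.95",
        "--max-num-seqs", "1",
        "--max-num-batched-tokens", "2048",
        "--enable-chunked-prefill",
        "--no-enable-prefix-caching",
        "--load-format", "safetensors",
    ]
    if mtp:
        # MTP-1 speculative decoding, passed as compact JSON.
        spec = {"method": "mtp", "num_speculative_tokens": 1}
        args += ["--speculative-config", json.dumps(spec, separators=(",", ":"))]
    return args


# shlex.join quotes the JSON argument so the line is safe to paste into a shell.
print(shlex.join(build_serve_command(mtp=True)))
```

The returned list can also be passed directly to subprocess.run, which avoids shell quoting altogether.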
Provenance
This repo was generated from the public Apache-2.0 source checkpoint. It keeps the upstream tokenizer, processor, chat template, vision config, and Qwen3.5 MTP config intact.
Model tree for WaveCut/Qwopus3.6-27B-Coder-FP8-int4-AutoRound
Base model
Jackrong/Qwopus3.6-27B-v2