Libraries

How to use WaveCut/Qwopus3.6-27B-Coder-FP8-int4-AutoRound with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="WaveCut/Qwopus3.6-27B-Coder-FP8-int4-AutoRound")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForMultimodalLM

processor = AutoProcessor.from_pretrained("WaveCut/Qwopus3.6-27B-Coder-FP8-int4-AutoRound")
model = AutoModelForMultimodalLM.from_pretrained("WaveCut/Qwopus3.6-27B-Coder-FP8-int4-AutoRound")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

vLLM

How to use WaveCut/Qwopus3.6-27B-Coder-FP8-int4-AutoRound with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "WaveCut/Qwopus3.6-27B-Coder-FP8-int4-AutoRound"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "WaveCut/Qwopus3.6-27B-Coder-FP8-int4-AutoRound",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
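
The same request can also be sent from Python. The following is a minimal sketch using only the standard library; the helper names `build_chat_request` and `send_chat_request` are illustrative and not part of vLLM, and the default base URL assumes the server started above is listening on localhost:8000.

```python
import json
import urllib.request

MODEL_ID = "WaveCut/Qwopus3.6-27B-Coder-FP8-int4-AutoRound"


def build_chat_request(text, image_url=None, model=MODEL_ID):
    """Build an OpenAI-style chat completions payload (hypothetical helper)."""
    content = [{"type": "text", "text": text}]
    if image_url is not None:
        # Same message shape as the curl example: a text part plus an image_url part.
        content.append({"type": "image_url", "image_url": {"url": image_url}})
    return {"model": model, "messages": [{"role": "user", "content": content}]}


def send_chat_request(payload, base_url="http://localhost:8000"):
    """POST the payload to /v1/chat/completions and return the parsed JSON reply.

    Requires a running server; base_url is an assumption about your deployment.
    """
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


if __name__ == "__main__":
    # Only prints the payload; call send_chat_request(payload) once a server is up.
    payload = build_chat_request("Describe this image in one sentence.")
    print(json.dumps(payload, indent=2))
```

Passing `base_url="http://localhost:30000"` should target a server listening on port 30000 instead, since only the port differs between the two curl examples.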

SGLang

How to use WaveCut/Qwopus3.6-27B-Coder-FP8-int4-AutoRound with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "WaveCut/Qwopus3.6-27B-Coder-FP8-int4-AutoRound" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "WaveCut/Qwopus3.6-27B-Coder-FP8-int4-AutoRound",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "WaveCut/Qwopus3.6-27B-Coder-FP8-int4-AutoRound" \
        --host 0.0.0.0 \
        --port 30000

Docker Model Runner
How to use WaveCut/Qwopus3.6-27B-Coder-FP8-int4-AutoRound with Docker Model Runner:
```
docker model run hf.co/WaveCut/Qwopus3.6-27B-Coder-FP8-int4-AutoRound
```

Qwopus3.6-27B-Coder-FP8 INT4 AutoRound

W4A16 INT4 AutoRound quantization of Jackrong/Qwopus3.6-27B-Coder-FP8.

Quantization: AutoRound INT4 weights (W4A16), group size 128, symmetric, format auto_round:auto_gptq.
Source checkpoint: Jackrong/Qwopus3.6-27B-Coder-FP8, as published at the time of quantization.
Non-text multimodal modules are kept in their original precision.
Native Qwen3.5/Qwen3.6 MTP is preserved: mtp.fc is stored as an unpacked BF16 tensor (mtp.fc.weight) rather than a packed INT4 tensor (mtp.fc.qweight), so vLLM can load the MTP drafter.
Produced on one RunPod H200 SXM with an AutoRound nightly build.
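
To confirm how mtp.fc is stored in a downloaded shard, the safetensors header can be read directly. This is a minimal sketch: it relies on the safetensors layout (an 8-byte little-endian header length followed by a JSON header), and it assumes the tensor names end in `mtp.fc.weight` or `mtp.fc.qweight`; the full name prefix and the shard filename are not given here, so the helper names and the suffix matching are illustrative.

```python
import json
import struct


def read_tensor_dtypes(path):
    """Return {tensor_name: dtype} from the JSON header of a .safetensors file."""
    with open(path, "rb") as handle:
        # First 8 bytes: unsigned little-endian length of the JSON header.
        (header_len,) = struct.unpack("<Q", handle.read(8))
        header = json.loads(handle.read(header_len))
    return {
        name: entry["dtype"]
        for name, entry in header.items()
        if name != "__metadata__"  # optional free-form metadata, not a tensor
    }


def mtp_fc_storage(dtypes):
    """Classify mtp.fc as ('packed', dtype), ('unpacked', dtype), or ('missing', None).

    Suffix matching is an assumption about how the tensors are named.
    """
    for name, dtype in dtypes.items():
        if name.endswith("mtp.fc.qweight"):
            return "packed", dtype
    for name, dtype in dtypes.items():
        if name.endswith("mtp.fc.weight"):
            return "unpacked", dtype
    return "missing", None
```

Calling `mtp_fc_storage(read_tensor_dtypes(path))` on the shard that holds the MTP weights would be expected to report an unpacked BF16 tensor if the description above holds.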

vLLM

vllm serve WaveCut/Qwopus3.6-27B-Coder-FP8-int4-AutoRound \
  --dtype bfloat16 \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.85 \
  --trust-remote-code \
  --speculative-config '{"method":"mtp","num_speculative_tokens":1}'

For long-context serving, raise --max-model-len according to your KV-cache budget.

vLLM CUDA 13 Smoke and Benchmarks

Smoke and throughput checks were run on 2026-06-14 with vllm 0.23.0, torch 2.11.0+cu130, Python 3.12.3, one NVIDIA B200, and NVIDIA driver 580.105.08. CUDA Toolkit release notes document per-release minimum driver requirements; in this run, a B200 host with driver 570.* failed CUDA 13 initialization, while driver 580.105.08 worked.
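
A quick pre-flight check can compare a host's driver version against the one that worked here. This is a sketch only: the threshold of 580 is taken from this single run (570.* failed, 580.105.08 worked) and is not an official minimum, and the function names are illustrative.

```python
import re

# Assumption: lowest driver major version observed to work with CUDA 13 in this run.
MIN_WORKING_DRIVER_MAJOR = 580


def parse_driver_version(version):
    """Split a driver string such as '580.105.08' into a tuple of integers."""
    version = version.strip()
    if not re.fullmatch(r"\d+(\.\d+)*", version):
        raise ValueError(f"unrecognized driver version: {version!r}")
    return tuple(int(part) for part in version.split("."))


def driver_matches_working_run(version, min_major=MIN_WORKING_DRIVER_MAJOR):
    """Return True if the driver major version is at least the one that worked here."""
    return parse_driver_version(version)[0] >= min_major


if __name__ == "__main__":
    for candidate in ("570.124.06", "580.105.08"):
        print(candidate, driver_matches_working_run(candidate))
```

The version string can be read with `nvidia-smi --query-gpu=driver_version --format=csv,noheader`; the CUDA Toolkit release notes remain the authoritative source for minimum driver requirements.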

The working RunPod image was runpod/pytorch:1.0.3-cu1300-torch291-ubuntu2404 (cu13-pytorch2.9, template 0uy1f6v18r). After vLLM install, nvidia-cutlass-dsl-libs-cu13 was force-reinstalled once to fix a CUTLASS RECORD mismatch; after that vLLM used the FlashInfer GDN prefill kernel.

vLLM resolved this model as Qwen3_5ForConditionalGeneration, loaded the AutoRound/AutoGPTQ path with MarlinLinearKernel for AutoGPTQLinearMethod, and completed generation. MTP speculative decoding resolved Qwen3_5MTP, loaded without missing-parameter warnings, shared embedding/lm_head with the draft model, and completed generation.

Benchmarks used vllm bench throughput with fixed random prompts, max_model_len=8192, tensor parallel size 1, and local model files on overlay disk. Reported tok/s values cover only vLLM's timed section; model load, compile, CUDA graph capture, and warmup count toward wall time but not toward tok/s.

| case | input -> output (tokens) | prompts | GPU memory utilization | mode | total tok/s | prompt tok/s (est.) | output tok/s (est.) | peak VRAM (GiB) | max power (W) |
|---|---|---|---|---|---|---|---|---|---|
| balanced_graph_u65 | 1024 -> 128 | 64 | 0.65 | graph | 6369.6 | 5661.9 | 707.7 | 117.6 | 850.4 |
| prefill_graph_u65 | 4096 -> 16 | 32 | 0.65 | graph | 7416.7 | 7387.8 | 28.9 | 117.6 | 857.4 |
| decode_graph_u65 | 128 -> 256 | 64 | 0.65 | graph | 4221.6 | 1407.2 | 2814.4 | 116.6 | 819.7 |
| balanced_eager_u65 | 1024 -> 128 | 32 | 0.65 | eager | 2453.9 | 2181.3 | 272.7 | 118.6 | 823.9 |
| balanced_graph_u85 | 1024 -> 128 | 64 | 0.85 | graph | 6614.3 | 5879.4 | 734.9 | 153.9 | 851.3 |
| balanced_mtp_u65 | 1024 -> 128 | 32 | 0.65 | graph + MTP | 4796.2 | 4263.3 | 532.9 | 118.1 | 846.5 |
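
The estimated prompt and output tok/s columns appear consistent with splitting total tok/s in proportion to the input and output lengths. That relationship is inferred from the numbers in the table, not stated by vLLM, so the sketch below (with the hypothetical helper `split_throughput`) should be read as a plausible reconstruction.

```python
def split_throughput(total_tok_s, input_len, output_len):
    """Split total tok/s into (prompt, output) shares proportional to token counts."""
    tokens_per_request = input_len + output_len
    prompt_tok_s = total_tok_s * input_len / tokens_per_request
    output_tok_s = total_tok_s * output_len / tokens_per_request
    return prompt_tok_s, output_tok_s


if __name__ == "__main__":
    # (case, total tok/s, input length, output length) taken from the table above.
    cases = [
        ("balanced_graph_u65", 6369.6, 1024, 128),
        ("prefill_graph_u65", 7416.7, 4096, 16),
        ("decode_graph_u65", 4221.6, 128, 256),
    ]
    for name, total, input_len, output_len in cases:
        prompt, output = split_throughput(total, input_len, output_len)
        print(f"{name}: prompt {prompt:.1f} tok/s, output {output:.1f} tok/s")
```

Because the totals in the table are already rounded, recomputed values can differ from the listed estimates by about 0.1 tok/s.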

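First graph runs had cold costs of around 77-80 seconds for torch.compile plus CUDA graph capture/profile. Repeated graph runs with the same layout loaded the compile cache much faster. Eager mode was substantially slower than graph mode on the balanced workload: 2453.9 versus 6369.6 total tok/s, roughly 2.6x lower, although the eager case used 32 prompts rather than 64.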

24GB RTX 3090 vLLM Smoke

A small fit smoke was run on 2026-06-14 on one RTX 3090 24GB RunPod host with NVIDIA driver 580.159.03 (nvidia-smi CUDA 13.0), vllm 0.23.0, torch 2.11.0+cu128, and runpod/pytorch:1.0.2-cu1281-torch280-ubuntu2404.

The smoke used max_model_len=32768, kv_cache_dtype=fp8, dtype=bfloat16, max_num_seqs=1, max_num_batched_tokens=2048, chunked prefill enabled, prefix caching disabled, and one 128 -> 16 random request. The vLLM Qwen3.5/Qwen3.6 recipe recommends MTP-1 speculative decoding with prefix caching disabled for latency-sensitive low-concurrency serving.

| mode | load format | result | peak VRAM | KV cache | 32k concurrency | smoke throughput |
|---|---|---|---|---|---|---|
| no MTP | `fastsafetensors` | pass | 22174 MiB | 64170 tokens | 1.96x | 50.33 total tok/s, 5.59 output tok/s |
| MTP-1 | `safetensors` | pass | 24110 MiB | 60681 tokens | 1.85x | 28.94 total tok/s, 3.22 output tok/s |
| MTP-1 | `fastsafetensors` | fail | 23778 MiB | n/a | n/a | CUDA OOM while allocating a 3.00 GiB staging buffer |
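
The 32k concurrency column matches the KV cache capacity divided by max_model_len, meaning the number of full-length 32768-token sequences the cache can hold at once. A small sketch of that arithmetic, with illustrative function names:

```python
def max_concurrency(kv_cache_tokens, max_model_len):
    """Number of full-length sequences that fit in the KV cache at once."""
    return kv_cache_tokens / max_model_len


def mib_to_gib(mib):
    """Convert MiB to GiB (1 GiB = 1024 MiB)."""
    return mib / 1024


if __name__ == "__main__":
    # (mode, KV cache tokens, peak VRAM in MiB) from the passing rows above.
    rows = [("no MTP", 64170, 22174), ("MTP-1", 60681, 24110)]
    for mode, kv_tokens, peak_mib in rows:
        concurrency = max_concurrency(kv_tokens, 32768)
        print(f"{mode}: {concurrency:.2f}x at 32k, peak {mib_to_gib(peak_mib):.2f} GiB")
```

A value below 2x means only one full 32k-token request fits with headroom, which is in line with running the smoke at max_num_seqs=1.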

Recommended 24GB command shape:

vllm serve WaveCut/Qwopus3.6-27B-Coder-FP8-int4-AutoRound \
  --dtype bfloat16 \
  --max-model-len 32768 \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.95 \
  --max-num-seqs 1 \
  --max-num-batched-tokens 2048 \
  --enable-chunked-prefill \
  --no-enable-prefix-caching \
  --load-format safetensors

For MTP-1 on 24GB, keep --load-format safetensors and add:

--speculative-config '{"method":"mtp","num_speculative_tokens":1}'
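
When the server is launched from a script, the JSON value for --speculative-config is easier to generate than to quote by hand. The sketch below assembles the 24GB command shown above as an argument list; `build_serve_command` is a hypothetical helper, and the flag values are copied from this section rather than tuned.

```python
import json
import shlex

MODEL_ID = "WaveCut/Qwopus3.6-27B-Coder-FP8-int4-AutoRound"


def build_serve_command(model=MODEL_ID, max_model_len=32768, mtp_tokens=0):
    """Return the vllm serve argument list for the 24GB setup described above."""
    args = [
        "vllm", "serve", model,
        "--dtype", "bfloat16",
        "--max-model-len", str(max_model_len),
        "--kv-cache-dtype", "fp8",
        "--gpu-memory-utilization", "0.95",
        "--max-num-seqs", "1",
        "--max-num-batched-tokens", "2048",
        "--enable-chunked-prefill",
        "--no-enable-prefix-caching",
        "--load-format", "safetensors",
    ]
    if mtp_tokens > 0:
        # Compact separators reproduce the JSON string used in the command above.
        spec = {"method": "mtp", "num_speculative_tokens": mtp_tokens}
        args += ["--speculative-config", json.dumps(spec, separators=(",", ":"))]
    return args


if __name__ == "__main__":
    # shlex.join quotes the JSON argument so the line can be pasted into a shell.
    print(shlex.join(build_serve_command(mtp_tokens=1)))
```

The returned list can be passed directly to `subprocess.run`, which avoids shell quoting of the JSON argument altogether.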

Provenance

This repo was generated from the public Apache-2.0 source checkpoint. It keeps the upstream tokenizer, processor, chat template, vision config, and Qwen3.5 MTP config intact.

Downloads last month: 35

Safetensors
Model size: 6B params
Tensor type: I32, BF16, F16

Model tree for WaveCut/Qwopus3.6-27B-Coder-FP8-int4-AutoRound
Base model: Jackrong/Qwopus3.6-27B-v2
Adapter: Jackrong/Qwopus3.6-27B-Coder-FP8
Quantized (2): this model