Qwen3-Coder-30B MoE AWQ 4-bit

AWQ 4-bit quantization of Qwen3 Coder 30B-A3B optimized for AMD RDNA4 (gfx1201) inference with SGLang.

Model Details


Base model	Qwen/Qwen3-Coder-30B-A3B
Architecture	MoE (128 experts, top-8)
Parameters	30B total / 3B active
Layers	48
Context	32K (tested), 262K (max)
Quantization	AWQ 4-bit, group_size=128

Performance (2x AMD Radeon AI PRO R9700, TP=2)

Decode speed: 30 tok/s single-user on 2x R9700
Launch: scripts/launch.sh coder-30b

Notes

Best throughput MoE model for coding tasks. 166 tok/s at 32 concurrent users.

Usage with SGLang

git clone https://github.com/mattbucci/2x-R9700-RDNA4-GFX1201-sglang-inference
cd 2x-R9700-RDNA4-GFX1201-sglang-inference
./scripts/setup.sh
scripts/launch.sh coder-30b

See the RDNA4 Inference Repository for full setup instructions, patches, and benchmarks.

Hardware

Tested on 2x AMD Radeon AI PRO R9700 (gfx1201, RDNA4, 32+34 GB VRAM) with ROCm 7.2 and SGLang v0.5.10 + RDNA4 patches.

Downloads last month: 143

Safetensors

Model size

31B params

Tensor type

I32

BF16

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support