Gemma 4 31B Dense AWQ 4-bit

In-house AWQ 4-bit calibration of google/gemma-4-31b-it, end-to-end from the upstream BF16 base. Thinking + vision aware calibration via balanced_thinking_vision corpus (40% AM-Thinking-v1-Distilled / 30% LLaVA-Instruct / 15% NuminaMath / 15% UltraChat).

Replaces the older mattbucci/gemma-4-31B-it-AutoRound-AWQ which was a repack of Intel's AutoRound GPTQ output (50.4% negative scales). This ship is fully in-house: standard AWQ scales, thinking traces preserved, vision tower kept BF16.

Model Details


Base model	google/gemma-4-31b-it
Architecture	Dense with sliding window attention (50 SWA + 10 full attention layers)
Parameters	31B
Layers	60
Quantization	AWQ 4-bit, group_size=128
Calibration	512 samples × 1024 tokens, `balanced_thinking_vision` recipe (text-only — vision tower BF16)
Scale audit	0 / 410 quantized tensors flagged (clean)

Capability Validation (R9700 / SGLang v0.5.11)

Probe	Result	Notes
basic ("What is the capital of France?")	✅	clean 'paris', finish=stop
thinking	✅	460 tok reasoning, terminated cleanly
vision (red circle on white)	⚠ crashes	see Known Limitations

Known Limitations

Vision: BROKEN on RDNA4. The model generates a coherent vision response but the server crashes mid-decode with HSA_STATUS_ERROR_EXCEPTION 0x1016 in torch_native_backend.py:332 forward_decode. This is the same upstream "Gemma 4 31B Dense — 400-token attention degradation" issue that affects this dense variant on ROCm regardless of recipe. Cross-team validation on Ampere/3090 stack pending — if Ampere passes, this is purely an RDNA4-side ROCm SDPA limitation. For vision workloads, use mattbucci/gemma-4-26B-AWQ (the multimodal MoE flagship, fully working) or mattbucci/Qwen3.6-27B-AWQ (DeltaNet hybrid VL, smaller but vision works end-to-end on RDNA4).
Triton attention degrades at 400+ tokens on Gemma4's 60-layer SWA. Use --attention-backend torch_native (the gemma4-31b launch preset already defaults to this).
Decode speed: 15 tok/s single-user on 2x R9700 (BF16 activations + Triton GEMV).

Usage with SGLang

git clone https://github.com/mattbucci/2x-R9700-RDNA4-GFX1201-sglang-inference
cd 2x-R9700-RDNA4-GFX1201-sglang-inference
./scripts/setup.sh
scripts/launch.sh gemma4-31b

The gemma4-31b preset uses torch_native attention + Triton GEMV with FP32 dequant for stability on RDNA4.

Hardware

Tested on 2x AMD Radeon AI PRO R9700 (gfx1201, RDNA4, 32+34 GB VRAM) with ROCm 7.2 and SGLang v0.5.11 + RDNA4 patches.

License

Apache 2.0, inherited from the upstream Gemma 4 base.

Downloads last month: 3,646

Safetensors

Model size

31B params

Tensor type

F16

I32

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support