Gemma 4 31B Dense AWQ 4-bit
In-house AWQ 4-bit calibration of google/gemma-4-31b-it, end-to-end from the upstream BF16 base. Thinking + vision aware calibration via balanced_thinking_vision corpus (40% AM-Thinking-v1-Distilled / 30% LLaVA-Instruct / 15% NuminaMath / 15% UltraChat).
Replaces the older mattbucci/gemma-4-31B-it-AutoRound-AWQ which was a repack of Intel's AutoRound GPTQ output (50.4% negative scales). This ship is fully in-house: standard AWQ scales, thinking traces preserved, vision tower kept BF16.
Model Details
| Base model | google/gemma-4-31b-it |
| Architecture | Dense with sliding window attention (50 SWA + 10 full attention layers) |
| Parameters | 31B |
| Layers | 60 |
| Quantization | AWQ 4-bit, group_size=128 |
| Calibration | 512 samples × 1024 tokens, balanced_thinking_vision recipe (text-only — vision tower BF16) |
| Scale audit | 0 / 410 quantized tensors flagged (clean) |
Capability Validation (R9700 / SGLang v0.5.11)
| Probe | Result | Notes |
|---|---|---|
| basic ("What is the capital of France?") | ✅ | clean 'paris', finish=stop |
| thinking | ✅ | 460 tok reasoning, terminated cleanly |
| vision (red circle on white) | ⚠ crashes | see Known Limitations |
Known Limitations
- Vision: BROKEN on RDNA4. The model generates a coherent vision response but the server crashes mid-decode with
HSA_STATUS_ERROR_EXCEPTION 0x1016intorch_native_backend.py:332 forward_decode. This is the same upstream "Gemma 4 31B Dense — 400-token attention degradation" issue that affects this dense variant on ROCm regardless of recipe. Cross-team validation on Ampere/3090 stack pending — if Ampere passes, this is purely an RDNA4-side ROCm SDPA limitation. For vision workloads, usemattbucci/gemma-4-26B-AWQ(the multimodal MoE flagship, fully working) ormattbucci/Qwen3.6-27B-AWQ(DeltaNet hybrid VL, smaller but vision works end-to-end on RDNA4). - Triton attention degrades at 400+ tokens on Gemma4's 60-layer SWA. Use
--attention-backend torch_native(thegemma4-31blaunch preset already defaults to this). - Decode speed: 15 tok/s single-user on 2x R9700 (BF16 activations + Triton GEMV).
Usage with SGLang
git clone https://github.com/mattbucci/2x-R9700-RDNA4-GFX1201-sglang-inference
cd 2x-R9700-RDNA4-GFX1201-sglang-inference
./scripts/setup.sh
scripts/launch.sh gemma4-31b
The gemma4-31b preset uses torch_native attention + Triton GEMV with FP32 dequant for stability on RDNA4.
Hardware
Tested on 2x AMD Radeon AI PRO R9700 (gfx1201, RDNA4, 32+34 GB VRAM) with ROCm 7.2 and SGLang v0.5.11 + RDNA4 patches.
License
Apache 2.0, inherited from the upstream Gemma 4 base.
- Downloads last month
- 39,618