Gemma 4 31B Dense AWQ 4-bit

In-house AWQ 4-bit calibration of google/gemma-4-31b-it, end-to-end from the upstream BF16 base. Thinking + vision aware calibration via balanced_thinking_vision corpus (40% AM-Thinking-v1-Distilled / 30% LLaVA-Instruct / 15% NuminaMath / 15% UltraChat).

Replaces the older mattbucci/gemma-4-31B-it-AutoRound-AWQ which was a repack of Intel's AutoRound GPTQ output (50.4% negative scales). This ship is fully in-house: standard AWQ scales, thinking traces preserved, vision tower kept BF16.

Model Details

Base model google/gemma-4-31b-it
Architecture Dense with sliding window attention (50 SWA + 10 full attention layers)
Parameters 31B
Layers 60
Quantization AWQ 4-bit, group_size=128
Calibration 512 samples × 1024 tokens, balanced_thinking_vision recipe (text-only — vision tower BF16)
Scale audit 0 / 410 quantized tensors flagged (clean)

Capability Validation (R9700 / SGLang v0.5.11)

Probe Result Notes
basic ("What is the capital of France?") clean 'paris', finish=stop
thinking 460 tok reasoning, terminated cleanly
vision (red circle on white) ⚠ crashes see Known Limitations

Known Limitations

  • Vision: BROKEN on RDNA4. The model generates a coherent vision response but the server crashes mid-decode with HSA_STATUS_ERROR_EXCEPTION 0x1016 in torch_native_backend.py:332 forward_decode. This is the same upstream "Gemma 4 31B Dense — 400-token attention degradation" issue that affects this dense variant on ROCm regardless of recipe. Cross-team validation on Ampere/3090 stack pending — if Ampere passes, this is purely an RDNA4-side ROCm SDPA limitation. For vision workloads, use mattbucci/gemma-4-26B-AWQ (the multimodal MoE flagship, fully working) or mattbucci/Qwen3.6-27B-AWQ (DeltaNet hybrid VL, smaller but vision works end-to-end on RDNA4).
  • Triton attention degrades at 400+ tokens on Gemma4's 60-layer SWA. Use --attention-backend torch_native (the gemma4-31b launch preset already defaults to this).
  • Decode speed: 15 tok/s single-user on 2x R9700 (BF16 activations + Triton GEMV).

Usage with SGLang

git clone https://github.com/mattbucci/2x-R9700-RDNA4-GFX1201-sglang-inference
cd 2x-R9700-RDNA4-GFX1201-sglang-inference
./scripts/setup.sh
scripts/launch.sh gemma4-31b

The gemma4-31b preset uses torch_native attention + Triton GEMV with FP32 dequant for stability on RDNA4.

Hardware

Tested on 2x AMD Radeon AI PRO R9700 (gfx1201, RDNA4, 32+34 GB VRAM) with ROCm 7.2 and SGLang v0.5.11 + RDNA4 patches.

License

Apache 2.0, inherited from the upstream Gemma 4 base.

Downloads last month
39,618
Safetensors
Model size
31B params
Tensor type
F16
·
I32
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support