# gemma-4-31B-it-FP8

An FP8-quantized version of google/gemma-4-31B-it (31B dense parameters), produced by protoLabsAI.
## Performance (RTX PRO 6000 Blackwell)
| Config | Decode | VRAM | Claw | Custom | FC |
|---|---|---|---|---|---|
| FP8 1×GPU | 44 tok/s | 91 GiB | 0.621 | 10/10 | 8/8 |
| FP8 TP=2 | 66 tok/s | 91 GiB/GPU | 0.621 | 10/10 | 8/8 |
This is the dense quality-ceiling model. Consider the 26B-A4B MoE variant for 3-4× faster decode at similar quality.
## Quantization Details
| Property | Value |
|---|---|
| Base model | google/gemma-4-31B-it |
| Quant method | Native FP8 (float8_e4m3fn) |
| Weight scheme | Per-block (128×128), sharded save |
| Size | 33.1 GB (vs 59 GB BF16, 44% reduction) |
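The per-block scheme above means each 128×128 tile of a weight matrix gets its own scale, chosen so the tile's largest absolute value maps to the FP8 E4M3 maximum (448). A minimal NumPy sketch of the idea (illustrative only: it models the per-tile scaling but not the 3-bit mantissa rounding that actual float8_e4m3fn storage applies):

```python
import numpy as np

E4M3_MAX = 448.0   # largest finite value representable in float8_e4m3fn
BLOCK = 128        # each 128x128 weight tile is quantized independently

def quantize_per_block(w: np.ndarray, block: int = BLOCK):
    """Return (scaled_values, scales); each block x block tile has its own scale."""
    rows, cols = w.shape
    nr, nc = rows // block, cols // block
    scales = np.empty((nr, nc), dtype=np.float32)
    q = np.empty_like(w, dtype=np.float32)  # stand-in for the FP8 payload
    for i in range(nr):
        for j in range(nc):
            tile = w[i * block:(i + 1) * block, j * block:(j + 1) * block]
            # Scale so the tile's max magnitude lands exactly on E4M3_MAX
            s = float(np.abs(tile).max()) / E4M3_MAX
            scales[i, j] = s if s > 0 else 1.0
            q[i * block:(i + 1) * block, j * block:(j + 1) * block] = np.clip(
                tile / scales[i, j], -E4M3_MAX, E4M3_MAX)
    return q, scales

def dequantize(q: np.ndarray, scales: np.ndarray, block: int = BLOCK) -> np.ndarray:
    """Multiply each tile back by its scale to recover the original range."""
    out = np.empty_like(q)
    for i in range(scales.shape[0]):
        for j in range(scales.shape[1]):
            out[i * block:(i + 1) * block, j * block:(j + 1) * block] = (
                q[i * block:(i + 1) * block, j * block:(j + 1) * block]
                * scales[i, j])
    return out
```

Because outliers only inflate the scale of their own tile, per-block scaling preserves small-magnitude weights far better than a single per-tensor scale would.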
## Usage
```shell
# Single GPU (44 tok/s)
vllm serve protoLabsAI/gemma-4-31B-it-FP8 \
  --quantization fp8 \
  --max-model-len 32768
```

```shell
# TP=2 (66 tok/s, more context)
vllm serve protoLabsAI/gemma-4-31B-it-FP8 \
  --quantization fp8 \
  --tensor-parallel-size 2 \
  --max-model-len 65536
```
Requires vLLM from main (>= PR #38826).
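Once the server is up, any OpenAI-compatible client can talk to it. A minimal stdlib-only sketch (assumes vLLM's default port 8000; the helper function names here are illustrative, not part of any API):

```python
import json
from urllib import request

def build_chat_request(prompt: str,
                       model: str = "protoLabsAI/gemma-4-31B-it-FP8",
                       max_tokens: int = 256) -> dict:
    """Assemble an OpenAI-style /v1/chat/completions payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(prompt: str, base_url: str = "http://localhost:8000") -> str:
    """POST to a running `vllm serve` instance and return the reply text."""
    body = json.dumps(build_chat_request(prompt)).encode()
    req = request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        data = json.loads(resp.read())
    return data["choices"][0]["message"]["content"]
```

The `model` field must match the name the server was launched with, so use the same repo id you passed to `vllm serve`.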