gemma-4-31B-it-FP8

FP8-quantized version of google/gemma-4-31B-it (31B dense parameters), produced by protoLabsAI.

Performance (RTX PRO 6000 Blackwell)

Config      Decode    VRAM        Claw   Custom  FC
FP8 1×GPU   44 tok/s  91 GiB      0.621  10/10   8/8
FP8 TP=2    66 tok/s  91 GiB/GPU  0.621  10/10   8/8

This is the dense quality-ceiling model of the family. Consider the 26B-A4B MoE variant for 3-4× better decode speed at similar quality.
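A rough way to reproduce the decode numbers above, assuming a local vLLM server started with one of the commands in Usage below (vLLM exposes an OpenAI-compatible API). The endpoint, prompt, and token budget are placeholders; timing from the first streamed chunk onward approximates pure decode speed by excluding prefill:

import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

stream = client.completions.create(
    model="protoLabsAI/gemma-4-31B-it-FP8",
    prompt="Explain FP8 quantization in one paragraph.",
    max_tokens=512,
    temperature=0.0,
    stream=True,
)

first = None
chunks = 0
for _ in stream:
    if first is None:
        first = time.perf_counter()  # first chunk marks the end of prefill
    chunks += 1
last = time.perf_counter()

# vLLM streams roughly one token per chunk, so this approximates decode tok/s.
print(f"~{(chunks - 1) / (last - first):.1f} tok/s over {chunks} tokens")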

Quantization Details

Property       Value
Base model     google/gemma-4-31B-it
Quant method   Native FP8 (float8_e4m3fn)
Weight scheme  Per-block (128×128), sharded save
Size           33.1 GB (vs. 59 GB BF16, a 44% reduction)
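For illustration, a minimal sketch of what per-block (128×128) FP8 quantization does, in PyTorch (float8_e4m3fn requires torch >= 2.1). This is a toy version, not the exact recipe used to produce this checkpoint:

import torch

BLOCK = 128
FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn

def quantize_per_block(w: torch.Tensor):
    """Quantize a 2-D weight to FP8 with one float32 scale per 128x128 block."""
    rows, cols = w.shape
    assert rows % BLOCK == 0 and cols % BLOCK == 0, "pad weights to a block multiple"
    # (rows, cols) -> (rows/128, 128, cols/128, 128): dims 1 and 3 span one block.
    blocks = w.float().view(rows // BLOCK, BLOCK, cols // BLOCK, BLOCK)
    amax = blocks.abs().amax(dim=(1, 3), keepdim=True).clamp_(min=1e-12)
    scale = amax / FP8_MAX  # pre-scales every block into the representable range
    q = (blocks / scale).to(torch.float8_e4m3fn)
    return q.view(rows, cols), scale[:, 0, :, 0]

def dequantize_per_block(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    rows, cols = q.shape
    blocks = q.float().view(rows // BLOCK, BLOCK, cols // BLOCK, BLOCK)
    s = scale.view(rows // BLOCK, 1, cols // BLOCK, 1)
    return (blocks * s).view(rows, cols)

w = torch.randn(256, 512, dtype=torch.bfloat16)
q, s = quantize_per_block(w)
print("max abs error:", (dequantize_per_block(q, s) - w.float()).abs().max().item())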

Usage

# Single GPU (44 tok/s)
vllm serve protoLabsAI/gemma-4-31B-it-FP8 \
  --quantization fp8 \
  --max-model-len 32768

# TP=2 (66 tok/s, more context)
vllm serve protoLabsAI/gemma-4-31B-it-FP8 \
  --quantization fp8 \
  --tensor-parallel-size 2 \
  --max-model-len 65536

Requires vLLM built from main (PR #38826 or later).
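Once a server is up, any OpenAI-compatible client can query it. A minimal sketch, assuming the default local endpoint (the prompt is a placeholder, and the api_key value is unused by vLLM):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="protoLabsAI/gemma-4-31B-it-FP8",
    messages=[{"role": "user", "content": "Summarize the FP8 e4m3 format in two sentences."}],
    max_tokens=128,
)
print(resp.choices[0].message.content)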

Produced By

protoLabsAI
