artificial-citizen committed (verified)
Commit f5ddd13 · 1 parent: 8cb1e54

Upload README.md with huggingface_hub

Files changed (1): README.md (+21 -36)
README.md CHANGED

@@ -12,7 +12,16 @@ pipeline_tag: text-generation
 
 # gemma-4-31B-it-FP8
 
-FP8 quantized version of [google/gemma-4-31B-it](https://huggingface.co/google/gemma-4-31B-it), produced by [protoLabsAI](https://huggingface.co/protoLabsAI).
+FP8 quantized version of [google/gemma-4-31B-it](https://huggingface.co/google/gemma-4-31B-it) (31B dense), produced by [protoLabsAI](https://huggingface.co/protoLabsAI).
+
+## Performance (RTX PRO 6000 Blackwell)
+
+| Config | Decode | VRAM | Claw | Custom | FC |
+|--------|:------:|:----:|:----:|:------:|:--:|
+| FP8 1×GPU | 44 tok/s | 91 GiB | 0.621 | 10/10 | 8/8 |
+| FP8 TP=2 | 66 tok/s | 91 GiB/GPU | 0.621 | 10/10 | 8/8 |
+
+Dense quality ceiling model. Consider the 26B-A4B MoE variant for 3-4x better speed at similar quality.
 
 ## Quantization Details
 
@@ -20,50 +29,26 @@ FP8 quantized version of [google/gemma-4-31B-it](https://huggingface.co/g
 |----------|-------|
 | Base model | google/gemma-4-31B-it |
 | Quant method | Native FP8 (float8_e4m3fn) |
-| Weight scheme | Per-block (128×128) |
-| Activation scheme | Dynamic |
-| Modules skipped | embed_tokens, lm_head, norms, visual encoder |
-
-## Benchmarks (RTX PRO 6000 Blackwell)
-
-Evaluated on protoLabs eval suite (quick profile: 10 claw-eval tasks + 9 custom suites + 8 function-call tests).
-
-**gemma-4-26B-A4B-it (MoE, 4B active):**
+| Weight scheme | Per-block (128×128), sharded save |
+| Size | 33.1 GB (vs 59 GB BF16, 44% reduction) |
 
-| Config | Decode | TTFT | VRAM | Context | Claw Avg | Custom | FC |
-|--------|:------:|:----:|:----:|:-------:|:--------:|:------:|:--:|
-| BF16 1×GPU | 141 tok/s | 52ms | 48.5 GiB | 32K | 0.626 | 9/9 | 8/8 |
-| FP8 1×GPU | 173 tok/s | 83ms | 25.7 GiB | 32K | 0.634 | 9/9 | 8/8 |
-| FP8 TP=2 | 208 tok/s | 254ms | ~13 GiB/GPU | 65K | — | — | — |
-| FP8 TP=2 | 153 tok/s | 301ms | ~45 GiB/GPU | 256K | — | — | — |
-
-**No quality degradation from FP8 quantization** — FP8 scores marginally higher than BF16 (0.634 vs 0.626).
-
-## Usage with vLLM
+## Usage
 
 ```bash
-# On-the-fly FP8 (recommended — loads BF16 weights, quantizes at load time)
-vllm serve google/gemma-4-26B-A4B-it \
+# Single GPU (44 tok/s)
+vllm serve google/gemma-4-31B-it \
     --quantization fp8 \
-    --max-model-len 32768 \
-    --enable-auto-tool-choice \
-    --tool-call-parser gemma4
-
-# Or use these pre-quantized weights directly
-vllm serve protoLabsAI/gemma-4-31B-it-FP8 \
     --max-model-len 32768
 
-# TP=2 for 256K context
-vllm serve google/gemma-4-26B-A4B-it \
-    --quantization fp8 --kv-cache-dtype fp8 \
+# TP=2 (66 tok/s, more context)
+vllm serve google/gemma-4-31B-it \
+    --quantization fp8 \
     --tensor-parallel-size 2 \
-    --max-model-len 262144
+    --max-model-len 65536
 ```
 
-**Requires vLLM built from main** (>= commit 08ed2b968, PR #38826) — Gemma 4 support not yet in a release.
+Requires vLLM from main (>= PR #38826).
 
 ## Produced By
 
-[protoLabsAI](https://github.com/protoLabsAI) — AI inference node running 2× RTX PRO 6000 Blackwell (192 GB VRAM).
-
-Quantization pipeline: [protoLabsAI/lab/experiments/quantize](https://github.com/protoLabsAI/lab/tree/main/experiments/quantize)
+[protoLabsAI](https://github.com/protoLabsAI)
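The diff's Quantization Details table describes per-block (128×128) weights in float8_e4m3fn. As a rough illustration of what that scheme means, here is a minimal NumPy sketch, assuming one float32 scale per 128×128 tile chosen so the tile's absolute maximum maps to the E4M3 max of 448. The helper names are invented for this example; the actual protoLabsAI quantization pipeline is not shown here.

```python
# Sketch of per-block (128x128) FP8 E4M3 weight quantization. Illustrative
# only: round_to_e4m3 approximates rounding onto the e4m3 grid in float32.
import numpy as np

E4M3_MAX = 448.0  # largest finite value representable in float8_e4m3fn

def round_to_e4m3(x):
    """Round float32 values onto the e4m3 grid (3 mantissa bits)."""
    out = np.zeros_like(x)
    nz = x != 0
    e = np.clip(np.floor(np.log2(np.abs(x[nz]))), -6, 8)  # e4m3fn exponent range
    q = np.exp2(e - 3)                                     # value spacing at that exponent
    out[nz] = np.clip(np.round(x[nz] / q) * q, -E4M3_MAX, E4M3_MAX)
    return out

def quantize_per_block(w, block=128):
    """Quantize a 2-D float32 matrix with one scale per block x block tile."""
    rows, cols = w.shape
    scales = np.zeros((rows // block, cols // block), dtype=np.float32)
    wq = np.zeros_like(w)
    for i in range(0, rows, block):
        for j in range(0, cols, block):
            tile = w[i:i + block, j:j + block]
            s = max(float(np.abs(tile).max()) / E4M3_MAX, 1e-12)  # tile max -> 448
            scales[i // block, j // block] = s
            wq[i:i + block, j:j + block] = round_to_e4m3(tile / s)
    return wq, scales

def dequantize(wq, scales, block=128):
    """Broadcast each tile's scale back over the tile and multiply."""
    return wq * np.kron(scales, np.ones((block, block), dtype=np.float32))

rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256)).astype(np.float32)
wq, scales = quantize_per_block(w)
w_hat = dequantize(wq, scales)
err = float(np.abs(w - w_hat).max())  # small reconstruction error expected
```

Dequantizing immediately, as above, is only a fidelity check; an inference engine such as vLLM keeps the FP8 tiles and applies the per-block scales inside its matmul kernels instead.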
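The new Size row claims 33.1 GB versus 59 GB for BF16, a 44% reduction. A back-of-envelope check of those figures, assuming roughly 31e9 parameters, one byte per FP8 weight plus one float32 scale per 128×128 block, and two bytes per BF16 weight (the reported checkpoint sizes themselves come from the model card):

```python
# Rough size arithmetic for a ~31B-parameter model; the 33.1 GB and 59 GB
# figures are the model card's, the parameter count here is an assumption.
params = 31e9
block = 128
fp8_bytes = params * 1 + (params / (block * block)) * 4  # weights + per-block scales
bf16_bytes = params * 2

print(f"FP8  ~ {fp8_bytes / 1e9:.1f} GB")   # prints "FP8  ~ 31.0 GB"
print(f"BF16 ~ {bf16_bytes / 1e9:.1f} GB")  # prints "BF16 ~ 62.0 GB"

# Reduction using the card's reported checkpoint sizes:
reduction = 1 - 33.1 / 59
print(f"reduction ~ {reduction:.0%}")       # prints "reduction ~ 44%"
```

The per-block scale overhead is negligible (a few MB); the gap between these estimates and the reported sizes presumably comes from modules kept in higher precision and from rounding conventions the card does not specify.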