Upload README.md with huggingface_hub

README.md CHANGED

@@ -12,7 +12,16 @@ pipeline_tag: text-generation
# gemma-4-31B-it-FP8

- FP8 quantized version of [google/gemma-4-31B-it](https://huggingface.co/google/gemma-4-31B-it), produced by [protoLabsAI](https://huggingface.co/protoLabsAI).

## Quantization Details

@@ -20,50 +29,26 @@ FP8 quantized version of [google/gemma-4-31B-it](https://huggingface.co/google/g
|----------|-------|
| Base model | google/gemma-4-31B-it |
| Quant method | Native FP8 (float8_e4m3fn) |
- | Weight scheme | Per-block (128×128) |
-
- | Modules skipped | embed_tokens, lm_head, norms, visual encoder |
-
- ## Benchmarks (RTX PRO 6000 Blackwell)
-
- Evaluated on protoLabs eval suite (quick profile: 10 claw-eval tasks + 9 custom suites + 8 function-call tests).
-
- **gemma-4-26B-A4B-it (MoE, 4B active):**

- | Config | Decode | TTFT | VRAM | Context | Claw | Custom | FC |
- |--------|:------:|:----:|:----:|:-------:|:--------:|:------:|:--:|
- | BF16 1×GPU | 141 tok/s | 52ms | 48.5 GiB | 32K | 0.626 | 9/9 | 8/8 |
- | FP8 1×GPU | 173 tok/s | 83ms | 25.7 GiB | 32K | 0.634 | 9/9 | 8/8 |
- | FP8 TP=2 | 208 tok/s | 254ms | ~13 GiB/GPU | 65K | — | — | — |
- | FP8 TP=2 | 153 tok/s | 301ms | ~45 GiB/GPU | 256K | — | — | — |
-
- **No quality degradation from FP8 quantization** — FP8 scores marginally higher than BF16 (0.634 vs 0.626).
-
- ## Usage with vLLM

```bash
- #
- vllm serve google/gemma-4-
  --quantization fp8 \
-   --max-model-len 32768 \
-   --enable-auto-tool-choice \
-   --tool-call-parser gemma4
-
- # Or use these pre-quantized weights directly
- vllm serve protoLabsAI/gemma-4-31B-it-FP8 \
  --max-model-len 32768

- # TP=2
- vllm serve google/gemma-4-
-   --quantization fp8
  --tensor-parallel-size 2 \
-   --max-model-len
```

-

## Produced By

- [protoLabsAI](https://github.com/protoLabsAI)
-
- Quantization pipeline: [protoLabsAI/lab/experiments/quantize](https://github.com/protoLabsAI/lab/tree/main/experiments/quantize)
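The "Per-block (128×128)" weight scheme in the Quantization Details above stores one scale factor per 128×128 weight tile, chosen so that the tile's largest magnitude maps onto float8_e4m3fn's representable range (max normal value 448). A minimal sketch in plain Python, illustrative only and not the protoLabsAI pipeline itself (actual FP8 casting would use e.g. torch's float8_e4m3fn dtype):

```python
# Illustrative per-block quantization scaling, as in a "Per-block (128x128)"
# FP8 scheme. We only compute the per-block scale factors here; real code
# would also round w / scale to FP8 (float8_e4m3fn) values.
E4M3_MAX = 448.0  # largest normal value representable in float8_e4m3fn
BLOCK = 2         # the real scheme uses 128x128 blocks; 2x2 keeps the demo tiny

def block_scales(weight):
    """Return a grid of per-block scales: scale = max |w| in block / 448."""
    rows, cols = len(weight), len(weight[0])
    scales = []
    for r0 in range(0, rows, BLOCK):
        row_scales = []
        for c0 in range(0, cols, BLOCK):
            amax = max(abs(weight[r][c])
                       for r in range(r0, min(r0 + BLOCK, rows))
                       for c in range(c0, min(c0 + BLOCK, cols)))
            row_scales.append(amax / E4M3_MAX if amax else 1.0)
        scales.append(row_scales)
    return scales

w = [[0.5, -1.0, 2.0, 0.1],
     [0.25, 0.75, -4.0, 0.2],
     [8.0, 0.0, 0.3, 0.3],
     [-2.0, 1.5, 0.1, -0.6]]

# One scale per 2x2 tile; each tile is stored as round_to_fp8(w / scale)
# plus its scale, so dequantization is just fp8_value * scale.
print(block_scales(w))
```

Because each tile carries its own scale, an outlier in one 128×128 block cannot crush the precision of the rest of the matrix, which is the motivation for per-block rather than per-tensor scaling.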
# gemma-4-31B-it-FP8

+ FP8 quantized version of [google/gemma-4-31B-it](https://huggingface.co/google/gemma-4-31B-it) (31B dense), produced by [protoLabsAI](https://huggingface.co/protoLabsAI).
+
+ ## Performance (RTX PRO 6000 Blackwell)
+
+ | Config | Decode | VRAM | Claw | Custom | FC |
+ |--------|:------:|:----:|:----:|:------:|:--:|
+ | FP8 1×GPU | 44 tok/s | 91 GiB | 0.621 | 10/10 | 8/8 |
+ | FP8 TP=2 | 66 tok/s | 91 GiB/GPU | 0.621 | 10/10 | 8/8 |
+
+ Dense quality-ceiling model. Consider the 26B-A4B MoE variant for 3-4x better speed at similar quality.

## Quantization Details

|----------|-------|
| Base model | google/gemma-4-31B-it |
| Quant method | Native FP8 (float8_e4m3fn) |
+ | Weight scheme | Per-block (128×128), sharded save |
+ | Size | 33.1 GB (vs 59 GB BF16, 44% reduction) |

+ ## Usage

```bash
+ # Single GPU (44 tok/s)
+ vllm serve google/gemma-4-31B-it \
  --quantization fp8 \
  --max-model-len 32768

+ # TP=2 (66 tok/s, more context)
+ vllm serve google/gemma-4-31B-it \
+   --quantization fp8 \
  --tensor-parallel-size 2 \
+   --max-model-len 65536
```

+ Requires vLLM from main (>= PR #38826).

## Produced By

+ [protoLabsAI](https://github.com/protoLabsAI)
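As a sanity check on the Size row added above, the quoted 44% reduction follows directly from the two checkpoint sizes (plain Python, numbers taken from the table):

```python
# Size figures from the Quantization Details table
fp8_gb = 33.1   # FP8 checkpoint
bf16_gb = 59.0  # original BF16 checkpoint

reduction = 1 - fp8_gb / bf16_gb
print(f"{reduction:.0%}")  # prints "44%"
```

At roughly one byte per parameter plus per-block scale factors, 33.1 GB is also about what an FP8 checkpoint of a ~31B-parameter model should weigh, versus roughly two bytes per parameter for BF16.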