artificial-citizen committed (verified)
Commit f5ddd13 · 1 parent: 8cb1e54

Upload README.md with huggingface_hub

Files changed (1): README.md (+21 -36)
README.md CHANGED

@@ -12,7 +12,16 @@ pipeline_tag: text-generation
 
 # gemma-4-31B-it-FP8
 
-FP8 quantized version of [google/gemma-4-31B-it](https://huggingface.co/google/gemma-4-31B-it), produced by [protoLabsAI](https://huggingface.co/protoLabsAI).
+FP8 quantized version of [google/gemma-4-31B-it](https://huggingface.co/google/gemma-4-31B-it) (31B dense), produced by [protoLabsAI](https://huggingface.co/protoLabsAI).
+
+## Performance (RTX PRO 6000 Blackwell)
+
+| Config | Decode | VRAM | Claw | Custom | FC |
+|--------|:------:|:----:|:----:|:------:|:--:|
+| FP8 1×GPU | 44 tok/s | 91 GiB | 0.621 | 10/10 | 8/8 |
+| FP8 TP=2 | 66 tok/s | 91 GiB/GPU | 0.621 | 10/10 | 8/8 |
+
+Dense quality ceiling model. Consider the 26B-A4B MoE variant for 3-4x better speed at similar quality.
 
 ## Quantization Details
 
@@ -20,50 +29,26 @@ FP8 quantized version of [google/gemma-4-31B-it](https://huggingface.co/g
 |----------|-------|
 | Base model | google/gemma-4-31B-it |
 | Quant method | Native FP8 (float8_e4m3fn) |
-| Weight scheme | Per-block (128×128) |
-| Activation scheme | Dynamic |
-| Modules skipped | embed_tokens, lm_head, norms, visual encoder |
-
-## Benchmarks (RTX PRO 6000 Blackwell)
-
-Evaluated on protoLabs eval suite (quick profile: 10 claw-eval tasks + 9 custom suites + 8 function-call tests).
-
-**gemma-4-26B-A4B-it (MoE, 4B active):**
+| Weight scheme | Per-block (128×128), sharded save |
+| Size | 33.1 GB (vs 59 GB BF16, 44% reduction) |
 
-| Config | Decode | TTFT | VRAM | Context | Claw Avg | Custom | FC |
-|--------|:------:|:----:|:----:|:-------:|:--------:|:------:|:--:|
-| BF16 1×GPU | 141 tok/s | 52ms | 48.5 GiB | 32K | 0.626 | 9/9 | 8/8 |
-| FP8 1×GPU | 173 tok/s | 83ms | 25.7 GiB | 32K | 0.634 | 9/9 | 8/8 |
-| FP8 TP=2 | 208 tok/s | 254ms | ~13 GiB/GPU | 65K | — | — | — |
-| FP8 TP=2 | 153 tok/s | 301ms | ~45 GiB/GPU | 256K | — | — | — |
-
-**No quality degradation from FP8 quantization** — FP8 scores marginally higher than BF16 (0.634 vs 0.626).
-
-## Usage with vLLM
+## Usage
 
 ```bash
-# On-the-fly FP8 (recommended — loads BF16 weights, quantizes at load time)
-vllm serve google/gemma-4-26B-A4B-it \
+# Single GPU (44 tok/s)
+vllm serve google/gemma-4-31B-it \
     --quantization fp8 \
-    --max-model-len 32768 \
-    --enable-auto-tool-choice \
-    --tool-call-parser gemma4
-
-# Or use these pre-quantized weights directly
-vllm serve protoLabsAI/gemma-4-31B-it-FP8 \
     --max-model-len 32768
 
-# TP=2 for 256K context
-vllm serve google/gemma-4-26B-A4B-it \
-    --quantization fp8 --kv-cache-dtype fp8 \
+# TP=2 (66 tok/s, more context)
+vllm serve google/gemma-4-31B-it \
+    --quantization fp8 \
     --tensor-parallel-size 2 \
-    --max-model-len 262144
+    --max-model-len 65536
 ```
 
-**Requires vLLM built from main** (>= commit 08ed2b968, PR #38826) — Gemma 4 support not yet in a release.
+Requires vLLM from main (>= PR #38826).
 
 ## Produced By
 
-[protoLabsAI](https://github.com/protoLabsAI) — AI inference node running 2× RTX PRO 6000 Blackwell (192 GB VRAM).
-
-Quantization pipeline: [protoLabsAI/lab/experiments/quantize](https://github.com/protoLabsAI/lab/tree/main/experiments/quantize)
+[protoLabsAI](https://github.com/protoLabsAI)
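The diff's Quantization Details table describes per-block (128×128) weights in float8_e4m3fn. As a rough illustration of what that scheme means, here is a minimal NumPy sketch, assuming one float32 scale per 128×128 tile chosen so the tile's absolute maximum maps to the E4M3 max of 448. The helper names are invented for this example; the actual protoLabsAI quantization pipeline is not shown here.

```python
# Sketch of per-block (128x128) FP8 E4M3 weight quantization. Illustrative
# only: round_to_e4m3 approximates rounding onto the e4m3 grid in float32.
import numpy as np

E4M3_MAX = 448.0  # largest finite value representable in float8_e4m3fn

def round_to_e4m3(x):
    """Round float32 values onto the e4m3 grid (3 mantissa bits)."""
    out = np.zeros_like(x)
    nz = x != 0
    e = np.clip(np.floor(np.log2(np.abs(x[nz]))), -6, 8)  # e4m3fn exponent range
    q = np.exp2(e - 3)                                     # value spacing at that exponent
    out[nz] = np.clip(np.round(x[nz] / q) * q, -E4M3_MAX, E4M3_MAX)
    return out

def quantize_per_block(w, block=128):
    """Quantize a 2-D float32 matrix with one scale per block x block tile."""
    rows, cols = w.shape
    scales = np.zeros((rows // block, cols // block), dtype=np.float32)
    wq = np.zeros_like(w)
    for i in range(0, rows, block):
        for j in range(0, cols, block):
            tile = w[i:i + block, j:j + block]
            s = max(float(np.abs(tile).max()) / E4M3_MAX, 1e-12)  # tile max -> 448
            scales[i // block, j // block] = s
            wq[i:i + block, j:j + block] = round_to_e4m3(tile / s)
    return wq, scales

def dequantize(wq, scales, block=128):
    """Broadcast each tile's scale back over the tile and multiply."""
    return wq * np.kron(scales, np.ones((block, block), dtype=np.float32))

rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256)).astype(np.float32)
wq, scales = quantize_per_block(w)
w_hat = dequantize(wq, scales)
err = float(np.abs(w - w_hat).max())  # small reconstruction error expected
```

Dequantizing immediately, as above, is only a fidelity check; an inference engine such as vLLM keeps the FP8 tiles and applies the per-block scales inside its matmul kernels instead.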
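The new Size row claims 33.1 GB versus 59 GB for BF16, a 44% reduction. A back-of-envelope check of those figures, assuming roughly 31e9 parameters, one byte per FP8 weight plus one float32 scale per 128×128 block, and two bytes per BF16 weight (the reported checkpoint sizes themselves come from the model card):

```python
# Rough size arithmetic for a ~31B-parameter model; the 33.1 GB and 59 GB
# figures are the model card's, the parameter count here is an assumption.
params = 31e9
block = 128
fp8_bytes = params * 1 + (params / (block * block)) * 4  # weights + per-block scales
bf16_bytes = params * 2

print(f"FP8  ~ {fp8_bytes / 1e9:.1f} GB")   # prints "FP8  ~ 31.0 GB"
print(f"BF16 ~ {bf16_bytes / 1e9:.1f} GB")  # prints "BF16 ~ 62.0 GB"

# Reduction using the card's reported checkpoint sizes:
reduction = 1 - 33.1 / 59
print(f"reduction ~ {reduction:.0%}")       # prints "reduction ~ 44%"
```

The per-block scale overhead is negligible (a few MB); the gap between these estimates and the reported sizes presumably comes from modules kept in higher precision and from rounding conventions the card does not specify.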