# gemma-4-E4B-it-FP8

An FP8-quantized version of google/gemma-4-E4B-it (8B parameters, edge model), produced by protoLabsAI.
## Performance (RTX PRO 6000 Blackwell)
| Config | Decode (tok/s) | VRAM (GiB) | Claw | Custom | FC |
|---|---|---|---|---|---|
| FP8, 1×GPU | 182 | 11.5 | 0.443 | 10/10 | 8/8 |
A lightweight edge model: its 11.5 GiB VRAM footprint leaves room for other workloads on the same GPU.
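As a rough sanity check on that headroom claim, here is the arithmetic, assuming a 96 GB RTX PRO 6000 Blackwell (the capacity figure is an assumption of this sketch; verify against your card):

```python
# Headroom estimate for co-locating other workloads alongside the model.
# 96 GB is the assumed total VRAM (marketing GB converted to GiB);
# 11.5 GiB is the measured footprint from the table above.
TOTAL_VRAM_GIB = 96 * 10**9 / 2**30  # ~89.4 GiB
MODEL_VRAM_GIB = 11.5

headroom_gib = TOTAL_VRAM_GIB - MODEL_VRAM_GIB
print(f"free VRAM: {headroom_gib:.1f} GiB")  # roughly 78 GiB left over
```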
## Quantization Details
| Property | Value |
|---|---|
| Base model | google/gemma-4-E4B-it |
| Quant method | Native FP8 (float8_e4m3fn) |
| Weight scheme | Per-block (128×128) |
| Size | 12 GB (vs 15 GB BF16) |
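A minimal sketch of what per-block (128×128) FP8 quantization does to a weight matrix. NumPy has no float8 dtype, so the e4m3 dynamic range is emulated here by clipping to ±448 (the format's largest finite value); a real pipeline would cast the scaled values to `float8_e4m3fn` (e.g. `torch.float8_e4m3fn`), which also rounds the mantissa:

```python
import numpy as np

E4M3_MAX = 448.0  # max finite value representable in float8_e4m3fn
BLOCK = 128       # per-block scheme from the table above

def quantize_per_block(w: np.ndarray):
    """One float32 scale per 128x128 block; values scaled into e4m3 range."""
    rows, cols = w.shape
    scales = np.empty((rows // BLOCK, cols // BLOCK), dtype=np.float32)
    q = np.empty_like(w, dtype=np.float32)
    for i in range(0, rows, BLOCK):
        for j in range(0, cols, BLOCK):
            block = w[i:i + BLOCK, j:j + BLOCK]
            scale = np.abs(block).max() / E4M3_MAX
            scales[i // BLOCK, j // BLOCK] = scale
            q[i:i + BLOCK, j:j + BLOCK] = np.clip(block / scale, -E4M3_MAX, E4M3_MAX)
    return q, scales

def dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    # Broadcast each block's scale back over its 128x128 region.
    s = np.kron(scales, np.ones((BLOCK, BLOCK), dtype=np.float32))
    return q * s

rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256)).astype(np.float32)
q, scales = quantize_per_block(w)
w_hat = dequantize(q, scales)
```

Because only the scaling is emulated (no float8 mantissa rounding), `w_hat` reconstructs `w` up to float32 roundoff; with a real float8 cast, each block would additionally see relative error from the 3-bit mantissa.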
## Usage
```shell
vllm serve google/gemma-4-E4B-it \
  --quantization fp8 \
  --max-model-len 32768 \
  --enable-auto-tool-choice --tool-call-parser gemma4
```
Requires vLLM built from `main` (PR #38826 or later).
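Once the server is up, it exposes an OpenAI-compatible endpoint. Below is a sketch of a chat-completions request with tool calling enabled; the `get_weather` tool and the localhost endpoint are illustrative assumptions, not part of this card:

```python
import json

# Request payload for the vLLM OpenAI-compatible server started above.
# POST it (with any HTTP client) to http://localhost:8000/v1/chat/completions.
payload = {
    "model": "google/gemma-4-E4B-it",
    "messages": [{"role": "user", "content": "What's the weather in Paris?"}],
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_weather",  # hypothetical tool, for illustration only
            "description": "Look up current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
    "tool_choice": "auto",  # let the model decide whether to call the tool
}
body = json.dumps(payload)
```

With `--enable-auto-tool-choice`, the server parses the model's tool calls and returns them in the standard `tool_calls` field of the response message.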
## Produced By

protoLabsAI