gemma-4-E4B-it-FP8

An FP8-quantized version of google/gemma-4-E4B-it (an 8B-parameter edge model), produced by protoLabsAI.

Performance (RTX PRO 6000 Blackwell)

| Config | Decode | VRAM | Claw | Custom | FC |
|---|---|---|---|---|---|
| FP8, 1×GPU | 182 tok/s | 11.5 GiB | 0.443 | 10/10 | 8/8 |

A lightweight edge model: the 11.5 GiB VRAM footprint leaves room for other workloads on the same GPU.

Quantization Details

| Property | Value |
|---|---|
| Base model | google/gemma-4-E4B-it |
| Quant method | Native FP8 (float8_e4m3fn) |
| Weight scheme | Per-block (128×128) |
| Size | 12 GB (vs. 15 GB BF16) |
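
For context, here is a minimal PyTorch sketch of what per-block (128×128) FP8 weight quantization looks like. It illustrates the scheme only; the helper names and rounding choices are assumptions, not the exact pipeline protoLabsAI used.

import torch

# Largest representable magnitude in float8_e4m3fn (448.0).
FP8_MAX = torch.finfo(torch.float8_e4m3fn).max

def quantize_per_block(weight: torch.Tensor, block: int = 128):
    """One FP8 scale per 128x128 tile of a 2-D weight (hypothetical helper)."""
    rows, cols = weight.shape
    assert rows % block == 0 and cols % block == 0
    # View the matrix as a grid of (block x block) tiles.
    w = weight.reshape(rows // block, block, cols // block, block)
    # The per-tile absolute max sets the scale so each tile spans the FP8 range.
    amax = w.abs().amax(dim=(1, 3), keepdim=True).clamp(min=1e-12)
    scale = amax / FP8_MAX
    q = (w / scale).to(torch.float8_e4m3fn).reshape(rows, cols)
    return q, scale.reshape(rows // block, cols // block)

def dequantize_per_block(q: torch.Tensor, scale: torch.Tensor, block: int = 128):
    rows, cols = q.shape
    w = q.to(torch.float32).reshape(rows // block, block, cols // block, block)
    return (w * scale[:, None, :, None]).reshape(rows, cols)

w = torch.randn(256, 512)
q, s = quantize_per_block(w)
max_err = (dequantize_per_block(q, s) - w).abs().max()

In practice an FP8-aware GEMM applies the stored scales inside the kernel rather than materializing dequantized weights as above.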

Usage

vllm serve protoLabsAI/gemma-4-E4B-it-FP8 \
  --quantization fp8 \
  --max-model-len 32768 \
  --enable-auto-tool-choice --tool-call-parser gemma4

Requires vLLM built from main (PR #38826 or later).
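
Once the server is up, it exposes an OpenAI-compatible API. The sketch below assumes the default port (8000) and uses a made-up get_weather tool to exercise the automatic tool choice enabled above.

from openai import OpenAI

# vLLM serves an OpenAI-compatible endpoint; the API key is unused locally.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Hypothetical tool definition, purely for illustration.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="protoLabsAI/gemma-4-E4B-it-FP8",
    messages=[{"role": "user", "content": "What's the weather in Lagos right now?"}],
    tools=tools,
)
print(resp.choices[0].message.tool_calls)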

Produced By

protoLabsAI
