gemma-4-E4B-it-FP8

An FP8-quantized version of google/gemma-4-E4B-it (an 8B-parameter edge model), produced by protoLabsAI.

Performance (RTX PRO 6000 Blackwell)

| Config | Decode | VRAM | Claw | Custom | FC |
|---|---|---|---|---|---|
| FP8, 1×GPU | 182 tok/s | 11.5 GiB | 0.443 | 10/10 | 8/8 |

A lightweight edge model: the 11.5 GiB VRAM footprint leaves room for other workloads on the same GPU.

Quantization Details

| Property | Value |
|---|---|
| Base model | google/gemma-4-E4B-it |
| Quant method | Native FP8 (float8_e4m3fn) |
| Weight scheme | Per-block (128×128) |
| Size | 12 GB (vs. 15 GB BF16) |
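
For context, here is a minimal PyTorch sketch of what per-block (128×128) FP8 weight quantization looks like. It illustrates the scheme only; the helper names and rounding choices are assumptions, not the exact pipeline protoLabsAI used.

import torch

# Largest representable magnitude in float8_e4m3fn (448.0).
FP8_MAX = torch.finfo(torch.float8_e4m3fn).max

def quantize_per_block(weight: torch.Tensor, block: int = 128):
    """One FP8 scale per 128x128 tile of a 2-D weight (hypothetical helper)."""
    rows, cols = weight.shape
    assert rows % block == 0 and cols % block == 0
    # View the matrix as a grid of (block x block) tiles.
    w = weight.reshape(rows // block, block, cols // block, block)
    # The per-tile absolute max sets the scale so each tile spans the FP8 range.
    amax = w.abs().amax(dim=(1, 3), keepdim=True).clamp(min=1e-12)
    scale = amax / FP8_MAX
    q = (w / scale).to(torch.float8_e4m3fn).reshape(rows, cols)
    return q, scale.reshape(rows // block, cols // block)

def dequantize_per_block(q: torch.Tensor, scale: torch.Tensor, block: int = 128):
    rows, cols = q.shape
    w = q.to(torch.float32).reshape(rows // block, block, cols // block, block)
    return (w * scale[:, None, :, None]).reshape(rows, cols)

w = torch.randn(256, 512)
q, s = quantize_per_block(w)
max_err = (dequantize_per_block(q, s) - w).abs().max()

In practice an FP8-aware GEMM applies the stored scales inside the kernel rather than materializing dequantized weights as above.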

Usage

vllm serve protoLabsAI/gemma-4-E4B-it-FP8 \
  --quantization fp8 \
  --max-model-len 32768 \
  --enable-auto-tool-choice --tool-call-parser gemma4

Requires vLLM built from main (PR #38826 or later).
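
Once the server is up, it exposes an OpenAI-compatible API. The sketch below assumes the default port (8000) and uses a made-up get_weather tool to exercise the automatic tool choice enabled above.

from openai import OpenAI

# vLLM serves an OpenAI-compatible endpoint; the API key is unused locally.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Hypothetical tool definition, purely for illustration.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="protoLabsAI/gemma-4-E4B-it-FP8",
    messages=[{"role": "user", "content": "What's the weather in Lagos right now?"}],
    tools=tools,
)
print(resp.choices[0].message.tool_calls)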

Produced By

protoLabsAI
