# gemma-3-270m-modelopt-fp8
An FP8 (E4M3) quantized version of google/gemma-3-270m, produced with NVIDIA ModelOpt static FP8 quantization.
## Model Details
| Property | Value |
|---|---|
| Base Model | google/gemma-3-270m |
| Architecture | Gemma3 (18 layers, 4 heads, 1 KV head) |
| Hidden Size | 640 |
| Intermediate Size | 2048 |
| Head Dim | 256 |
| Vocab Size | 262,144 |
| Max Position Embeddings | 32,768 |
| Attention | Sliding window (512) + full attention (every 6th layer) |
| Quantization | FP8 E4M3 (weights + input activations) |
| Quantization Method | NVIDIA ModelOpt (mtq.FP8_DEFAULT_CFG) |
| Model Size | 416 MB (safetensors) |
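The interleaved attention layout in the table can be sketched as follows. This is a small illustration, assuming the "every 6th layer" pattern counts layers from 1 (consistent with the published Gemma 3 layout); it is not code from the quantization pipeline.

```python
# 18 layers: every 6th layer uses full (global) attention, the rest use
# a 512-token sliding window. 1-based counting is an assumption.
NUM_LAYERS = 18

layer_types = [
    "full_attention" if (i + 1) % 6 == 0 else "sliding_attention"
    for i in range(NUM_LAYERS)
]

print(layer_types.count("full_attention"))  # 3 full-attention layers
print([i for i, t in enumerate(layer_types) if t == "full_attention"])  # [5, 11, 17]
```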
## Quantization Details

### Method
- Tool: NVIDIA ModelOpt static FP8 quantization
- Format: FP8 E4M3 (torch.float8_e4m3fn)
- Scope: All linear layers (QKV projections, output projections, MLP layers) are quantized to FP8. Embeddings and RMSNorms remain in BF16.
- Scales: Per-tensor weight scales and input activation scales are stored alongside the quantized weights.
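The per-tensor scaling described above can be illustrated numerically. This is a minimal sketch of the general static FP8 scheme, not the ModelOpt implementation: the scale maps the tensor's observed absolute maximum onto the E4M3 dynamic range, and values outside that range are clamped (rounding to the E4M3 grid is omitted for brevity).

```python
E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

def compute_scale(amax: float) -> float:
    """Per-tensor scale so that the observed amax maps to the E4M3 maximum."""
    return amax / E4M3_MAX

def fake_quantize(x: float, scale: float) -> float:
    """Scale, clamp to the E4M3 range, and rescale (grid rounding omitted)."""
    q = max(-E4M3_MAX, min(E4M3_MAX, x / scale))
    return q * scale

scale = compute_scale(amax=448.0)   # hypothetical calibrated amax
print(scale)                        # 1.0
print(fake_quantize(3.5, scale))    # 3.5 -- in-range value survives
print(fake_quantize(500.0, scale))  # 448.0 -- out-of-range value clamps
```

At inference time the stored weight and input-activation scales let the kernel dequantize (or run the matmul directly in FP8) without recomputing statistics, which is what makes the scheme "static".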
### Calibration
- Dataset: CNN/DailyMail (real text data)
- Samples: 64
- Sequence Length: 256
- Batch Size: 4
- Activation Scales: Collected at 4 points per layer (post-layernorm, attention output, MLP input, GELU output), saved in calib.json
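A hypothetical sketch of what one layer's entry in calib.json could look like. The exact keys and values here are assumptions for illustration; the card only states that four activation-scale collection points are saved per layer.

```python
import json

# Hypothetical per-layer entry: key names and values are illustrative only.
calib_entry = {
    "model.layers.0": {
        "post_layernorm": 0.031,
        "attention_output": 0.052,
        "mlp_input": 0.047,
        "gelu_output": 0.089,
    }
}

print(json.dumps(calib_entry, indent=2))
```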
## Precision Evaluation
Cosine similarity between this FP8 model and the original BF16 model, measured on CNN/DailyMail text inputs (threshold: 0.99):
| Batch | Seq Len | Cosine Similarity | Result |
|---|---|---|---|
| 1 | 128 | 0.9919 | PASS |
| 2 | 512 | 0.9937 | PASS |
| 4 | 1024 | 0.9935 | PASS |
| 8 | 2048 | 0.9937 | PASS |
| 8 | 100 | 0.9920 | PASS |
| 8 | 500 | 0.9933 | PASS |
| 8 | 4000 | 0.9937 | PASS |
All configurations achieve >0.99 cosine similarity with the BF16 baseline.
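The metric in the table can be sketched as plain cosine similarity between the two models' logits. This is an illustrative implementation; the exact reduction used in the evaluation (e.g. which tensors are flattened and averaged) is not stated on the card, and the logit values below are made up.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two flattened logit vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

bf16_logits = [0.50, -1.20, 2.10, 0.30]  # illustrative baseline logits
fp8_logits = [0.49, -1.22, 2.08, 0.31]   # small FP8 rounding error
print(cosine_similarity(bf16_logits, fp8_logits) > 0.99)  # True
```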
## File Structure
```
.
├── config.json               # Model config with quantization_config
├── model.safetensors         # FP8 quantized weights + scales
├── calib.json                # Activation scales per layer
├── tokenizer.json            # Tokenizer
├── tokenizer_config.json     # Tokenizer config
├── special_tokens_map.json   # Special tokens
├── added_tokens.json         # Added tokens
└── generation_config.json    # Generation config
```
## Intended Use
This model is intended for efficient FP8 inference on NVIDIA GPUs with FP8 support (Hopper architecture and above).
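A hypothetical loading sketch with Hugging Face transformers. Whether the FP8 checkpoint loads directly in transformers depends on its support for the `quantization_config` that ModelOpt exports; deployment through TensorRT-LLM or vLLM is the more common path for ModelOpt FP8 checkpoints, and a Hopper-class GPU is assumed.

```python
MODEL_ID = "1kxia/gemma-3-270m-modelopt-fp8"

def generate(prompt: str, max_new_tokens: int = 64) -> str:
    # Imports are deferred so this sketch can be shown without the
    # libraries installed; running it requires an FP8-capable GPU and
    # downloads the checkpoint from the Hub.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, device_map="cuda", torch_dtype=torch.bfloat16
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output[0], skip_special_tokens=True)
```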