# gemma-3-270m-modelopt-fp8
An FP8 (E4M3) quantized version of google/gemma-3-270m, produced with NVIDIA ModelOpt static FP8 quantization.
## Model Details
| Property | Value |
|---|---|
| Base Model | google/gemma-3-270m |
| Architecture | Gemma3 (18 layers, 4 heads, 1 KV head) |
| Hidden Size | 640 |
| Intermediate Size | 2048 |
| Head Dim | 256 |
| Vocab Size | 262,144 |
| Max Position Embeddings | 32,768 |
| Attention | Sliding window (512) + full attention (every 6th layer) |
| Quantization | FP8 E4M3 (weights + input activations) |
| Quantization Method | NVIDIA ModelOpt (mtq.FP8_DEFAULT_CFG) |
| Model Size | 416 MB (safetensors) |
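The interleaved attention layout in the table can be sketched as follows. This is a small illustration, assuming the "every 6th layer" pattern counts layers from 1 (consistent with the published Gemma 3 layout); it is not code from the quantization pipeline.

```python
# 18 layers: every 6th layer uses full (global) attention, the rest use
# a 512-token sliding window. 1-based counting is an assumption.
NUM_LAYERS = 18

layer_types = [
    "full_attention" if (i + 1) % 6 == 0 else "sliding_attention"
    for i in range(NUM_LAYERS)
]

print(layer_types.count("full_attention"))  # 3 full-attention layers
print([i for i, t in enumerate(layer_types) if t == "full_attention"])  # [5, 11, 17]
```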
## Quantization Details

### Method
- Tool: NVIDIA ModelOpt static FP8 quantization
- Format: FP8 E4M3 (torch.float8_e4m3fn)
- Scope: All linear layers (QKV projections, output projections, MLP layers) are quantized to FP8. Embeddings and RMSNorms remain in BF16.
- Scales: Per-tensor weight scales and input activation scales are stored alongside the quantized weights.
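The per-tensor scaling described above can be illustrated numerically. This is a minimal sketch of the general static FP8 scheme, not the ModelOpt implementation: the scale maps the tensor's observed absolute maximum onto the E4M3 dynamic range, and values outside that range are clamped (rounding to the E4M3 grid is omitted for brevity).

```python
E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

def compute_scale(amax: float) -> float:
    """Per-tensor scale so that the observed amax maps to the E4M3 maximum."""
    return amax / E4M3_MAX

def fake_quantize(x: float, scale: float) -> float:
    """Scale, clamp to the E4M3 range, and rescale (grid rounding omitted)."""
    q = max(-E4M3_MAX, min(E4M3_MAX, x / scale))
    return q * scale

scale = compute_scale(amax=448.0)   # hypothetical calibrated amax
print(scale)                        # 1.0
print(fake_quantize(3.5, scale))    # 3.5 -- in-range value survives
print(fake_quantize(500.0, scale))  # 448.0 -- out-of-range value clamps
```

At inference time the stored weight and input-activation scales let the kernel dequantize (or run the matmul directly in FP8) without recomputing statistics, which is what makes the scheme "static".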
### Calibration
- Dataset: CNN/DailyMail (real text data)
- Samples: 64
- Sequence Length: 256
- Batch Size: 4
- Activation Scales: Collected at 4 points per layer (post-layernorm, attention output, MLP input, GELU output), saved in calib.json
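A hypothetical sketch of what one layer's entry in calib.json could look like. The exact keys and values here are assumptions for illustration; the card only states that four activation-scale collection points are saved per layer.

```python
import json

# Hypothetical per-layer entry: key names and values are illustrative only.
calib_entry = {
    "model.layers.0": {
        "post_layernorm": 0.031,
        "attention_output": 0.052,
        "mlp_input": 0.047,
        "gelu_output": 0.089,
    }
}

print(json.dumps(calib_entry, indent=2))
```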
## Precision Evaluation
Cosine similarity between this FP8 model and the original BF16 model, measured on CNN/DailyMail text inputs (threshold: 0.99):
| Batch | Seq Len | Cosine Similarity | Result |
|---|---|---|---|
| 1 | 128 | 0.9919 | PASS |
| 2 | 512 | 0.9937 | PASS |
| 4 | 1024 | 0.9935 | PASS |
| 8 | 2048 | 0.9937 | PASS |
| 8 | 100 | 0.9920 | PASS |
| 8 | 500 | 0.9933 | PASS |
| 8 | 4000 | 0.9937 | PASS |
All configurations achieve >0.99 cosine similarity with the BF16 baseline.
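The metric in the table can be sketched as plain cosine similarity between the two models' logits. This is an illustrative implementation; the exact reduction used in the evaluation (e.g. which tensors are flattened and averaged) is not stated on the card, and the logit values below are made up.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two flattened logit vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

bf16_logits = [0.50, -1.20, 2.10, 0.30]  # illustrative baseline logits
fp8_logits = [0.49, -1.22, 2.08, 0.31]   # small FP8 rounding error
print(cosine_similarity(bf16_logits, fp8_logits) > 0.99)  # True
```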
## File Structure
```
.
├── config.json               # Model config with quantization_config
├── model.safetensors         # FP8 quantized weights + scales
├── calib.json                # Activation scales per layer
├── tokenizer.json            # Tokenizer
├── tokenizer_config.json     # Tokenizer config
├── special_tokens_map.json   # Special tokens
├── added_tokens.json         # Added tokens
└── generation_config.json    # Generation config
```
## Intended Use
This model is intended for efficient FP8 inference on NVIDIA GPUs with FP8 support (Hopper architecture and above).
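A hypothetical loading sketch with Hugging Face transformers. Whether the FP8 checkpoint loads directly in transformers depends on its support for the `quantization_config` that ModelOpt exports; deployment through TensorRT-LLM or vLLM is the more common path for ModelOpt FP8 checkpoints, and a Hopper-class GPU is assumed.

```python
MODEL_ID = "1kxia/gemma-3-270m-modelopt-fp8"

def generate(prompt: str, max_new_tokens: int = 64) -> str:
    # Imports are deferred so this sketch can be shown without the
    # libraries installed; running it requires an FP8-capable GPU and
    # downloads the checkpoint from the Hub.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, device_map="cuda", torch_dtype=torch.bfloat16
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output[0], skip_special_tokens=True)
```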