# Qwen3-Embedding-0.6B-modelopt-fp8
An FP8 (E4M3) quantized version of Qwen/Qwen3-Embedding-0.6B, produced with NVIDIA ModelOpt static FP8 quantization.
## Model Details
| Property | Value |
|---|---|
| Base Model | Qwen/Qwen3-Embedding-0.6B |
| Architecture | Qwen3 (28 layers, 16 heads, 8 KV heads) |
| Hidden Size | 1024 |
| Intermediate Size | 3072 |
| Vocab Size | 151,669 |
| Max Position Embeddings | 32,768 |
| Quantization | FP8 E4M3 (weights + input activations) |
| Quantization Method | NVIDIA ModelOpt (mtq.FP8_DEFAULT_CFG) |
| Model Size | 717 MB (safetensors) |
## Quantization Details

### Method
- Tool: NVIDIA ModelOpt static FP8 quantization
- Format: FP8 E4M3 (torch.float8_e4m3fn)
- Scope: All linear layers (QKV projections, output projections, MLP layers) are quantized to FP8. Embeddings and LayerNorms remain in BF16.
- Scales: Per-tensor weight scales and input activation scales are stored alongside the quantized weights.
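The per-tensor scheme above can be sketched in a few lines. This is a minimal illustration, assuming the common convention `scale = amax / 448` (448 being the largest finite E4M3 value); clamping stands in for the actual E4M3 rounding that a real FP8 cast performs, and the helper names are hypothetical, not ModelOpt's API:

```python
# Simplified per-tensor FP8 (E4M3) quantization sketch.
# Assumption: scale is derived from the tensor's absolute maximum (amax)
# divided by 448.0, the largest finite value representable in E4M3.

E4M3_MAX = 448.0

def fp8_quantize(weights):
    """Quantize a flat list of floats with a single per-tensor scale."""
    amax = max(abs(w) for w in weights)
    scale = amax / E4M3_MAX if amax > 0 else 1.0
    # A real FP8 cast would also round each value to the nearest E4M3
    # representable number; here we only rescale and clamp to the range.
    quantized = [max(-E4M3_MAX, min(E4M3_MAX, w / scale)) for w in weights]
    return quantized, scale

def fp8_dequantize(quantized, scale):
    """Recover approximate full-precision values from FP8 values + scale."""
    return [q * scale for q in quantized]
```

The scale is what gets stored alongside the quantized weights, so inference kernels can dequantize (or fuse the scale into the matmul) on the fly.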
### Calibration
- Dataset: CNN/DailyMail (real text data)
- Samples: 64
- Sequence Length: 256
- Batch Size: 4
- Activation Scales: Collected at 4 points per layer (post-layernorm, attention output, MLP input, SiLU output), saved in calib.json
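Calibration boils down to running sample batches through the model and tracking the running absolute maximum of activations at each collection point. The sketch below illustrates that bookkeeping only; `AmaxTracker` is a hypothetical helper, not part of ModelOpt, which performs this internally via hooks during `mtq.quantize`:

```python
# Sketch of activation-scale collection during calibration.
# Assumption: each collection point keeps a running max of |activation|,
# converted to an FP8 scale (amax / 448.0) once calibration finishes.

import json

E4M3_MAX = 448.0

class AmaxTracker:
    def __init__(self):
        # e.g. "layers.0.post_layernorm" -> running max of |activation|
        self.amax = {}

    def observe(self, point, activations):
        batch_max = max(abs(a) for a in activations)
        self.amax[point] = max(self.amax.get(point, 0.0), batch_max)

    def scales(self):
        """Convert collected amax values to per-point FP8 scales."""
        return {p: a / E4M3_MAX for p, a in self.amax.items()}

tracker = AmaxTracker()
tracker.observe("layers.0.post_layernorm", [0.5, -3.2, 1.1])
tracker.observe("layers.0.post_layernorm", [2.0, -0.7])
# The resulting scales could then be serialized, e.g. to a calib.json file.
calib_blob = json.dumps(tracker.scales())
```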
## Precision Evaluation
Cosine similarity between this FP8 model and the original BF16 model, measured on CNN/DailyMail text inputs (threshold: 0.99):
| Batch | Seq Len | Cosine Similarity | Result |
|---|---|---|---|
| 1 | 128 | 0.9936 | PASS |
| 2 | 512 | 0.9934 | PASS |
| 4 | 1024 | 0.9930 | PASS |
| 8 | 2048 | 0.9927 | PASS |
| 8 | 100 | 0.9937 | PASS |
| 8 | 500 | 0.9933 | PASS |
| 8 | 4000 | 0.9924 | PASS |
All configurations achieve >0.99 cosine similarity with the BF16 baseline.
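The evaluation amounts to embedding the same texts with both models and comparing the resulting vectors. A minimal sketch of the metric itself (model loading omitted; the vectors below are illustrative values, not real embeddings):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# In the actual evaluation, one vector comes from the FP8 model and the
# other from the BF16 baseline for the same input text; a configuration
# passes if the similarity exceeds the 0.99 threshold.
bf16_vec = [0.12, -0.48, 0.33]   # illustrative, not real embeddings
fp8_vec  = [0.12, -0.47, 0.34]
sim = cosine_similarity(fp8_vec, bf16_vec)
```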
## File Structure
```
.
├── config.json              # Model config with quantization_config
├── model.safetensors        # FP8 quantized weights + scales
├── calib.json               # Activation scales per layer
├── tokenizer.json           # Tokenizer
├── tokenizer_config.json    # Tokenizer config
├── vocab.json               # Vocabulary
├── merges.txt               # BPE merges
└── generation_config.json   # Generation config
```
## Intended Use

This model is intended for efficient FP8 inference of text embeddings on NVIDIA GPUs with native FP8 support (Hopper architecture or newer).