Qwen3-Embedding-0.6B-modelopt-fp8

FP8 (E4M3) quantized version of Qwen/Qwen3-Embedding-0.6B, produced with NVIDIA ModelOpt static FP8 quantization.

Model Details

| Property | Value |
|---|---|
| Base Model | Qwen/Qwen3-Embedding-0.6B |
| Architecture | Qwen3 (28 layers, 16 heads, 8 KV heads) |
| Hidden Size | 1024 |
| Intermediate Size | 3072 |
| Vocab Size | 151,669 |
| Max Position Embeddings | 32,768 |
| Quantization | FP8 E4M3 (weights + input activations) |
| Quantization Method | NVIDIA ModelOpt (mtq.FP8_DEFAULT_CFG) |
| Model Size | 717 MB (safetensors) |

Quantization Details

Method

  • Tool: NVIDIA ModelOpt static FP8 quantization
  • Format: FP8 E4M3 (torch.float8_e4m3fn)
  • Scope: All linear layers (QKV projections, output projections, MLP layers) are quantized to FP8. Embeddings and LayerNorms remain in BF16.
  • Scales: Per-tensor weight scales and input activation scales are stored alongside the quantized weights.
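The per-tensor scaling described above can be illustrated with a simplified NumPy sketch (this models only the scale computation, clamping, and dequantization; real FP8 additionally rounds values to the E4M3 grid, and this is not the ModelOpt implementation):

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

def fp8_quantize(w: np.ndarray):
    """Simulate per-tensor FP8 scaling: map the tensor's amax to E4M3_MAX."""
    scale = np.abs(w).max() / E4M3_MAX            # per-tensor weight scale
    q = np.clip(w / scale, -E4M3_MAX, E4M3_MAX)   # values now fit the FP8 range
    return q, float(scale)

def fp8_dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover (approximately) the original values from FP8 values + scale."""
    return q * scale

w = np.random.randn(4, 8).astype(np.float32)
q, s = fp8_quantize(w)
w_hat = fp8_dequantize(q, s)   # close to w; exact here since rounding is omitted
```

The stored scale is what lets the runtime interpret the narrow FP8 values at their original magnitude during inference.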

Calibration

  • Dataset: CNN/DailyMail (real text data)
  • Samples: 64
  • Sequence Length: 256
  • Batch Size: 4
  • Activation Scales: Collected at 4 points per layer (post-layernorm, attention output, MLP input, SiLU output), saved in calib.json
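Calibration amounts to running representative text through the model and recording the maximum absolute activation seen at each observation point. A minimal, framework-free sketch of that bookkeeping (the observer class and point name are illustrative, not ModelOpt internals):

```python
import json
from collections import defaultdict

class AmaxObserver:
    """Track the running max-abs activation per named observation point."""

    def __init__(self):
        self.amax = defaultdict(float)

    def observe(self, point: str, activation):
        peak = max(abs(x) for x in activation)
        self.amax[point] = max(self.amax[point], peak)

    def scales(self, fp8_max: float = 448.0):
        # Per-tensor activation scale: observed amax mapped onto the E4M3 range.
        return {p: a / fp8_max for p, a in self.amax.items()}

obs = AmaxObserver()
for batch in [[0.1, -2.5, 1.0], [3.0, -0.5]]:  # stand-ins for calibration batches
    obs.observe("layers.0.post_layernorm", batch)
payload = json.dumps(obs.scales())  # scales like these end up in calib.json
```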

Precision Evaluation

Cosine similarity between this FP8 model and the original BF16 model, measured on CNN/DailyMail text inputs (threshold: 0.99):

| Batch | Seq Len | Cosine Similarity | Result |
|---|---|---|---|
| 1 | 128 | 0.9936 | PASS |
| 2 | 512 | 0.9934 | PASS |
| 4 | 1024 | 0.9930 | PASS |
| 8 | 2048 | 0.9927 | PASS |
| 8 | 100 | 0.9937 | PASS |
| 8 | 500 | 0.9933 | PASS |
| 8 | 4000 | 0.9924 | PASS |

All configurations achieve >0.99 cosine similarity with the BF16 baseline.
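The pass/fail check above reduces to a row-wise cosine-similarity comparison between the two models' embedding batches. A NumPy sketch (function names are illustrative; the 0.99 threshold is the one used in the table):

```python
import numpy as np

def batch_cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Mean cosine similarity between corresponding rows of two embedding batches."""
    a_n = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b_n = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return float((a_n * b_n).sum(axis=-1).mean())

def check(fp8_emb: np.ndarray, bf16_emb: np.ndarray, threshold: float = 0.99):
    """Return (similarity, passed) for an FP8-vs-BF16 embedding comparison."""
    sim = batch_cosine_similarity(fp8_emb, bf16_emb)
    return sim, sim > threshold
```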

File Structure

.
β”œβ”€β”€ config.json              # Model config with quantization_config
β”œβ”€β”€ model.safetensors        # FP8 quantized weights + scales
β”œβ”€β”€ calib.json               # Activation scales per layer
β”œβ”€β”€ tokenizer.json           # Tokenizer
β”œβ”€β”€ tokenizer_config.json    # Tokenizer config
β”œβ”€β”€ vocab.json               # Vocabulary
β”œβ”€β”€ merges.txt               # BPE merges
└── generation_config.json   # Generation config

Intended Use

This model is intended for efficient FP8 inference of text embeddings on NVIDIA GPUs with FP8 support (Hopper architecture and above).
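Qwen3-Embedding models produce a sentence embedding by pooling the hidden state of the sequence's last token and L2-normalizing it, so cosine similarity between embeddings is just a dot product. A NumPy sketch of that pooling step (the hidden states here are random placeholders, not model output; hidden size 1024 matches the card):

```python
import numpy as np

def last_token_pool(hidden: np.ndarray, lengths: np.ndarray) -> np.ndarray:
    """Take each sequence's last real token's hidden state, then L2-normalize.

    hidden: (batch, seq_len, hidden_size); lengths: (batch,) true token counts.
    """
    pooled = hidden[np.arange(hidden.shape[0]), lengths - 1]  # (batch, hidden_size)
    return pooled / np.linalg.norm(pooled, axis=-1, keepdims=True)

# Placeholder activations: batch of 2, seq_len 5, hidden size 1024.
hidden = np.random.randn(2, 5, 1024).astype(np.float32)
emb = last_token_pool(hidden, np.array([5, 3]))
sims = emb @ emb.T  # pairwise cosine similarities, since rows are unit-norm
```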
