# Qwen3-Embedding-0.6B-modelopt-fp8
An FP8 (E4M3) quantized version of Qwen/Qwen3-Embedding-0.6B, produced with NVIDIA ModelOpt static FP8 quantization.
## Model Details
| Property | Value |
|---|---|
| Base Model | Qwen/Qwen3-Embedding-0.6B |
| Architecture | Qwen3 (28 layers, 16 heads, 8 KV heads) |
| Hidden Size | 1024 |
| Intermediate Size | 3072 |
| Vocab Size | 151,669 |
| Max Position Embeddings | 32,768 |
| Quantization | FP8 E4M3 (weights + input activations) |
| Quantization Method | NVIDIA ModelOpt (mtq.FP8_DEFAULT_CFG) |
| Model Size | 717 MB (safetensors) |
## Quantization Details

### Method
- Tool: NVIDIA ModelOpt static FP8 quantization
- Format: FP8 E4M3 (torch.float8_e4m3fn)
- Scope: All linear layers (QKV projections, output projections, MLP layers) are quantized to FP8. Embeddings and LayerNorms remain in BF16.
- Scales: Per-tensor weight scales and input activation scales are stored alongside the quantized weights.
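The per-tensor scheme above can be sketched in a few lines. This is a minimal illustration, assuming the common convention `scale = amax / 448` (448 being the largest finite E4M3 value); clamping stands in for the actual E4M3 rounding that a real FP8 cast performs, and the helper names are hypothetical, not ModelOpt's API:

```python
# Simplified per-tensor FP8 (E4M3) quantization sketch.
# Assumption: scale is derived from the tensor's absolute maximum (amax)
# divided by 448.0, the largest finite value representable in E4M3.

E4M3_MAX = 448.0

def fp8_quantize(weights):
    """Quantize a flat list of floats with a single per-tensor scale."""
    amax = max(abs(w) for w in weights)
    scale = amax / E4M3_MAX if amax > 0 else 1.0
    # A real FP8 cast would also round each value to the nearest E4M3
    # representable number; here we only rescale and clamp to the range.
    quantized = [max(-E4M3_MAX, min(E4M3_MAX, w / scale)) for w in weights]
    return quantized, scale

def fp8_dequantize(quantized, scale):
    """Recover approximate full-precision values from FP8 values + scale."""
    return [q * scale for q in quantized]
```

The scale is what gets stored alongside the quantized weights, so inference kernels can dequantize (or fuse the scale into the matmul) on the fly.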
### Calibration
- Dataset: CNN/DailyMail (real text data)
- Samples: 64
- Sequence Length: 256
- Batch Size: 4
- Activation Scales: Collected at 4 points per layer (post-layernorm, attention output, MLP input, SiLU output), saved in calib.json
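Calibration boils down to running sample batches through the model and tracking the running absolute maximum of activations at each collection point. The sketch below illustrates that bookkeeping only; `AmaxTracker` is a hypothetical helper, not part of ModelOpt, which performs this internally via hooks during `mtq.quantize`:

```python
# Sketch of activation-scale collection during calibration.
# Assumption: each collection point keeps a running max of |activation|,
# converted to an FP8 scale (amax / 448.0) once calibration finishes.

import json

E4M3_MAX = 448.0

class AmaxTracker:
    def __init__(self):
        # e.g. "layers.0.post_layernorm" -> running max of |activation|
        self.amax = {}

    def observe(self, point, activations):
        batch_max = max(abs(a) for a in activations)
        self.amax[point] = max(self.amax.get(point, 0.0), batch_max)

    def scales(self):
        """Convert collected amax values to per-point FP8 scales."""
        return {p: a / E4M3_MAX for p, a in self.amax.items()}

tracker = AmaxTracker()
tracker.observe("layers.0.post_layernorm", [0.5, -3.2, 1.1])
tracker.observe("layers.0.post_layernorm", [2.0, -0.7])
# The resulting scales could then be serialized, e.g. to a calib.json file.
calib_blob = json.dumps(tracker.scales())
```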
## Precision Evaluation
Cosine similarity between this FP8 model and the original BF16 model, measured on CNN/DailyMail text inputs (threshold: 0.99):
| Batch | Seq Len | Cosine Similarity | Result |
|---|---|---|---|
| 1 | 128 | 0.9936 | PASS |
| 2 | 512 | 0.9934 | PASS |
| 4 | 1024 | 0.9930 | PASS |
| 8 | 2048 | 0.9927 | PASS |
| 8 | 100 | 0.9937 | PASS |
| 8 | 500 | 0.9933 | PASS |
| 8 | 4000 | 0.9924 | PASS |
All configurations achieve >0.99 cosine similarity with the BF16 baseline.
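The evaluation amounts to embedding the same texts with both models and comparing the resulting vectors. A minimal sketch of the metric itself (model loading omitted; the vectors below are illustrative values, not real embeddings):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# In the actual evaluation, one vector comes from the FP8 model and the
# other from the BF16 baseline for the same input text; a configuration
# passes if the similarity exceeds the 0.99 threshold.
bf16_vec = [0.12, -0.48, 0.33]   # illustrative, not real embeddings
fp8_vec  = [0.12, -0.47, 0.34]
sim = cosine_similarity(fp8_vec, bf16_vec)
```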
## File Structure
```
.
├── config.json              # Model config with quantization_config
├── model.safetensors        # FP8 quantized weights + scales
├── calib.json               # Activation scales per layer
├── tokenizer.json           # Tokenizer
├── tokenizer_config.json    # Tokenizer config
├── vocab.json               # Vocabulary
├── merges.txt               # BPE merges
└── generation_config.json   # Generation config
```
## Intended Use

This model is intended for efficient FP8 inference of text embeddings on NVIDIA GPUs with native FP8 support (Hopper architecture or newer).