# Qwen3-VL-Embedding-2B-FP8-DYNAMIC

FP8-quantized version of Qwen3-VL-Embedding-2B, optimized for efficient deployment with vLLM. Through selective quantization, this model cuts memory use by roughly 50% compared to bfloat16 while maintaining high-quality multimodal embeddings.
## Model Description
- Base Model: Qwen/Qwen3-VL-Embedding-2B
- Model Type: Multimodal Embedding Model (Text + Vision)
- Quantization: FP8_DYNAMIC via llmcompressor
- Architecture: Qwen3VLModel
- Parameters: 2B
- License: Apache 2.0
## Key Features
- Memory Efficient: ~50% VRAM reduction compared to bfloat16
- Visual Quality Preserved: Full precision visual layers (24 blocks)
- Matryoshka Embeddings: Supports flexible dimensions (from 64 to 2048)
- OpenAI-Compatible API: Works with OpenAI SDK via vLLM
## Quantization Details

**Method:** FP8_DYNAMIC
- Weights: FP8 per-channel quantization (symmetric)
- Activations: FP8 dynamic per-token quantization (symmetric)
- Visual Layers: Kept in full precision (bfloat16) for quality
- Target Layers: Linear layers only
Quantization configuration:

```python
QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["re:visual.*"],  # preserve visual layer quality
)
```
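As a quick sanity check, the `ignore` pattern can be exercised against representative module names. This assumes the `re:` prefix denotes a regex matched from the start of the module name; the names below are illustrative, not taken from the actual model graph:

```python
import re

# Pattern from the recipe above, with the "re:" prefix stripped.
pattern = re.compile(r"visual.*")

# Hypothetical module names for illustration only.
names = [
    "visual.blocks.0.attn.qkv",         # visual tower -> kept in bfloat16
    "visual.merger.mlp.0",              # visual tower -> kept in bfloat16
    "model.layers.0.self_attn.q_proj",  # text layer -> quantized to FP8
]

for name in names:
    kept = bool(pattern.match(name))
    print(f"{name}: {'skip (bf16)' if kept else 'quantize (FP8)'}")
```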
## Model Specifications
| Property | Value |
|---|---|
| Text Layers | 28 layers (FP8 quantized) |
| Visual Layers | 24 blocks (bfloat16, full precision) |
| Default Embedding Dim | 2048 |
| Matryoshka Dimensions | 64 - 2048 |
| Model Size | ~2.8 GB (vs ~4.6 GB bfloat16) |
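The size figures in the table are roughly consistent with back-of-the-envelope arithmetic: FP8 stores one byte per weight, bfloat16 two. The 1.6B/0.4B text/visual split below is an illustrative assumption, not an official figure:

```python
# Rough lower-bound size estimate; the parameter split is assumed.
text_params = 1.6e9    # quantized to FP8 -> 1 byte each
visual_params = 0.4e9  # kept in bfloat16 -> 2 bytes each

fp8_mixed_gb = (text_params * 1 + visual_params * 2) / 1e9
bf16_gb = (text_params + visual_params) * 2 / 1e9

print(f"FP8-mixed: ~{fp8_mixed_gb:.1f} GB, bf16: ~{bf16_gb:.1f} GB")
# The checkpoint also carries quantization scales and non-Linear tensors,
# so the real file (~2.8 GB) is somewhat larger than this lower bound.
```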
## Intended Uses

### Supported Input Types
- Text only
- Image only
- Text + Image (multimodal)
## How to Get Started

### Deployment with vLLM

```shell
vllm serve Qwen3-VL-Embedding-2B-FP8-DYNAMIC \
  --runner pooling \
  --convert embed \
  --hf-overrides '{"is_matryoshka": true}' \
  --quantization compressed-tensors
```
**Important parameters:**

- `--runner pooling`: enables embedding (pooling) mode
- `--convert embed`: extracts embeddings from model output
- `--quantization compressed-tensors`: loads the FP8 quantized weights
- `--hf-overrides '{"is_matryoshka": true}'`: enables Matryoshka embedding support
## Usage Examples

### Text Embeddings (OpenAI SDK)
```python
from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:8000/v1",
)

# Default 2048 dimensions
response = client.embeddings.create(
    model="Qwen3-VL-Embedding-2B-FP8-DYNAMIC",
    input="Your text here",
    encoding_format="float",
)
embedding = response.data[0].embedding  # list[float] with 2048 dims
```
### Matryoshka Embeddings (Custom Dimensions)

```python
# Request a smaller embedding for efficiency
response = client.embeddings.create(
    model="Qwen3-VL-Embedding-2B-FP8-DYNAMIC",
    input="Your text here",
    encoding_format="float",
    dimensions=512,  # e.g. 128, 256, 512, 1024, 2048
)
embedding = response.data[0].embedding  # list[float] with 512 dims
```
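If the server-side `dimensions` parameter is not available, Matryoshka embeddings can in principle be shortened client-side by truncating the leading components and re-normalizing. A minimal sketch with a synthetic vector (whether this exactly matches the server's output depends on vLLM's Matryoshka handling):

```python
import math

def truncate_embedding(vec, dim):
    """Keep the leading `dim` components and L2-normalize the result."""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

full = [0.1 * ((-1) ** i) for i in range(2048)]  # synthetic stand-in
small = truncate_embedding(full, 512)
print(len(small))                                # 512
print(round(sum(x * x for x in small), 6))       # 1.0 (unit norm)
```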
### Batch Processing

```python
texts = [
    "First document to embed",
    "Second document to embed",
    "Third document to embed",
]

response = client.embeddings.create(
    model="Qwen3-VL-Embedding-2B-FP8-DYNAMIC",
    input=texts,
    encoding_format="float",
    dimensions=256,
)

# Access each embedding
for i, data in enumerate(response.data):
    embedding = data.embedding
    print(f"Document {i + 1}: {len(embedding)} dimensions")
```
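Batch embeddings are typically consumed by a similarity search. A self-contained sketch with synthetic low-dimensional vectors (real usage would plug in the `data.embedding` values returned by the server):

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Synthetic 4-dim stand-ins for the 256-dim document embeddings.
docs = {
    "First document": [1.0, 0.0, 0.0, 0.0],
    "Second document": [0.9, 0.1, 0.0, 0.0],
    "Third document": [0.0, 1.0, 0.0, 0.0],
}
query = [1.0, 0.05, 0.0, 0.0]

ranked = sorted(docs, key=lambda k: cosine(query, docs[k]), reverse=True)
print(ranked[0])  # "First document" is closest to the query
```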
### Multimodal Embeddings (Text + Image)

For image and multimodal embeddings, use vLLM's custom message format:

```python
import base64

def encode_image(image_path):
    with open(image_path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

# Image only
messages = [{
    "role": "user",
    "content": [{
        "type": "image_url",
        "image_url": {"url": f"data:image/jpeg;base64,{encode_image('image.jpg')}"}
    }]
}]

# Text + image
messages = [{
    "role": "user",
    "content": [
        {"type": "text", "text": "Describe this image"},
        {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{encode_image('image.jpg')}"}}
    ]
}]

# Call the embeddings endpoint directly; the OpenAI SDK's
# embeddings.create() does not accept a messages payload
response = client.post(
    "/embeddings",
    body={
        "messages": messages,
        "model": "Qwen3-VL-Embedding-2B-FP8-DYNAMIC",
        "encoding_format": "float",
        "dimensions": 512
    },
    cast_to=object
)
embedding = response.data[0].embedding
```
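The example above hardcodes `image/jpeg` in the data URL. A small helper (illustrative; uses the stdlib `mimetypes` module) can pick the MIME type from the filename instead:

```python
import base64
import mimetypes

def to_data_url(image_path, raw_bytes=None):
    """Build a data URL with a MIME type guessed from the file extension."""
    mime, _ = mimetypes.guess_type(image_path)
    mime = mime or "application/octet-stream"  # fallback for unknown types
    if raw_bytes is None:
        with open(image_path, "rb") as f:
            raw_bytes = f.read()
    b64 = base64.b64encode(raw_bytes).decode("utf-8")
    return f"data:{mime};base64,{b64}"

# raw_bytes passed directly here so the sketch runs without a real file
print(to_data_url("photo.png", raw_bytes=b"\x89PNG")[:22])  # data:image/png;base64,
```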
## Technical Notes

### Why FP8_DYNAMIC?
FP8_DYNAMIC balances memory efficiency with quality by:
- Statically quantizing weights to FP8 per-channel
- Dynamically quantizing activations to FP8 per-token during inference
- Avoiding accuracy loss from static activation quantization
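The dynamic activation side can be illustrated in plain Python: per token, derive a scale from that token's absolute maximum, map values onto the FP8 E4M3 range (largest finite value 448), and dequantize. This is a conceptual sketch, not llmcompressor's actual kernel; real FP8 rounding also quantizes the mantissa non-uniformly, which integer rounding here only approximates:

```python
F8_MAX = 448.0  # largest finite value in FP8 E4M3

def fake_quant_token(activations):
    """Symmetric dynamic quantization of one token's activation vector."""
    scale = max(abs(x) for x in activations) / F8_MAX
    # Integer rounding within the FP8 range stands in for real FP8 rounding.
    q = [max(-F8_MAX, min(F8_MAX, round(x / scale))) for x in activations]
    return [v * scale for v in q], scale

token = [0.5, -1.25, 3.0, -0.01]
deq, scale = fake_quant_token(token)
print([round(v, 3) for v in deq])  # close to the original token values
```

Because the scale is recomputed per token at inference time, outlier tokens do not force a single global scale on everything, which is the accuracy advantage over static activation quantization.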
### Why Preserve Visual Layers?
Visual processing requires higher numerical precision. Keeping the 24 visual transformer blocks in bfloat16 ensures:
- High-fidelity image feature extraction
- Better cross-modal alignment
- Minimal impact on multimodal task performance
The visual layers account for a smaller portion of total parameters, making this a worthwhile trade-off.
## Reproducibility
To quantize the model yourself:
```python
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from transformers import AutoModel, AutoTokenizer

# Load base model
model = AutoModel.from_pretrained(
    "Qwen/Qwen3-VL-Embedding-2B",
    trust_remote_code=True,
    dtype="bfloat16",
)
tokenizer = AutoTokenizer.from_pretrained(
    "Qwen/Qwen3-VL-Embedding-2B",
    trust_remote_code=True,
)

# Configure quantization
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["re:visual.*"],  # preserve visual layers
)

# Apply quantization
oneshot(
    model=model,
    recipe=recipe,
    output_dir="Qwen3-VL-Embedding-2B-FP8-DYNAMIC",
)

# Save
model.save_pretrained("Qwen3-VL-Embedding-2B-FP8-DYNAMIC")
tokenizer.save_pretrained("Qwen3-VL-Embedding-2B-FP8-DYNAMIC")
```
After quantization, run the tensor key fixing script (included in model repository).
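For intuition about the weight side of the scheme, per-channel symmetric FP8 quantization derives one scale per output channel (one row of a Linear weight matrix) from that channel's absolute maximum. A toy sketch with illustrative values:

```python
F8_MAX = 448.0  # largest finite value in FP8 E4M3

def per_channel_scales(weight_rows):
    """One symmetric scale per output channel (row of the weight matrix)."""
    return [max(abs(w) for w in row) / F8_MAX for row in weight_rows]

# Toy 2x3 "Linear" weight for illustration.
W = [
    [0.02, -0.5, 0.1],
    [4.0, 0.3, -2.0],
]
scales = per_channel_scales(W)
print([round(s, 6) for s in scales])
```

Scaling each channel independently keeps a channel with small weights from losing precision to a neighboring channel with large ones, which is why per-channel usually beats per-tensor weight quantization.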
## License
This model inherits the Apache 2.0 license from the base Qwen3-VL-Embedding-2B model.
## Acknowledgments
- Base Model: Qwen Team for Qwen3-VL-Embedding-2B
- Quantization Framework: Neural Magic for llmcompressor
- Serving Engine: vLLM Team for high-performance inference
## Model Card Authors
This model card was created by the quantization author.
## Model Card Contact
For issues or questions about this quantized model, please open an issue on the model repository or contact via HuggingFace.