# Qwen3-VL-Embedding-2B-FP8-DYNAMIC

FP8-quantized version of Qwen3-VL-Embedding-2B, optimized for efficient deployment with vLLM. Through selective quantization, this model cuts memory use by roughly 50% compared to bfloat16 while maintaining high-quality multimodal embeddings.
## Model Description
- Base Model: Qwen/Qwen3-VL-Embedding-2B
- Model Type: Multimodal Embedding Model (Text + Vision)
- Quantization: FP8_DYNAMIC via llmcompressor
- Architecture: Qwen3VLModel
- Parameters: 2B
- License: Apache 2.0
## Key Features
- Memory Efficient: ~50% VRAM reduction compared to bfloat16
- Visual Quality Preserved: Full precision visual layers (24 blocks)
- Matryoshka Embeddings: Supports flexible dimensions (from 64 to 2048)
- OpenAI-Compatible API: Works with OpenAI SDK via vLLM
## Quantization Details

**Method:** FP8_DYNAMIC
- Weights: FP8 per-channel quantization (symmetric)
- Activations: FP8 dynamic per-token quantization (symmetric)
- Visual Layers: Kept in full precision (bfloat16) for quality
- Target Layers: Linear layers only
Quantization configuration:

```python
QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["re:visual.*"],  # preserve visual layer quality
)
```
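As a quick sanity check, the `ignore` pattern can be exercised against representative module names. This assumes the `re:` prefix denotes a regex matched from the start of the module name; the names below are illustrative, not taken from the actual model graph:

```python
import re

# Pattern from the recipe above, with the "re:" prefix stripped.
pattern = re.compile(r"visual.*")

# Hypothetical module names for illustration only.
names = [
    "visual.blocks.0.attn.qkv",         # visual tower -> kept in bfloat16
    "visual.merger.mlp.0",              # visual tower -> kept in bfloat16
    "model.layers.0.self_attn.q_proj",  # text layer -> quantized to FP8
]

for name in names:
    kept = bool(pattern.match(name))
    print(f"{name}: {'skip (bf16)' if kept else 'quantize (FP8)'}")
```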
## Model Specifications
| Property | Value |
|---|---|
| Text Layers | 28 layers (FP8 quantized) |
| Visual Layers | 24 blocks (bfloat16, full precision) |
| Default Embedding Dim | 2048 |
| Matryoshka Dimensions | 64 - 2048 |
| Model Size | ~2.8 GB (vs ~4.6 GB bfloat16) |
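The size figures in the table are roughly consistent with back-of-the-envelope arithmetic: FP8 stores one byte per weight, bfloat16 two. The 1.6B/0.4B text/visual split below is an illustrative assumption, not an official figure:

```python
# Rough lower-bound size estimate; the parameter split is assumed.
text_params = 1.6e9    # quantized to FP8 -> 1 byte each
visual_params = 0.4e9  # kept in bfloat16 -> 2 bytes each

fp8_mixed_gb = (text_params * 1 + visual_params * 2) / 1e9
bf16_gb = (text_params + visual_params) * 2 / 1e9

print(f"FP8-mixed: ~{fp8_mixed_gb:.1f} GB, bf16: ~{bf16_gb:.1f} GB")
# The checkpoint also carries quantization scales and non-Linear tensors,
# so the real file (~2.8 GB) is somewhat larger than this lower bound.
```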
## Intended Uses

### Supported Input Types
- Text only
- Image only
- Text + Image (multimodal)
## How to Get Started

### Deployment with vLLM

```shell
vllm serve Qwen3-VL-Embedding-2B-FP8-DYNAMIC \
  --runner pooling \
  --convert embed \
  --hf-overrides '{"is_matryoshka": true}' \
  --quantization compressed-tensors
```
**Important parameters:**

- `--runner pooling`: enables embedding (pooling) mode
- `--convert embed`: extracts embeddings from model output
- `--quantization compressed-tensors`: loads the FP8 quantized weights
- `--hf-overrides '{"is_matryoshka": true}'`: enables Matryoshka embedding support
## Usage Examples

### Text Embeddings (OpenAI SDK)
```python
from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:8000/v1",
)

# Default 2048 dimensions
response = client.embeddings.create(
    model="Qwen3-VL-Embedding-2B-FP8-DYNAMIC",
    input="Your text here",
    encoding_format="float",
)
embedding = response.data[0].embedding  # list[float] with 2048 dims
```
### Matryoshka Embeddings (Custom Dimensions)

```python
# Request a smaller embedding for efficiency
response = client.embeddings.create(
    model="Qwen3-VL-Embedding-2B-FP8-DYNAMIC",
    input="Your text here",
    encoding_format="float",
    dimensions=512,  # e.g. 128, 256, 512, 1024, 2048
)
embedding = response.data[0].embedding  # list[float] with 512 dims
```
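If the server-side `dimensions` parameter is not available, Matryoshka embeddings can in principle be shortened client-side by truncating the leading components and re-normalizing. A minimal sketch with a synthetic vector (whether this exactly matches the server's output depends on vLLM's Matryoshka handling):

```python
import math

def truncate_embedding(vec, dim):
    """Keep the leading `dim` components and L2-normalize the result."""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

full = [0.1 * ((-1) ** i) for i in range(2048)]  # synthetic stand-in
small = truncate_embedding(full, 512)
print(len(small))                                # 512
print(round(sum(x * x for x in small), 6))       # 1.0 (unit norm)
```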
### Batch Processing

```python
texts = [
    "First document to embed",
    "Second document to embed",
    "Third document to embed",
]

response = client.embeddings.create(
    model="Qwen3-VL-Embedding-2B-FP8-DYNAMIC",
    input=texts,
    encoding_format="float",
    dimensions=256,
)

# Access each embedding
for i, data in enumerate(response.data):
    embedding = data.embedding
    print(f"Document {i + 1}: {len(embedding)} dimensions")
```
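Batch embeddings are typically consumed by a similarity search. A self-contained sketch with synthetic low-dimensional vectors (real usage would plug in the `data.embedding` values returned by the server):

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Synthetic 4-dim stand-ins for the 256-dim document embeddings.
docs = {
    "First document": [1.0, 0.0, 0.0, 0.0],
    "Second document": [0.9, 0.1, 0.0, 0.0],
    "Third document": [0.0, 1.0, 0.0, 0.0],
}
query = [1.0, 0.05, 0.0, 0.0]

ranked = sorted(docs, key=lambda k: cosine(query, docs[k]), reverse=True)
print(ranked[0])  # "First document" is closest to the query
```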
### Multimodal Embeddings (Text + Image)

For image and multimodal embeddings, use vLLM's custom message format:

```python
import base64

def encode_image(image_path):
    with open(image_path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

# Image only
messages = [{
    "role": "user",
    "content": [{
        "type": "image_url",
        "image_url": {"url": f"data:image/jpeg;base64,{encode_image('image.jpg')}"}
    }]
}]

# Text + image
messages = [{
    "role": "user",
    "content": [
        {"type": "text", "text": "Describe this image"},
        {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{encode_image('image.jpg')}"}}
    ]
}]

# Call the embeddings endpoint directly; the OpenAI SDK's
# embeddings.create() does not accept a messages payload
response = client.post(
    "/embeddings",
    body={
        "messages": messages,
        "model": "Qwen3-VL-Embedding-2B-FP8-DYNAMIC",
        "encoding_format": "float",
        "dimensions": 512
    },
    cast_to=object
)
embedding = response.data[0].embedding
```
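The example above hardcodes `image/jpeg` in the data URL. A small helper (illustrative; uses the stdlib `mimetypes` module) can pick the MIME type from the filename instead:

```python
import base64
import mimetypes

def to_data_url(image_path, raw_bytes=None):
    """Build a data URL with a MIME type guessed from the file extension."""
    mime, _ = mimetypes.guess_type(image_path)
    mime = mime or "application/octet-stream"  # fallback for unknown types
    if raw_bytes is None:
        with open(image_path, "rb") as f:
            raw_bytes = f.read()
    b64 = base64.b64encode(raw_bytes).decode("utf-8")
    return f"data:{mime};base64,{b64}"

# raw_bytes passed directly here so the sketch runs without a real file
print(to_data_url("photo.png", raw_bytes=b"\x89PNG")[:22])  # data:image/png;base64,
```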
## Technical Notes

### Why FP8_DYNAMIC?
FP8_DYNAMIC balances memory efficiency with quality by:
- Statically quantizing weights to FP8 per-channel
- Dynamically quantizing activations to FP8 per-token during inference
- Avoiding accuracy loss from static activation quantization
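The dynamic activation side can be illustrated in plain Python: per token, derive a scale from that token's absolute maximum, map values onto the FP8 E4M3 range (largest finite value 448), and dequantize. This is a conceptual sketch, not llmcompressor's actual kernel; real FP8 rounding also quantizes the mantissa non-uniformly, which integer rounding here only approximates:

```python
F8_MAX = 448.0  # largest finite value in FP8 E4M3

def fake_quant_token(activations):
    """Symmetric dynamic quantization of one token's activation vector."""
    scale = max(abs(x) for x in activations) / F8_MAX
    # Integer rounding within the FP8 range stands in for real FP8 rounding.
    q = [max(-F8_MAX, min(F8_MAX, round(x / scale))) for x in activations]
    return [v * scale for v in q], scale

token = [0.5, -1.25, 3.0, -0.01]
deq, scale = fake_quant_token(token)
print([round(v, 3) for v in deq])  # close to the original token values
```

Because the scale is recomputed per token at inference time, outlier tokens do not force a single global scale on everything, which is the accuracy advantage over static activation quantization.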
### Why Preserve Visual Layers?
Visual processing requires higher numerical precision. Keeping the 24 visual transformer blocks in bfloat16 ensures:
- High-fidelity image feature extraction
- Better cross-modal alignment
- Minimal impact on multimodal task performance
The visual layers account for a smaller portion of total parameters, making this a worthwhile trade-off.
## Reproducibility
To quantize the model yourself:
```python
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from transformers import AutoModel, AutoTokenizer

# Load base model
model = AutoModel.from_pretrained(
    "Qwen/Qwen3-VL-Embedding-2B",
    trust_remote_code=True,
    dtype="bfloat16",
)
tokenizer = AutoTokenizer.from_pretrained(
    "Qwen/Qwen3-VL-Embedding-2B",
    trust_remote_code=True,
)

# Configure quantization
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["re:visual.*"],  # preserve visual layers
)

# Apply quantization
oneshot(
    model=model,
    recipe=recipe,
    output_dir="Qwen3-VL-Embedding-2B-FP8-DYNAMIC",
)

# Save
model.save_pretrained("Qwen3-VL-Embedding-2B-FP8-DYNAMIC")
tokenizer.save_pretrained("Qwen3-VL-Embedding-2B-FP8-DYNAMIC")
```
After quantization, run the tensor key fixing script (included in model repository).
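For intuition about the weight side of the scheme, per-channel symmetric FP8 quantization derives one scale per output channel (one row of a Linear weight matrix) from that channel's absolute maximum. A toy sketch with illustrative values:

```python
F8_MAX = 448.0  # largest finite value in FP8 E4M3

def per_channel_scales(weight_rows):
    """One symmetric scale per output channel (row of the weight matrix)."""
    return [max(abs(w) for w in row) / F8_MAX for row in weight_rows]

# Toy 2x3 "Linear" weight for illustration.
W = [
    [0.02, -0.5, 0.1],
    [4.0, 0.3, -2.0],
]
scales = per_channel_scales(W)
print([round(s, 6) for s in scales])
```

Scaling each channel independently keeps a channel with small weights from losing precision to a neighboring channel with large ones, which is why per-channel usually beats per-tensor weight quantization.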
## License
This model inherits the Apache 2.0 license from the base Qwen3-VL-Embedding-2B model.
## Acknowledgments
- Base Model: Qwen Team for Qwen3-VL-Embedding-2B
- Quantization Framework: Neural Magic for llmcompressor
- Serving Engine: vLLM Team for high-performance inference
## Model Card Authors
This model card was created by the quantization author.
## Model Card Contact
For issues or questions about this quantized model, please open an issue on the model repository or contact via HuggingFace.