Qwen3-VL-Embedding-2B-FP8-DYNAMIC

FP8 quantized version of Qwen3-VL-Embedding-2B optimized for efficient deployment with vLLM.

This model provides ~50% memory reduction compared to bfloat16 while maintaining high-quality multimodal embeddings through selective quantization.

Model Description

  • Base Model: Qwen/Qwen3-VL-Embedding-2B
  • Model Type: Multimodal Embedding Model (Text + Vision)
  • Quantization: FP8_DYNAMIC via llmcompressor
  • Architecture: Qwen3VLModel
  • Parameters: 2B
  • License: Apache 2.0

Key Features

  • Memory Efficient: ~50% VRAM reduction compared to bfloat16
  • Visual Quality Preserved: Full precision visual layers (24 blocks)
  • Matryoshka Embeddings: Supports flexible dimensions (from 64 to 2048)
  • OpenAI-Compatible API: Works with OpenAI SDK via vLLM

Quantization Details

Method: FP8_DYNAMIC

  • Weights: FP8 per-channel quantization (symmetric)
  • Activations: FP8 dynamic per-token quantization (symmetric)
  • Visual Layers: Kept in full precision (bfloat16) for quality
  • Target Layers: Linear layers only

Quantization Configuration:

QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=['re:visual.*']  # Preserve visual layer quality
)

Model Specifications

| Property | Value |
| --- | --- |
| Text Layers | 28 layers (FP8 quantized) |
| Visual Layers | 24 blocks (bfloat16, full precision) |
| Default Embedding Dim | 2048 |
| Matryoshka Dimensions | 64 to 2048 |
| Model Size | ~2.8 GB (vs ~4.6 GB bfloat16) |

Intended Uses

Supported Input Types

  • Text only
  • Image only
  • Text + Image (multimodal)

How to Get Started

Deployment with vLLM

vllm serve Qwen3-VL-Embedding-2B-FP8-DYNAMIC \
    --runner pooling \
    --convert embed \
    --hf-overrides '{"is_matryoshka": true}' \
    --quantization compressed-tensors

Important Parameters:

  • --runner pooling: Enables embedding mode
  • --convert embed: Extract embeddings from model output
  • --quantization compressed-tensors: Load FP8 quantized weights
  • --hf-overrides '{"is_matryoshka": true}': Enable Matryoshka embedding support

Usage Examples

Text Embeddings (OpenAI SDK)

from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://localhost:8000/v1"
)

# Default 2048 dimensions
response = client.embeddings.create(
    model="Qwen3-VL-Embedding-2B-FP8-DYNAMIC",
    input="Your text here",
    encoding_format="float"
)

embedding = response.data[0].embedding  # List[float] with 2048 dims

Matryoshka Embeddings (Custom Dimensions)

# Request smaller embedding for efficiency
response = client.embeddings.create(
    model="Qwen3-VL-Embedding-2B-FP8-DYNAMIC",
    input="Your text here",
    encoding_format="float",
    dimensions=512  # Any value from 64 to 2048 (e.g. 128, 256, 512, 1024)
)

embedding = response.data[0].embedding  # List[float] with 512 dims
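If a client cannot pass the dimensions parameter, Matryoshka embeddings can also be reduced client-side by truncating and L2-renormalizing. A minimal sketch in plain Python (the 8-dim vector is a toy stand-in for a real 2048-dim response):

```python
import math

def truncate_matryoshka(embedding, k):
    """Keep the first k components and L2-renormalize.

    Matryoshka-trained models pack the most important information into
    the leading dimensions, so the truncated vector remains usable."""
    head = embedding[:k]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

# Toy 8-dim "embedding" standing in for a real API response
full = [0.5, -0.3, 0.8, 0.1, -0.2, 0.4, 0.0, 0.6]
small = truncate_matryoshka(full, 4)

print(len(small))                           # 4
print(round(sum(x * x for x in small), 6))  # 1.0 (unit norm)
```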

Batch Processing

texts = [
    "First document to embed",
    "Second document to embed",
    "Third document to embed"
]

response = client.embeddings.create(
    model="Qwen3-VL-Embedding-2B-FP8-DYNAMIC",
    input=texts,
    encoding_format="float",
    dimensions=256
)

# Access each embedding
for i, data in enumerate(response.data):
    embedding = data.embedding
    print(f"Document {i+1}: {len(embedding)} dimensions")
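Once a batch of documents is embedded, retrieval is typically a cosine-similarity ranking against a query embedding. A self-contained sketch (the 3-dim vectors are toy stand-ins for real `response.data[i].embedding` values):

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy embeddings standing in for real API responses
docs = {
    "first":  [0.9, 0.1, 0.0],
    "second": [0.0, 1.0, 0.1],
    "third":  [0.1, 0.0, 0.9],
}
query = [1.0, 0.2, 0.0]

# Rank documents by similarity to the query, best first
ranked = sorted(docs, key=lambda name: cosine(query, docs[name]), reverse=True)
print(ranked[0])  # first
```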

Multimodal Embeddings (Text + Image)

For image and multimodal embeddings, use vLLM's custom message format:

import base64

def encode_image(image_path):
    with open(image_path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

# Image only
messages = [{
    "role": "user",
    "content": [{
        "type": "image_url",
        "image_url": {"url": f"data:image/jpeg;base64,{encode_image('image.jpg')}"}
    }]
}]

# Text + Image
messages = [{
    "role": "user",
    "content": [
        {"type": "text", "text": "Describe this image"},
        {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{encode_image('image.jpg')}"}}
    ]
}]

# Call embeddings endpoint
response = client.post(
    "/embeddings",
    body={
        "messages": messages,
        "model": "Qwen3-VL-Embedding-2B-FP8-DYNAMIC",
        "encoding_format": "float",
        "dimensions": 512
    },
    cast_to=object
)

# With cast_to=object the response is parsed JSON (a dict), not an SDK object
embedding = response["data"][0]["embedding"]

Technical Notes

Why FP8_DYNAMIC?

FP8_DYNAMIC balances memory efficiency with quality by:

  • Statically quantizing weights to FP8 per-channel
  • Dynamically quantizing activations to FP8 per-token during inference
  • Avoiding accuracy loss from static activation quantization
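The dynamic part can be illustrated without FP8 hardware: at inference time each token's activation row gets its own symmetric scale so its values fit the FP8 E4M3 range (max finite value ≈ 448). A rough sketch of that scaling step in plain Python; the actual cast to FP8 happens in the kernel and is only hinted at here:

```python
FP8_E4M3_MAX = 448.0  # largest finite value representable in E4M3

def per_token_scales(activations):
    """One symmetric scale per token row (dynamic quantization).

    Dividing a row by its scale maps it into [-448, 448]; the kernel
    then casts the scaled values to FP8 and keeps the scale around
    for dequantization."""
    scales = []
    for row in activations:
        amax = max(abs(v) for v in row)
        scales.append(amax / FP8_E4M3_MAX if amax > 0 else 1.0)
    return scales

# Two toy "token" rows with very different magnitudes
acts = [
    [0.01, -0.03, 0.02],    # small-magnitude token
    [120.0, -448.0, 64.0],  # large-magnitude token
]
scales = per_token_scales(acts)
scaled = [[v / s for v in row] for row, s in zip(acts, scales)]

# Every scaled value now fits the representable FP8 range
print(max(abs(v) for row in scaled for v in row))  # 448.0
```

Because the scales are recomputed per token at runtime, no calibration data is needed, which is exactly what avoids the accuracy loss of static activation quantization.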

Why Preserve Visual Layers?

Visual processing requires higher numerical precision. Keeping the 24 visual transformer blocks in bfloat16 ensures:

  • High-fidelity image feature extraction
  • Better cross-modal alignment
  • Minimal impact on multimodal task performance

The visual layers account for a smaller portion of total parameters, making this a worthwhile trade-off.
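The two size figures above are internally consistent, which lets us sketch the trade-off arithmetically. Solving text × 1 byte + visual × 2 bytes ≈ 2.8 GB and (text + visual) × 2 bytes ≈ 4.6 GB gives roughly 1.8B text-side and 0.5B visual parameters; these splits are inferred here for illustration, not taken from the model config:

```python
# Parameter split inferred from the size figures above -- illustrative only
text_params = 1.8e9    # FP8-quantized: 1 byte per parameter
visual_params = 0.5e9  # kept in bfloat16: 2 bytes per parameter

fp8_mixed_gb = (text_params * 1 + visual_params * 2) / 1e9
bf16_gb = (text_params + visual_params) * 2 / 1e9

print(f"mixed FP8/BF16: {fp8_mixed_gb:.1f} GB")  # 2.8 GB
print(f"pure BF16:      {bf16_gb:.1f} GB")       # 4.6 GB
```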

Reproducibility

To quantize the model yourself:

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from transformers import AutoModel, AutoTokenizer

# Load base model
model = AutoModel.from_pretrained(
    "Qwen/Qwen3-VL-Embedding-2B",
    trust_remote_code=True,
    dtype="bfloat16"
)
tokenizer = AutoTokenizer.from_pretrained(
    "Qwen/Qwen3-VL-Embedding-2B",
    trust_remote_code=True
)

# Configure quantization
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=['re:visual.*']  # Preserve visual layers
)

# Apply quantization
oneshot(
    model=model,
    recipe=recipe,
    output_dir="Qwen3-VL-Embedding-2B-FP8-DYNAMIC"
)

# Save
model.save_pretrained("Qwen3-VL-Embedding-2B-FP8-DYNAMIC")
tokenizer.save_pretrained("Qwen3-VL-Embedding-2B-FP8-DYNAMIC")

After quantization, run the tensor-key fixing script included in the model repository.

License

This model inherits the Apache 2.0 license from the base Qwen3-VL-Embedding-2B model.

Acknowledgments

  • Base Model: Qwen Team for Qwen3-VL-Embedding-2B
  • Quantization Framework: Neural Magic for llmcompressor
  • Serving Engine: vLLM Team for high-performance inference

Model Card Authors

This model card was created by the quantization author.

Model Card Contact

For issues or questions about this quantized model, please open an issue on the model repository or contact via HuggingFace.
