# Qwen3-VL-Embedding-8B-FP8

This is an FP8-quantized version of Qwen/Qwen3-VL-Embedding-8B, optimized for efficient inference with vLLM.

## Model Overview

| Attribute | Value |
|---|---|
| Base Model | Qwen/Qwen3-VL-Embedding-8B |
| Quantization | FP8 Dynamic (W8A8) |
| Original Size | ~16 GB (BF16) |
| Quantized Size | ~9 GB (FP8) |
| Memory Savings | ~45% |
| Embedding Dimension | 4096 |
| Supported Inputs | Text, images, videos, multimodal |
| Context Length | 32K tokens |
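
The size figures above follow from simple parameter arithmetic. A rough sketch of the estimate (illustrative numbers only; real usage adds KV cache, activations, and runtime overhead):

```python
# Back-of-the-envelope weight-memory estimate for an ~8B-parameter model,
# using 1 GB = 1e9 bytes.

def model_weight_gb(n_params: float, bytes_per_param: float) -> float:
    """Approximate weight memory in GB."""
    return n_params * bytes_per_param / 1e9

n_params = 8e9
print(f"BF16 weights: ~{model_weight_gb(n_params, 2.0):.0f} GB")  # 2 bytes/param -> ~16 GB
# FP8 weights take 1 byte/param; the checkpoint lands nearer ~9 GB because the
# vision encoder and embeddings stay in BF16 and per-channel scales add overhead.
print(f"FP8 weights:  ~{model_weight_gb(n_params, 1.0):.0f} GB")
```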

## Highlights

- **Multimodal Versatility:** handles text, images, screenshots, and video inputs
- **Efficient Inference:** ~45% memory reduction with minimal accuracy loss
- **vLLM Compatible:** works with vLLM's pooling runner for high-throughput embedding
- **No Calibration Required:** uses the FP8_DYNAMIC scheme (data-free quantization)

## Quantization Details

| Component | Precision | Notes |
|---|---|---|
| Vision Encoder (ViT) | BF16 | Preserved for accuracy |
| LLM Decoder Layers | FP8 | Quantized for efficiency |
| Embeddings | BF16 | Preserved |

- **Scheme:** FP8_DYNAMIC
  - **Weights:** FP8_E4M3 (per-channel quantization)
  - **Activations:** dynamic per-token quantization at runtime
- **Tool:** llm-compressor
- **Calibration:** none required (data-free quantization)
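
To make the scheme concrete, here is a minimal, framework-free sketch of per-channel weight scaling (it illustrates the idea only; it is not llm-compressor's implementation, and it skips the actual rounding onto the E4M3 grid):

```python
# Per-channel FP8 weight scaling, as in the FP8_DYNAMIC scheme: each output
# channel gets its own scale so the channel's max |w| maps onto the largest
# FP8 E4M3 value (448). Activations are scaled analogously per token at runtime.

E4M3_MAX = 448.0

def quantize_per_channel(weight_rows):
    """weight_rows: list of rows (one per output channel) -> (scaled rows, scales)."""
    q_rows, scales = [], []
    for row in weight_rows:
        scale = max(abs(w) for w in row) / E4M3_MAX
        scales.append(scale)
        # Real FP8 would also round each value onto the E4M3 grid here.
        q_rows.append([w / scale for w in row])
    return q_rows, scales

def dequantize(q_rows, scales):
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]

w = [[0.5, -1.0, 0.25], [2.0, 0.1, -0.4]]
q, s = quantize_per_channel(w)
w_hat = dequantize(q, s)  # recovers w up to float rounding (E4M3 rounding skipped)
```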

## Hardware Requirements

- **GPU:** NVIDIA GPU with FP8 support (compute capability >= 8.9)
  - Blackwell: RTX 5090, RTX 5080
  - Ada Lovelace: RTX 4090, RTX 4080
  - Hopper: H100, H200
- **VRAM:** ~10 GB minimum for inference
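
A quick way to check eligibility is the CUDA compute capability; on a live system the tuple comes from `torch.cuda.get_device_capability()`. A small sketch of the check itself:

```python
# Compute-capability gate for native FP8: Ada Lovelace is (8, 9), Hopper (9, 0),
# Blackwell (10, x) / (12, x); anything below (8, 9) (e.g. Ampere) lacks FP8.

def supports_fp8(capability):
    """capability: (major, minor) tuple, e.g. from torch.cuda.get_device_capability()."""
    return tuple(capability) >= (8, 9)

print(supports_fp8((8, 9)))   # RTX 4090 (Ada)    -> True
print(supports_fp8((8, 6)))   # RTX 3090 (Ampere) -> False
print(supports_fp8((9, 0)))   # H100 (Hopper)     -> True
```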

## Usage

### With vLLM >= 0.14.0 (Recommended)

```python
from vllm import LLM, EngineArgs
import numpy as np

# Initialize vLLM with pooling runner
engine_args = EngineArgs(
    model="RamManavalan/Qwen3-VL-Embedding-8B-FP8",
    runner="pooling",
    dtype="bfloat16",
    trust_remote_code=True,
)
llm = LLM(**vars(engine_args))

# Prepare inputs
tokenizer = llm.get_tokenizer()

def format_input(text, instruction="Represent the user's input."):
    conversation = [
        {"role": "system", "content": [{"type": "text", "text": instruction}]},
        {"role": "user", "content": [{"type": "text", "text": text}]}
    ]
    prompt = tokenizer.apply_chat_template(
        conversation, tokenize=False, add_generation_prompt=True
    )
    return {"prompt": prompt}

# Get embeddings
inputs = [
    format_input("A woman playing with her dog on the beach."),
    format_input("Machine learning for image classification."),
]
outputs = llm.embed(inputs)

# Extract embeddings
embeddings = np.array([o.outputs.embedding for o in outputs])
print(f"Embeddings shape: {embeddings.shape}")  # (2, 4096)

# Compute similarity (the embed pooler L2-normalizes outputs by default,
# so the dot product equals the cosine similarity)
similarity = embeddings[0] @ embeddings[1]
print(f"Similarity: {similarity:.4f}")
```
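
The dot product above equals cosine similarity only for unit-length vectors (vLLM's embed pooler normalizes by default). If you ever need to normalize yourself, a dependency-free sketch:

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def cosine_similarity(a, b):
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0 (identical direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal)
```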

### With the vLLM Server (>= 0.14.0)

```bash
# Start the server
vllm serve RamManavalan/Qwen3-VL-Embedding-8B-FP8 --task embed

# Query via API
curl http://localhost:8000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"input": "Your text here", "model": "RamManavalan/Qwen3-VL-Embedding-8B-FP8"}'
```
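
The server speaks the OpenAI-compatible Embeddings API, so any HTTP client works. A sketch of building the request body and pulling embeddings out of the response shape (the sample response below is illustrative):

```python
import json

MODEL = "RamManavalan/Qwen3-VL-Embedding-8B-FP8"

def build_request(texts):
    """JSON body for POST /v1/embeddings (OpenAI Embeddings API shape)."""
    return json.dumps({"input": texts, "model": MODEL})

def parse_response(body):
    """Return embeddings ordered by their 'index' field."""
    data = json.loads(body)["data"]
    return [item["embedding"] for item in sorted(data, key=lambda d: d["index"])]

# Illustrative response in the OpenAI Embeddings API shape:
sample = json.dumps({
    "object": "list",
    "data": [{"object": "embedding", "index": 0, "embedding": [0.1, 0.2, 0.3]}],
})
print(parse_response(sample))  # [[0.1, 0.2, 0.3]]
```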

### With Transformers

```python
import torch
from transformers import AutoProcessor, Qwen3VLForConditionalGeneration

model = Qwen3VLForConditionalGeneration.from_pretrained(
    "RamManavalan/Qwen3-VL-Embedding-8B-FP8",
    dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(
    "RamManavalan/Qwen3-VL-Embedding-8B-FP8",
    trust_remote_code=True,
)

# Prepare input
messages = [
    {"role": "system", "content": [{"type": "text", "text": "Represent the user's input."}]},
    {"role": "user", "content": [{"type": "text", "text": "Your text here"}]}
]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], return_tensors="pt", padding=True).to(model.device)

# Get embedding (last-token pooling)
with torch.no_grad():
    outputs = model.model(**inputs)
    # Index of the last non-padding token (assumes right padding)
    seq_len = inputs["attention_mask"].sum(dim=1) - 1
    embedding = outputs.last_hidden_state[0, seq_len[0]]
    embedding = torch.nn.functional.normalize(embedding, p=2, dim=-1)

print(f"Embedding shape: {embedding.shape}")  # (4096,)
```
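
The snippet above indexes a single sequence; with a right-padded batch, each row's last real token sits at `attention_mask.sum() - 1`. A framework-free sketch of that index computation:

```python
# Last-token pooling: for each right-padded sequence, the embedding is read
# at the last non-padding position, i.e. (number of mask == 1 tokens) - 1.

def last_token_indices(attention_mask):
    """attention_mask: list of 0/1 rows (right padding assumed)."""
    return [sum(row) - 1 for row in attention_mask]

mask = [
    [1, 1, 1, 0, 0],  # 3 real tokens -> index 2
    [1, 1, 1, 1, 1],  # full sequence -> index 4
]
print(last_token_indices(mask))  # [2, 4]
```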

### Using the Helper Class

This repository includes a helper class for easier embedding extraction:

```python
from scripts.qwen3_vl_embedding import Qwen3VLEmbedder

# Initialize
model = Qwen3VLEmbedder(model_name_or_path="RamManavalan/Qwen3-VL-Embedding-8B-FP8")

# Get embeddings for text, images, or multimodal inputs
inputs = [
    {"text": "A dog on the beach"},
    {"image": "path/to/image.jpg"},
    {"text": "What is in this image?", "image": "path/to/image.jpg"},
]
embeddings = model.process(inputs)
print(f"Embeddings shape: {embeddings.shape}")  # (3, 4096)
```

## Benchmark Results

The base model achieves state-of-the-art performance on multimodal benchmarks:

| Benchmark | Score |
|---|---|
| MMEB-V2 Overall | 77.9 |
| MMTEB Mean | 67.88 |

FP8 quantization typically preserves >95% of the original model's accuracy.

## Creation

This model was quantized using llm-compressor:

```python
import torch
from transformers import Qwen3VLForConditionalGeneration
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# Load model
model = Qwen3VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen3-VL-Embedding-8B",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)

# FP8 quantization recipe (data-free)
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=[
        "lm_head",
        r"re:model\.visual\..*",  # Keep vision encoder in BF16
    ]
)

# Apply quantization
oneshot(model=model, recipe=recipe)

# Save
model.save_pretrained("Qwen3-VL-Embedding-8B-FP8", save_compressed=True)
```
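
In the recipe, plain entries in `ignore` match module names exactly, while `re:`-prefixed entries are regexes. Here is a sketch of that matching logic (illustrative module names, not llm-compressor's actual code):

```python
import re

IGNORE = ["lm_head", r"re:model\.visual\..*"]

def is_ignored(module_name):
    """True if a module should stay in BF16 under the recipe's ignore list."""
    for pattern in IGNORE:
        if pattern.startswith("re:"):
            if re.fullmatch(pattern[3:], module_name):
                return True
        elif pattern == module_name:
            return True
    return False

print(is_ignored("lm_head"))                         # True  (kept in BF16)
print(is_ignored("model.visual.blocks.0.attn.qkv"))  # True  (vision encoder kept)
print(is_ignored("model.layers.0.mlp.gate_proj"))    # False (quantized to FP8)
```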

## Citation

If you use this model, please cite the original Qwen3-VL-Embedding paper:

```bibtex
@article{qwen3vlembedding,
  title={Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking},
  author={Li, Mingxin and Zhang, Yanzhao and Long, Dingkun and Chen, Keqin and Song, Sibo and Bai, Shuai and Yang, Zhibo and Xie, Pengjun and Yang, An and Liu, Dayiheng and Zhou, Jingren and Lin, Junyang},
  journal={arXiv preprint arXiv:2601.04720},
  year={2026}
}
```

## License

Apache 2.0 (same as base model)
