# Qwen3-VL-Embedding-8B-FP8

Paper: *Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking* (arXiv:2601.04720)
This is an FP8-quantized version of [Qwen/Qwen3-VL-Embedding-8B](https://huggingface.co/Qwen/Qwen3-VL-Embedding-8B), optimized for efficient inference with vLLM.

## Model Details
| Attribute | Value |
|---|---|
| Base Model | Qwen/Qwen3-VL-Embedding-8B |
| Quantization | FP8 Dynamic (W8A8) |
| Original Size | ~16 GB (BF16) |
| Quantized Size | ~9 GB (FP8) |
| Memory Savings | ~45% |
| Embedding Dimension | 4096 |
| Supported Inputs | Text, Images, Videos, Multimodal |
| Context Length | 32K tokens |
## Quantization Scheme

| Component | Precision | Notes |
|---|---|---|
| Vision Encoder (ViT) | BF16 | Preserved for accuracy |
| LLM Decoder Layers | FP8 | Quantized for efficiency |
| Embeddings | BF16 | Preserved |
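The size figures in the tables above are mutually consistent: at 2 bytes per parameter for BF16 and 1 byte for FP8, a ~16 GB → ~9 GB reduction implies that roughly 1B of the ~8B parameters (the vision encoder plus embeddings) stay in BF16. A back-of-envelope sketch (the 8B/1B split is an inference from the stated sizes, not an official figure):

```python
# Rough size estimate: BF16 = 2 bytes/param, FP8 = 1 byte/param.
# Assumed split: ~1B params kept in BF16 (vision encoder + embeddings),
# ~7B decoder params quantized to FP8. Counts are illustrative.
TOTAL_PARAMS = 8e9
BF16_PARAMS = 1e9  # assumption inferred from the stated checkpoint sizes

bf16_size_gb = TOTAL_PARAMS * 2 / 1e9  # full BF16 checkpoint
fp8_size_gb = BF16_PARAMS * 2 / 1e9 + (TOTAL_PARAMS - BF16_PARAMS) * 1 / 1e9
savings = 1 - fp8_size_gb / bf16_size_gb

print(f"BF16: ~{bf16_size_gb:.0f} GB, mixed FP8: ~{fp8_size_gb:.0f} GB, "
      f"savings: ~{savings:.0%}")
```

This lands on ~16 GB, ~9 GB, and ~44% savings, matching the "~45%" in the table.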
## Usage with vLLM (Python API)

```python
from vllm import LLM, EngineArgs
import numpy as np

# Initialize vLLM with the pooling runner
engine_args = EngineArgs(
    model="RamManavalan/Qwen3-VL-Embedding-8B-FP8",
    runner="pooling",
    dtype="bfloat16",
    trust_remote_code=True,
)
llm = LLM(**vars(engine_args))

# Prepare inputs
tokenizer = llm.get_tokenizer()

def format_input(text, instruction="Represent the user's input."):
    conversation = [
        {"role": "system", "content": [{"type": "text", "text": instruction}]},
        {"role": "user", "content": [{"type": "text", "text": text}]},
    ]
    prompt = tokenizer.apply_chat_template(
        conversation, tokenize=False, add_generation_prompt=True
    )
    return {"prompt": prompt}

# Get embeddings
inputs = [
    format_input("A woman playing with her dog on the beach."),
    format_input("Machine learning for image classification."),
]
outputs = llm.embed(inputs)

# Extract embeddings
embeddings = np.array([o.outputs.embedding for o in outputs])
print(f"Embeddings shape: {embeddings.shape}")  # (2, 4096)

# Compute similarity
similarity = embeddings[0] @ embeddings[1]
print(f"Similarity: {similarity:.4f}")
```
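The dot product above equals cosine similarity only when the vectors are unit-norm; vLLM's embedding pooler normally L2-normalizes its outputs, but a defensive variant that normalizes explicitly costs little (a minimal sketch, independent of the model):

```python
import numpy as np

def cosine_similarity(a, b) -> float:
    """Cosine similarity that does not assume pre-normalized embeddings."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0 (identical direction)
print(cosine_similarity([2.0, 0.0], [0.0, 3.0]))  # 0.0 (orthogonal)
```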
## Usage with vLLM (OpenAI-Compatible Server)

```bash
# Start the server
vllm serve RamManavalan/Qwen3-VL-Embedding-8B-FP8 --task embed

# Query via the API
curl http://localhost:8000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"input": "Your text here", "model": "RamManavalan/Qwen3-VL-Embedding-8B-FP8"}'
```
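The same endpoint can be called from Python without extra dependencies. A minimal stdlib sketch, assuming the server above is running on `localhost:8000` and returns the standard OpenAI embeddings response shape (`embed_request`/`parse_embeddings` are illustrative helper names, not part of any API):

```python
import json
import urllib.request

ENDPOINT = "http://localhost:8000/v1/embeddings"
MODEL = "RamManavalan/Qwen3-VL-Embedding-8B-FP8"

def embed_request(texts):
    """Build the JSON body for the OpenAI-compatible /v1/embeddings endpoint."""
    return json.dumps({"input": texts, "model": MODEL}).encode()

def parse_embeddings(response_json):
    """Extract vectors from an OpenAI-style response, in input order."""
    data = sorted(response_json["data"], key=lambda d: d["index"])
    return [d["embedding"] for d in data]

def embed(texts):
    """POST the texts to the running server and return their embeddings."""
    req = urllib.request.Request(
        ENDPOINT,
        data=embed_request(texts),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return parse_embeddings(json.load(resp))
```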
## Usage with Transformers

```python
import torch
from transformers import AutoProcessor, Qwen3VLForConditionalGeneration

model = Qwen3VLForConditionalGeneration.from_pretrained(
    "RamManavalan/Qwen3-VL-Embedding-8B-FP8",
    dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(
    "RamManavalan/Qwen3-VL-Embedding-8B-FP8",
    trust_remote_code=True,
)

# Prepare input
messages = [
    {"role": "system", "content": [{"type": "text", "text": "Represent the user's input."}]},
    {"role": "user", "content": [{"type": "text", "text": "Your text here"}]},
]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], return_tensors="pt", padding=True).to(model.device)

# Run the backbone (no LM head) to get hidden states
with torch.no_grad():
    outputs = model.model(**inputs, output_hidden_states=True)

# Last-token pooling: take the hidden state at the last non-padding token
seq_len = inputs["attention_mask"].sum(dim=1) - 1
embedding = outputs.last_hidden_state[0, seq_len[0]]
embedding = torch.nn.functional.normalize(embedding, p=2, dim=-1)
print(f"Embedding shape: {embedding.shape}")  # (4096,)
```
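The pooling above handles a single sequence; for a batch with padding, gather each row's last non-padding position before normalizing. A framework-agnostic sketch in NumPy (the same indexing works on torch tensors; it assumes right-padding, so check your tokenizer's `padding_side`):

```python
import numpy as np

def last_token_pool(hidden_states, attention_mask):
    """Last-token pooling for a right-padded batch.

    hidden_states: (B, T, D) final hidden states.
    attention_mask: (B, T) with 1 for real tokens, 0 for padding.
    Returns (B, D) L2-normalized embeddings.
    """
    last_idx = attention_mask.sum(axis=1) - 1  # index of last real token per row
    pooled = hidden_states[np.arange(hidden_states.shape[0]), last_idx]
    return pooled / np.linalg.norm(pooled, axis=-1, keepdims=True)

# Toy batch: two sequences of length 3; the second is padded after 2 tokens
h = np.arange(2 * 3 * 4, dtype=np.float64).reshape(2, 3, 4)
mask = np.array([[1, 1, 1], [1, 1, 0]])
emb = last_token_pool(h, mask)
print(emb.shape)  # (2, 4)
```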
## Helper Script

This repository includes a helper class for easier embedding extraction:

```python
from scripts.qwen3_vl_embedding import Qwen3VLEmbedder

# Initialize
model = Qwen3VLEmbedder(model_name_or_path="RamManavalan/Qwen3-VL-Embedding-8B-FP8")

# Get embeddings for text, image, or multimodal inputs
inputs = [
    {"text": "A dog on the beach"},
    {"image": "path/to/image.jpg"},
    {"text": "What is in this image?", "image": "path/to/image.jpg"},
]
embeddings = model.process(inputs)
print(f"Embeddings shape: {embeddings.shape}")  # (3, 4096)
```
## Performance

The base model achieves state-of-the-art performance on multimodal benchmarks:
| Benchmark | Score |
|---|---|
| MMEB-V2 Overall | 77.9 |
| MMTEB Mean | 67.88 |
FP8 quantization typically preserves >95% of the original model's accuracy.
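One way to check this claim on your own data is to embed the same inputs with both the BF16 and FP8 checkpoints and compare the resulting vectors row by row. A small sketch of the comparison itself (the `ref`/`quant` matrices below are simulated stand-ins for embeddings produced by the two models):

```python
import numpy as np

def mean_cosine_agreement(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Mean pairwise cosine similarity between matched rows of two (N, D)
    embedding matrices, e.g. BF16 vs FP8 outputs for the same texts."""
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    return float((a * b).sum(axis=1).mean())

# Simulated check: treat FP8 embeddings as slightly perturbed BF16 embeddings
rng = np.random.default_rng(0)
ref = rng.normal(size=(8, 4096))      # stand-in for BF16 embeddings
quant = ref + 0.01 * rng.normal(size=ref.shape)  # stand-in for FP8 embeddings
print(f"mean cosine agreement: {mean_cosine_agreement(ref, quant):.4f}")
```

A value close to 1.0 indicates the quantized model produces nearly the same embedding geometry as the original.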
## Quantization Recipe

This model was quantized using [llm-compressor](https://github.com/vllm-project/llm-compressor):

```python
import torch
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# Load model
model = Qwen3VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen3-VL-Embedding-8B",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)

# FP8 quantization recipe (data-free)
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=[
        "lm_head",
        r"re:model\.visual\..*",  # Keep the vision encoder in BF16
    ],
)

# Apply quantization
oneshot(model=model, recipe=recipe)

# Save in compressed format
model.save_pretrained("Qwen3-VL-Embedding-8B-FP8", save_compressed=True)
```
## Citation

If you use this model, please cite the original Qwen3-VL-Embedding paper:

```bibtex
@article{qwen3vlembedding,
  title={Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking},
  author={Li, Mingxin and Zhang, Yanzhao and Long, Dingkun and Chen, Keqin and Song, Sibo and Bai, Shuai and Yang, Zhibo and Xie, Pengjun and Yang, An and Liu, Dayiheng and Zhou, Jingren and Lin, Junyang},
  journal={arXiv preprint arXiv:2601.04720},
  year={2026}
}
```
## License

Apache 2.0 (same as the base model).