vLLM >= 0.14.0 Support + ReadMe change
README.md CHANGED
@@ -72,7 +72,7 @@ This is an **FP8 quantized** version of [Qwen/Qwen3-VL-Embedding-8B](https://hug
 
 ## Usage
 
-### With vLLM (Recommended)
+### With vLLM (>=0.14.0) (Recommended)
 
 ```python
 from vllm import LLM, EngineArgs
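The body of this snippet (README lines 79-115) is unchanged and collapsed in the diff; only the import line is visible. For context, a minimal sketch of offline embedding extraction with vLLM's pooling API. The `runner="pooling"` argument and the `llm.embed(...)` call are assumptions based on vLLM's documented API, not lines from this README:

```python
# Sketch only: the README's actual snippet body is collapsed in this diff.
# Parameter names below follow vLLM's pooling API and are assumptions.
from vllm import LLM

llm = LLM(
    model="RamManavalan/Qwen3-VL-Embedding-8B-FP8",
    runner="pooling",        # newer vLLM releases; older ones used task="embed"
    trust_remote_code=True,
)

# llm.embed returns one output per prompt; .outputs.embedding is the vector
outputs = llm.embed(["A photo of a cat", "A photo of a dog"])
embeddings = [o.outputs.embedding for o in outputs]
print(len(embeddings[0]))    # embedding dimensionality
```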
@@ -116,7 +116,7 @@ similarity = embeddings[0] @ embeddings[1]
 print(f"Similarity: {similarity:.4f}")
 ```
 
-### With vLLM Server
+### With vLLM (>=0.14.0) Server
 
 ```bash
 # Start the server
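The server-start command itself sits below this hunk's context, and the next hunk shows the README's `curl` call against the OpenAI-compatible `/v1/embeddings` route. A hedged Python equivalent of that call, assuming the server was started with the default host and port:

```python
# Equivalent of the README's curl call, assuming a vLLM server for
# RamManavalan/Qwen3-VL-Embedding-8B-FP8 is listening on localhost:8000.
import requests

resp = requests.post(
    "http://localhost:8000/v1/embeddings",
    json={
        "model": "RamManavalan/Qwen3-VL-Embedding-8B-FP8",
        "input": ["A photo of a cat", "A photo of a dog"],
    },
)
data = resp.json()["data"]
embeddings = [item["embedding"] for item in data]  # one vector per input
print(len(embeddings), len(embeddings[0]))
```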
@@ -132,11 +132,11 @@ curl http://localhost:8000/v1/embeddings \
 
 ```python
 import torch
-from transformers import AutoProcessor,
+from transformers import AutoProcessor, Qwen3VLForConditionalGeneration
 
-model =
+model = Qwen3VLForConditionalGeneration.from_pretrained(
     "RamManavalan/Qwen3-VL-Embedding-8B-FP8",
-
+    dtype=torch.bfloat16,
     trust_remote_code=True,
     device_map="auto",
 )
@@ -155,7 +155,7 @@ inputs = processor(text=[prompt], return_tensors="pt", padding=True).to(model.de
 
 # Get embedding (last-token pooling)
 with torch.no_grad():
-    outputs = model(**inputs)
+    outputs = model.model(**inputs, output_hidden_states=True)
     # Get the last non-padding token
     seq_len = inputs['attention_mask'].sum(dim=1) - 1
     embedding = outputs.last_hidden_state[0, seq_len[0]]
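The new line routes the forward pass through `model.model` (the transformer backbone), presumably because the top-level `Qwen3VLForConditionalGeneration` forward returns logits while the backbone output exposes `last_hidden_state`. A self-contained illustration of the last-token pooling index used above, with made-up tensors:

```python
# With right padding, attention_mask.sum(dim=1) - 1 is the position of the
# last real (non-padding) token in each sequence.
import torch

attention_mask = torch.tensor([[1, 1, 1, 0, 0],    # 3 real tokens
                               [1, 1, 1, 1, 1]])   # 5 real tokens
hidden = torch.randn(2, 5, 8)                      # (batch, seq, hidden)

seq_len = attention_mask.sum(dim=1) - 1            # tensor([2, 4])
pooled = hidden[torch.arange(hidden.size(0)), seq_len]
print(pooled.shape)                                # torch.Size([2, 8])
```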
@@ -200,12 +200,12 @@ The base model achieves state-of-the-art performance on multimodal benchmarks:
 This model was quantized using [llm-compressor](https://github.com/vllm-project/llm-compressor):
 
 ```python
-from transformers import
+from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
 from llmcompressor import oneshot
 from llmcompressor.modifiers.quantization import QuantizationModifier
 
 # Load model
-model =
+model = Qwen3VLForConditionalGeneration.from_pretrained(
     "Qwen/Qwen3-VL-Embedding-8B",
     torch_dtype=torch.bfloat16,
     trust_remote_code=True,
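The recipe and the `oneshot` call are collapsed below this hunk. For reviewers, a sketch of a typical llm-compressor FP8 recipe that would continue the loading snippet above; the exact `scheme`, `ignore` list, and output path are assumptions, not lines from this README:

```python
# Sketch: continues from the `model` loaded in the hunk above. The README's
# actual recipe is collapsed in this diff; FP8_DYNAMIC is the usual scheme
# for FP8 checkpoints like this one, and `ignore` is often extended with the
# vision tower for VL models.
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

recipe = QuantizationModifier(
    targets="Linear",          # quantize Linear layers...
    scheme="FP8_DYNAMIC",      # ...to FP8 weights with dynamic activations
    ignore=["lm_head"],        # keep the output head in higher precision
)

# Data-free oneshot pass: FP8_DYNAMIC needs no calibration set
oneshot(model=model, recipe=recipe)
model.save_pretrained("Qwen3-VL-Embedding-8B-FP8", save_compressed=True)
```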