RamManavalan committed
Commit 022f670 · verified · Parent: 12ccf84

vLLM >= 0.14.0 support + README change

Files changed (1): README.md (+8 -8)
README.md CHANGED
@@ -72,7 +72,7 @@ This is an **FP8 quantized** version of [Qwen/Qwen3-VL-Embedding-8B](https://hug
 
 ## Usage
 
-### With vLLM (Recommended)
+### With vLLM (>=0.14.0) (Recommended)
 
 ```python
 from vllm import LLM, EngineArgs
@@ -116,7 +116,7 @@ similarity = embeddings[0] @ embeddings[1]
 print(f"Similarity: {similarity:.4f}")
 ```
 
-### With vLLM Server
+### With vLLM (>=0.14.0) Server
 
 ```bash
 # Start the server
@@ -132,11 +132,11 @@ curl http://localhost:8000/v1/embeddings \
 
 ```python
 import torch
-from transformers import AutoProcessor, AutoModel
+from transformers import AutoProcessor, Qwen3VLForConditionalGeneration
 
-model = AutoModel.from_pretrained(
+model = Qwen3VLForConditionalGeneration.from_pretrained(
     "RamManavalan/Qwen3-VL-Embedding-8B-FP8",
-    torch_dtype=torch.bfloat16,
+    dtype=torch.bfloat16,
     trust_remote_code=True,
     device_map="auto",
 )
@@ -155,7 +155,7 @@ inputs = processor(text=[prompt], return_tensors="pt", padding=True).to(model.de
 
 # Get embedding (last-token pooling)
 with torch.no_grad():
-    outputs = model(**inputs)
+    outputs = model.model(**inputs, output_hidden_states=True)
     # Get the last non-padding token
     seq_len = inputs['attention_mask'].sum(dim=1) - 1
     embedding = outputs.last_hidden_state[0, seq_len[0]]
@@ -200,12 +200,12 @@ The base model achieves state-of-the-art performance on multimodal benchmarks:
 This model was quantized using [llm-compressor](https://github.com/vllm-project/llm-compressor):
 
 ```python
-from transformers import AutoModel, AutoProcessor
+from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
 from llmcompressor import oneshot
 from llmcompressor.modifiers.quantization import QuantizationModifier
 
 # Load model
-model = AutoModel.from_pretrained(
+model = Qwen3VLForConditionalGeneration.from_pretrained(
     "Qwen/Qwen3-VL-Embedding-8B",
     torch_dtype=torch.bfloat16,
     trust_remote_code=True,
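
For reference, the updated Transformers usage from the hunks above assembles into one runnable snippet. This is a minimal sketch: the `AutoProcessor` loading line and the prompt text sit outside the changed hunks and are assumed here; the load call and the pooling loop come from the `+` lines.

```python
import torch
from transformers import AutoProcessor, Qwen3VLForConditionalGeneration

# Load call as changed by this commit (torch_dtype renamed to dtype).
model = Qwen3VLForConditionalGeneration.from_pretrained(
    "RamManavalan/Qwen3-VL-Embedding-8B-FP8",
    dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)
# Assumed: the README loads the processor from the same repo.
processor = AutoProcessor.from_pretrained(
    "RamManavalan/Qwen3-VL-Embedding-8B-FP8", trust_remote_code=True
)

prompt = "What is the capital of China?"  # placeholder query, not from the diff
inputs = processor(text=[prompt], return_tensors="pt", padding=True).to(model.device)

# Last-token pooling: call the inner backbone (model.model) so the forward
# pass returns hidden states, then take the last non-padding position.
with torch.no_grad():
    outputs = model.model(**inputs, output_hidden_states=True)
    seq_len = inputs["attention_mask"].sum(dim=1) - 1
    embedding = outputs.last_hidden_state[0, seq_len[0]]
```

The switch from `model(**inputs)` to `model.model(**inputs, output_hidden_states=True)` matters because `Qwen3VLForConditionalGeneration.forward` returns LM logits rather than a `last_hidden_state`; the inner base model exposes the hidden states that the pooling code reads.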
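
The last hunk swaps the llm-compressor imports, but the diff is truncated before the recipe itself. For context, a one-shot FP8 pass with llm-compressor typically looks like the sketch below; the `FP8_DYNAMIC` scheme, the `ignore` list, and the compressed save follow llm-compressor's published examples and are assumptions, not taken from this card.

```python
import torch
from transformers import Qwen3VLForConditionalGeneration
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# Load the base model in bf16, as in the hunk above.
model = Qwen3VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen3-VL-Embedding-8B",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)

# FP8_DYNAMIC: weights quantized ahead of time, activations on the fly,
# so no calibration data is needed. Skipping the LM head and (assumed)
# the vision tower is a common quality-preserving choice for VL models.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["lm_head", "re:visual.*"],
)

oneshot(model=model, recipe=recipe)
model.save_pretrained("Qwen3-VL-Embedding-8B-FP8", save_compressed=True)
```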