vLLM >= 0.14.0 Support + ReadMe change
README.md CHANGED
@@ -72,7 +72,7 @@ This is an **FP8 quantized** version of [Qwen/Qwen3-VL-Embedding-8B](https://hug
 
 ## Usage
 
-### With vLLM (Recommended)
+### With vLLM (>=0.14.0) (Recommended)
 
 ```python
 from vllm import LLM, EngineArgs
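The body of this snippet (README lines 79-115) is unchanged and collapsed in the diff; only the import line is visible. For context, a minimal sketch of offline embedding extraction with vLLM's pooling API. The `runner="pooling"` argument and the `llm.embed(...)` call are assumptions based on vLLM's documented API, not lines from this README:

```python
# Sketch only: the README's actual snippet body is collapsed in this diff.
# Parameter names below follow vLLM's pooling API and are assumptions.
from vllm import LLM

llm = LLM(
    model="RamManavalan/Qwen3-VL-Embedding-8B-FP8",
    runner="pooling",        # newer vLLM releases; older ones used task="embed"
    trust_remote_code=True,
)

# llm.embed returns one output per prompt; .outputs.embedding is the vector
outputs = llm.embed(["A photo of a cat", "A photo of a dog"])
embeddings = [o.outputs.embedding for o in outputs]
print(len(embeddings[0]))    # embedding dimensionality
```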
@@ -116,7 +116,7 @@ similarity = embeddings[0] @ embeddings[1]
 print(f"Similarity: {similarity:.4f}")
 ```
 
-### With vLLM Server
+### With vLLM (>=0.14.0) Server
 
 ```bash
 # Start the server
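The server-start command itself sits below this hunk's context, and the next hunk shows the README's `curl` call against the OpenAI-compatible `/v1/embeddings` route. A hedged Python equivalent of that call, assuming the server was started with the default host and port:

```python
# Equivalent of the README's curl call, assuming a vLLM server for
# RamManavalan/Qwen3-VL-Embedding-8B-FP8 is listening on localhost:8000.
import requests

resp = requests.post(
    "http://localhost:8000/v1/embeddings",
    json={
        "model": "RamManavalan/Qwen3-VL-Embedding-8B-FP8",
        "input": ["A photo of a cat", "A photo of a dog"],
    },
)
data = resp.json()["data"]
embeddings = [item["embedding"] for item in data]  # one vector per input
print(len(embeddings), len(embeddings[0]))
```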
@@ -132,11 +132,11 @@ curl http://localhost:8000/v1/embeddings \
 
 ```python
 import torch
-from transformers import AutoProcessor,
+from transformers import AutoProcessor, Qwen3VLForConditionalGeneration
 
-model =
+model = Qwen3VLForConditionalGeneration.from_pretrained(
     "RamManavalan/Qwen3-VL-Embedding-8B-FP8",
-
+    dtype=torch.bfloat16,
     trust_remote_code=True,
     device_map="auto",
 )
@@ -155,7 +155,7 @@ inputs = processor(text=[prompt], return_tensors="pt", padding=True).to(model.de
 
 # Get embedding (last-token pooling)
 with torch.no_grad():
-    outputs = model(**inputs)
+    outputs = model.model(**inputs, output_hidden_states=True)
     # Get the last non-padding token
     seq_len = inputs['attention_mask'].sum(dim=1) - 1
     embedding = outputs.last_hidden_state[0, seq_len[0]]
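The new line routes the forward pass through `model.model` (the transformer backbone), presumably because the top-level `Qwen3VLForConditionalGeneration` forward returns logits while the backbone output exposes `last_hidden_state`. A self-contained illustration of the last-token pooling index used above, with made-up tensors:

```python
# With right padding, attention_mask.sum(dim=1) - 1 is the position of the
# last real (non-padding) token in each sequence.
import torch

attention_mask = torch.tensor([[1, 1, 1, 0, 0],    # 3 real tokens
                               [1, 1, 1, 1, 1]])   # 5 real tokens
hidden = torch.randn(2, 5, 8)                      # (batch, seq, hidden)

seq_len = attention_mask.sum(dim=1) - 1            # tensor([2, 4])
pooled = hidden[torch.arange(hidden.size(0)), seq_len]
print(pooled.shape)                                # torch.Size([2, 8])
```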
@@ -200,12 +200,12 @@ The base model achieves state-of-the-art performance on multimodal benchmarks:
 This model was quantized using [llm-compressor](https://github.com/vllm-project/llm-compressor):
 
 ```python
-from transformers import
+from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
 from llmcompressor import oneshot
 from llmcompressor.modifiers.quantization import QuantizationModifier
 
 # Load model
-model =
+model = Qwen3VLForConditionalGeneration.from_pretrained(
     "Qwen/Qwen3-VL-Embedding-8B",
     torch_dtype=torch.bfloat16,
     trust_remote_code=True,
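The recipe and the `oneshot` call are collapsed below this hunk. For reviewers, a sketch of a typical llm-compressor FP8 recipe that would continue the loading snippet above; the exact `scheme`, `ignore` list, and output path are assumptions, not lines from this README:

```python
# Sketch: continues from the `model` loaded in the hunk above. The README's
# actual recipe is collapsed in this diff; FP8_DYNAMIC is the usual scheme
# for FP8 checkpoints like this one, and `ignore` is often extended with the
# vision tower for VL models.
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

recipe = QuantizationModifier(
    targets="Linear",          # quantize Linear layers...
    scheme="FP8_DYNAMIC",      # ...to FP8 weights with dynamic activations
    ignore=["lm_head"],        # keep the output head in higher precision
)

# Data-free oneshot pass: FP8_DYNAMIC needs no calibration set
oneshot(model=model, recipe=recipe)
model.save_pretrained("Qwen3-VL-Embedding-8B-FP8", save_compressed=True)
```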