Add config_format and load_format to vLLM args
#5
by mgoin - opened

README.md CHANGED

@@ -338,7 +338,7 @@ We recommend to use Pixtral-Large-Instruct-2411 in a server/client setting.
 1. Spin up a server:
 
 ```
-vllm serve mistralai/Pixtral-Large-Instruct-2411 --tokenizer_mode mistral --limit_mm_per_prompt 'image=10' --tensor-parallel-size 8
+vllm serve mistralai/Pixtral-Large-Instruct-2411 --config-format mistral --load-format mistral --tokenizer_mode mistral --limit_mm_per_prompt 'image=10' --tensor-parallel-size 8
 ```
 
 2. And ping the client:
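Step 2 ("ping the client") sits outside this hunk, so the request side isn't shown in the diff. A minimal sketch of that step against the server started above, assuming vLLM's OpenAI-compatible endpoint on the default local port 8000 and a placeholder image URL:

```python
# Sketch only: assumes the server from step 1 is running at localhost:8000;
# the image URL and prompt are placeholders, not from the README.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="mistralai/Pixtral-Large-Instruct-2411",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in one sentence."},
                # placeholder image URL
                {"type": "image_url", "image_url": {"url": "https://example.com/image.png"}},
            ],
        }
    ],
    max_tokens=512,
)
print(response.choices[0].message.content)
```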
@@ -523,7 +523,7 @@ messages = [
 sampling_params = SamplingParams(max_tokens=512)
 
 # note that running this model on GPU requires over 300 GB of GPU RAM
-llm = LLM(model=model_name, tokenizer_mode="mistral", tensor_parallel_size=8, limit_mm_per_prompt={"image": 4})
+llm = LLM(model=model_name, config_format="mistral", load_format="mistral", tokenizer_mode="mistral", tensor_parallel_size=8, limit_mm_per_prompt={"image": 4})
 
 outputs = llm.chat(messages, sampling_params=sampling_params)
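The second hunk's context starts at `messages = [`, which the diff doesn't show in full. A self-contained sketch of the offline path with the new `config_format`/`load_format` arguments, using a placeholder prompt and image URL in place of the README's surrounding code:

```python
# Sketch of the offline-inference path; the messages content is a placeholder,
# only the LLM(...) arguments mirror the README change in this PR.
from vllm import LLM, SamplingParams

model_name = "mistralai/Pixtral-Large-Instruct-2411"

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image in one sentence."},
            # placeholder image URL
            {"type": "image_url", "image_url": {"url": "https://example.com/image.png"}},
        ],
    }
]

sampling_params = SamplingParams(max_tokens=512)

# note that running this model on GPU requires over 300 GB of GPU RAM
llm = LLM(
    model=model_name,
    config_format="mistral",
    load_format="mistral",
    tokenizer_mode="mistral",
    tensor_parallel_size=8,
    limit_mm_per_prompt={"image": 4},
)

outputs = llm.chat(messages, sampling_params=sampling_params)
print(outputs[0].outputs[0].text)
```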