Update README.md

README.md (changed)

@@ -209,72 +209,62 @@ print("\nGenerated Output:\n", decoded_output)
```

### Using vLLM (for optimized FP8 inference)
|
| 212 |
-
> [!NOTE]
|
| 213 |
-
> The following vLLM inference example is provided as a general guideline based on `llm-compressor`'s intended compatibility with vLLM for FP8 models. However, at the time of writing, this specific quantized model (`textgeflecht/Devstral-Small-2505-FP8-llmcompressor`) with its `MistralTokenizer` (using `tekken.json`) **has not been explicitly tested with vLLM by the author.**
|
| 214 |
-
>
|
| 215 |
-
> Successfully running this model with vLLM might require:
|
| 216 |
-
> * The latest version of vLLM with robust support for `compressed-tensors` format and custom tokenizers.
|
| 217 |
-
> * Potential adjustments to tokenizer loading or prompt formatting to align with vLLM's expectations for models using `MistralTokenizer`.
|
| 218 |
-
> * Consulting the [vLLM documentation](https://docs.vllm.ai/en/latest/) for the most up-to-date instructions on running quantized models and handling custom tokenizers.
|
| 219 |
-
>
|
| 220 |
-
> Users are encouraged to experiment and consult vLLM resources if they encounter issues.
|
| 221 |
**Removed (old, untested Python example):**

```python
from vllm import LLM

# Define your model ID
MODEL_REPO_ID = "textgeflecht/Devstral-Small-2505-FP8-llmcompressor"

# vLLM typically relies on a Hugging Face AutoTokenizer. For MistralTokenizer,
# vLLM might require manual handling or a specific setup: if vLLM doesn't
# directly support MistralTokenizer from tekken.json, you might need a
# HF-compatible tokenizer wrapper, or wait for broader custom tokenizer
# support in vLLM.

# Simplest case (assuming vLLM handles the compressed-tensors format and tokenizer):
llm = LLM(model=MODEL_REPO_ID, trust_remote_code=True)
# If tekken.json is present, vLLM might use it if it knows the tokenizer type,
# or a compatible tokenizer_config.json could be created for MistralTokenizer.
# If issues arise, consulting the vLLM documentation for custom/FP8 models is key.

# Assuming vLLM picks up a compatible tokenizer, a standard prompt works for
# demonstration. If not, use MistralTokenizer to prepare the prompt as in the
# transformers example above.
prompts = [
    "[INST] Write a python function that calculates the factorial of a number. [/INST]",
    # Add more prompts if needed
]

# Generate with default sampling parameters
outputs = llm.generate(prompts)

# Print the outputs
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}")
    print(f"Generated text: {generated_text!r}\n---")
```
This model, quantized to FP8 with `llm-compressor`, is designed for efficient inference with vLLM, especially on newer NVIDIA GPUs.

**Prerequisites:**

* A recent version of vLLM. The author's successful tests used a custom, very recent build of vLLM with specific patches for NVIDIA Blackwell FP8 support.
* A compatible NVIDIA GPU (Ada Lovelace, Hopper, Blackwell, or newer architectures are recommended for FP8).
* Docker and the NVIDIA Container Toolkit installed.
**Running with Docker (Recommended & Tested by Author):**

The following Docker command starts a vLLM OpenAI-compatible server with this quantized model. This setup has been verified by the author to load the model successfully and serve requests.

```bash
# 1. Set your Hugging Face token (optional, but recommended to avoid rate
#    limits or for private models):
# export HUGGING_FACE_HUB_TOKEN="YOUR_HUGGINGFACE_ACCESS_TOKEN_HERE"

# 2. Run the vLLM Docker container.
#    Replace 'vllm/vllm-openai:latest' with your specific vLLM image if using
#    a custom build; the 'latest' tag should pull a recent official build.
sudo docker run --gpus all \
  -v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
  -p 8000:8000 \
  -e HUGGING_FACE_HUB_TOKEN="$HUGGING_FACE_HUB_TOKEN" \
  vllm/vllm-openai:latest \
  --model textgeflecht/Devstral-Small-2505-FP8-llmcompressor \
  --tokenizer_mode mistral \
  --load_format auto \
  --max-model-len 2048  # Optional: limit VRAM usage

# Optional: add Mistral-specific tool-usage flags if needed by your application:
#   --tool-call-parser mistral \
#   --enable-auto-tool-choice
# Optional: explicitly set the tensor parallel size on multiple GPUs, e.g. with 2 GPUs:
#   --tensor-parallel-size 2
```
**Key Command-Line Arguments Used:**

- `--model textgeflecht/Devstral-Small-2505-FP8-llmcompressor`: Specifies the quantized model from the Hugging Face Hub.
- `--tokenizer_mode mistral`: Essential for vLLM to correctly use the MistralTokenizer with the `tekken.json` file from the repository.
- `--load_format auto`: Lets vLLM auto-detect the Hugging Face sharded safetensors format for the weights. With this, vLLM reads the `config.json` (which includes a `quantization_config` with `quant_method: "compressed-tensors"`) and auto-detects the FP8 quantization scheme.
- `--max-model-len 2048`: Limits the maximum sequence length (input and output tokens combined) to manage VRAM. Adjust this value based on your needs and available GPU memory.
- The flags `--tool-call-parser mistral` and `--enable-auto-tool-choice` can be added if you intend to use Devstral's tool-calling capabilities.
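As a concrete sketch of a request against the server configured above (the prompt and `max_tokens` value are illustrative, not from the original card), the chat-completions payload can be validated locally before sending; the `curl` call itself is commented out because it requires the server to be running on localhost:8000.

```shell
# Sketch of a chat-completions request for the vLLM server started above.
# Only the local JSON validation runs here; the curl call (commented out)
# assumes the server is up on localhost:8000.
PAYLOAD='{
  "model": "textgeflecht/Devstral-Small-2505-FP8-llmcompressor",
  "messages": [
    {"role": "user",
     "content": "Write a python function that calculates the factorial of a number."}
  ],
  "max_tokens": 256
}'

# Check that the payload is well-formed JSON:
echo "$PAYLOAD" | python3 -m json.tool

# Send it to the running server:
# curl -s http://localhost:8000/v1/chat/completions \
#   -H "Content-Type: application/json" \
#   -d "$PAYLOAD"
```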
**Note on FP8 Support (especially for newer architectures like Blackwell):**

- vLLM's support for FP8, particularly on the newest GPU architectures such as NVIDIA Blackwell, is an area of active development.
- The successful tests of this model on Blackwell used a custom, very recent vLLM build with specific patches for Blackwell FP8 support.
- While the standard `vllm/vllm-openai:latest` images are updated regularly, cutting-edge hardware support and specific quantization schemes may take time to be fully integrated and stabilized in official releases.
- If you encounter FP8 performance or compatibility issues on very new hardware with official vLLM builds, check the vLLM GitHub repository's issues and discussions for the latest status, potential workarounds, or information on required builds.
**Interacting with the Server:**

- Once the vLLM server is running, it exposes an OpenAI-compatible API.
- You can interact with it using any OpenAI client library (such as `openai` for Python) or tools like `curl`.
- Endpoint for chat completions: `http://localhost:8000/v1/chat/completions`
- Model name in requests: `textgeflecht/Devstral-Small-2505-FP8-llmcompressor`

Refer to the Python `requests` example in the original model card (https://huggingface.co/mistralai/Devstral-Small-2505) for client-side interaction, adjusting the URL and model name as needed.
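A minimal client-side sketch using only the Python standard library is shown below; the endpoint and model name match the server setup above, while the `build_chat_request` and `query_server` helpers are illustrative names introduced here, not part of any library.

```python
# Stdlib client sketch for the vLLM OpenAI-compatible server started above.
# build_chat_request() and query_server() are illustrative helpers; the HTTP
# call assumes the server is running on localhost:8000.
import json
import urllib.request

VLLM_URL = "http://localhost:8000/v1/chat/completions"
MODEL_ID = "textgeflecht/Devstral-Small-2505-FP8-llmcompressor"


def build_chat_request(user_message: str, max_tokens: int = 256) -> dict:
    """Build an OpenAI-compatible chat-completions payload."""
    return {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": max_tokens,
    }


def query_server(payload: dict) -> str:
    """POST the payload to the vLLM server and return the generated text."""
    req = urllib.request.Request(
        VLLM_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    # The generated text is in the first choice's message content.
    return body["choices"][0]["message"]["content"]


# With the Docker server from above running, you would call:
# print(query_server(build_chat_request(
#     "Write a python function that calculates the factorial of a number.")))
```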
### Original Model Card (mistralai/Devstral-Small-2505)

For more details on the base model, please refer to the original model card: https://huggingface.co/mistralai/Devstral-Small-2505