File size: 12,243 Bytes
a688c92 d5e53c2 a688c92 d5e53c2 a688c92 86f8409 a688c92 5a1abd7 a688c92 5a1abd7 a688c92 5a1abd7 a688c92 5a1abd7 a688c92 5a1abd7 a688c92 5a1abd7 a688c92 5a1abd7 a688c92 5a1abd7 a688c92 d0f2451 e3c75eb a688c92 d0f2451 a688c92 b9a95b5 a688c92 d0f2451 a688c92 d0f2451 a688c92 b9a95b5 4221bd9 b9a95b5 4221bd9 a688c92 4221bd9 a688c92 4221bd9 a688c92 4221bd9 a688c92 4221bd9 d5e53c2 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 | ---
license: apache-2.0
tags:
- mistral
- devstral
- fp8
- quantization
- llm-compressor
- vllm
pipeline_tag: text-generation
base_model:
- mistralai/Devstral-Small-2505
---
# Devstral-Small-2505-FP8-Dynamic
This is a version of [mistralai/Devstral-Small-2505](https://huggingface.co/mistralai/Devstral-Small-2505) quantized to FP8 (weights and dynamic activations) using [llm-compressor](https://github.com/vllm-project/llm-compressor).
This model format is particularly useful for accelerated inference with [vLLM](https://github.com/vllm-project/vllm) on NVIDIA GPUs with compute capability > 8.9 (Ada Lovelace, Hopper, Blackwell or newer).
## Model Description
Devstral is a cutting-edge, versatile language model developed by Mistral AI, fine-tuned for development tasks. This version has been quantized to FP8 precision for weights (static, per-channel) and activations (dynamic, per-token), with the `lm_head` layer kept in its original precision.
## Quantization with llm-compressor
The model was quantized using the `oneshot` method from `llm-compressor` with the `FP8_DYNAMIC` scheme.
No calibration dataset was required for this quantization scheme.
The following script was used for conversion:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from mistral_common.protocol.instruct.messages import UserMessage
from mistral_common.protocol.instruct.request import ChatCompletionRequest
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from huggingface_hub import hf_hub_download
import shutil
import os
MODEL_ID = "mistralai/Devstral-Small-2505"
# Load model.
model = AutoModelForCausalLM.from_pretrained(
MODEL_ID, device_map="auto", torch_dtype="auto"
)
#tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
tekken_file = hf_hub_download(repo_id=MODEL_ID, filename="tekken.json")
tokenizer = MistralTokenizer.from_file(tekken_file)
# Configure the quantization algorithm and scheme.
# In this case, we:
# * quantize the weights to fp8 with per channel via ptq
# * quantize the activations to fp8 with dynamic per token
recipe = QuantizationModifier(
targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"]
)
# Apply quantization.
oneshot(model=model, recipe=recipe, tokenizer=tokenizer)
# Confirm generations of the quantized model look sane.
print("========== SAMPLE GENERATION ==============")
prompt = "The capital of France is"
# Create a ChatCompletionRequest with a single UserMessage
chat_request = ChatCompletionRequest(
messages=[
UserMessage(content=prompt)
]
)
# Encode the request using the tokenizer's specific method
tokenized_payload = tokenizer.encode_chat_completion(chat_request)
# The actual token IDs are usually in an attribute like '.tokens'
encoded_prompt_ids = tokenized_payload.tokens
# Convert to a PyTorch tensor and move to the model's device.
input_ids = torch.tensor([encoded_prompt_ids], device=model.device)
# Generate output
output = model.generate(input_ids, max_new_tokens=20)
# Decode the output
# The output from model.generate includes the input_ids.
# To get only the newly generated tokens, you might need to slice it:
# generated_token_ids = output[0][len(encoded_prompt_ids):]
# However, for a simple check, decoding the whole thing is often fine initially.
# If the output includes the prompt, that's expected.
print(tokenizer.decode(output[0].tolist()))
print("==========================================")
# Save to disk in compressed-tensors format.
SAVE_DIR = MODEL_ID.split("/")[1] + "-FP8-Dynamic"
model.save_pretrained(SAVE_DIR) # This saves the quantized model
# --- Correct way to "save" the MistralTokenizer ---
# Ensure the save directory exists
if not os.path.exists(SAVE_DIR):
os.makedirs(SAVE_DIR)
# Define the destination path for tekken.json within your SAVE_DIR
destination_tekken_file = os.path.join(SAVE_DIR, "tekken.json")
# Copy the tekken.json file from its original download location to your SAVE_DIR
shutil.copyfile(tekken_file, destination_tekken_file)
# --- End of tokenizer saving ---
print(f"Model saved to {SAVE_DIR}")
print(f"Tokenizer file (tekken.json) copied to {destination_tekken_file}")
```
## Inference Example
This model can be loaded and run with transformers and mistral-common, or for optimized FP8 inference, with vLLM.
### Using transformers and mistral-common (for functional checking, not FP8 optimized)
> [!NOTE]
> The following inference code block has been functionally verified.
> The example was successfully executed within the following Docker container environment on a system with Nvidia RTX 5090 GPU:
>
> ```bash
> # 1. Set your Hugging Face Token
> export HF_TOKEN="YOUR_HUGGINGFACE_ACCESS_TOKEN_HERE"
>
> # 2. Run the Triton Server container with GPU access and necessary privileges
> sudo docker run --gpus all -it --rm \
> --shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864 \
> -e HF_TOKEN=$HF_TOKEN --net host \
> nvcr.io/nvidia/tritonserver:25.04-trtllm-python-py3
> ```
> Inside the container, the Python script was run after installing necessary packages:
> ```bash
> pip install torch transformers huggingface_hub mistral-common`
> python your_inference_script.py
> ```
```python
import torch
from transformers import AutoModelForCausalLM
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from mistral_common.protocol.instruct.messages import UserMessage, SystemMessage
from mistral_common.protocol.instruct.request import ChatCompletionRequest
from huggingface_hub import hf_hub_download
MODEL_REPO_ID = "textgeflecht/Devstral-Small-2505-FP8-llmcompressor"
# Load model
# For FP8 inference, specific inference engines like vLLM are needed.
# Transformers will load the weights but might not run them in true FP8.
# device_map="auto" and torch_dtype="auto" are good starting points.
model = AutoModelForCausalLM.from_pretrained(
MODEL_REPO_ID,
device_map="auto",
torch_dtype="auto"
)
# Load tokenizer from the tekken.json file within the repo
tekken_file = hf_hub_download(repo_id=MODEL_REPO_ID, filename="tekken.json")
tokenizer = MistralTokenizer.from_file(tekken_file)
# (Optional) Load System Prompt if your model uses one and it's in the repo
# try:
# system_prompt_file = hf_hub_download(repo_id=MODEL_REPO_ID, filename="SYSTEM_PROMPT.txt")
# with open(system_prompt_file, "r") as f:
# SYSTEM_PROMPT = f.read()
# except Exception: # pylint: disable=broad-except
# SYSTEM_PROMPT = "You are a helpful coding assistant."
# dev specific example:
# prompt = "Write a python function that calculates the factorial of a number."
# quick example:
prompt = "What is the capital of France?"
messages = [
# SystemMessage(content=SYSTEM_PROMPT), # Uncomment if using a system prompt
UserMessage(content=prompt)
]
# Create ChatCompletionRequest
chat_request = ChatCompletionRequest(messages=messages)
# Encode the request
tokenized_payload = tokenizer.encode_chat_completion(chat_request)
input_ids = torch.tensor([tokenized_payload.tokens], device=model.device)
attention_mask = torch.ones_like(input_ids)
# Generate output
# Note: Setting pad_token_id is common for open-ended generation to prevent warnings.
# For Mistral/Devstral models using this tokenizer, the End-Of-Sentence (EOS) token ID is 2,
# which is suitable for use as pad_token_id in this context.
output = model.generate(
input_ids,
attention_mask=attention_mask,
max_new_tokens=200,
pad_token_id=2, # EOS token ID used as PAD token ID
do_sample=True, # Add sampling parameters for more diverse outputs
top_p=0.9,
temperature=0.7
)
# Decode only the generated tokens
generated_tokens = output[0][len(tokenized_payload.tokens):]
decoded_output = tokenizer.decode(generated_tokens.tolist())
print("Original Prompt:\n", prompt)
print("\nGenerated Output:\n", decoded_output)
```
### Using vLLM (for optimized FP8 inference)
This model, quantized to FP8 with `llm-compressor`, is designed for efficient inference with vLLM, especially on newer NVIDIA GPUs.
**Prerequisites:**
* A recent version of vLLM. The author's successful tests used a custom, very recent build of vLLM with specific patches for NVIDIA Blackwell FP8 support.
* A compatible NVIDIA GPU (Ada Lovelace, Hopper, Blackwell, or newer architectures are recommended for FP8).
* Docker and NVIDIA Container Toolkit installed.
**Running with Docker (Recommended & Tested by Author):**
The following Docker command starts a vLLM OpenAI-compatible server with this quantized model. This setup has been verified by the author to load the model successfully and serve requests.
```bash
# 1. Set your Hugging Face Token (optional, but recommended to avoid rate limits or for private models)
# export HF_TOKEN="YOUR_HUGGINGFACE_ACCESS_TOKEN_HERE"
# 2. Run the vLLM Docker container.
# Replace 'vllm/vllm-openai:latest' with your specific vLLM image if using a custom build.
# The 'latest' tag should pull a recent official build from vLLM.
sudo docker run --gpus all \
-v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
-p 8000:8000 \
-e HUGGING_FACE_HUB_TOKEN="$HUGGING_FACE_HUB_TOKEN" \
vllm/vllm-openai:latest \
--model textgeflecht/Devstral-Small-2505-FP8-llmcompressor \
--tokenizer_mode mistral \
--load_format auto \
--max-model-len 2048 # Optional: Limit VRAM usage
# Optional: Add Mistral-specific tool usage flags if needed by your application.
# --tool-call-parser mistral \
# --enable-auto-tool-choice \
# Optional: Explicitly set tensor parallel size if you have multiple GPUs e.g. on 2 GPUs:
# --tensor-parallel-size 2
```
Key Command-Line Arguments Used:
- `--model textgeflecht/Devstral-Small-2505-FP8-llmcompressor`: Specifies the quantized model from the Hugging Face Hub.
- `--tokenizer_mode mistral`: Essential for vLLM to correctly use the MistralTokenizer with the `tekken.json` file from the repository.
- `--load_format auto`: Allows vLLM to auto-detect the Hugging Face sharded safetensors format for weights. With this, vLLM successfully reads the `config.json` (which includes `quantization_config` with `quant_method: "compressed-tensors"`) and auto-detects the FP8 quantization scheme.
- `--max-model-len 2048`: Limits the maximum sequence length (input + output tokens combined) to manage VRAM. Adjust this value based on your needs and available GPU memory.
- The flags `--tool-call-parser mistral` and `--enable-auto-tool-choice` can be added if you intend to use Devstral's tool-calling capabilities.
Note on FP8 Support (especially for newer architectures like Blackwell):
- vLLM's support for FP8, particularly on the newest GPU architectures like NVIDIA Blackwell, is an area of active development.
- The successful tests for this model on Blackwell used a custom, very recent vLLM build with specific patches for Blackwell FP8 support.
- While standard `vllm/vllm-openai:latest` images are updated regularly, cutting-edge hardware support and specific quantization schemes might take time to be fully integrated and stabilized in official releases.
- If you encounter issues related to FP8 performance or compatibility on very new hardware with official vLLM builds, it's recommended to check the vLLM GitHub repository issues and discussions for the latest status, potential workarounds, or information on required builds.
Interacting with the Server:
- Once the vLLM server is running, it exposes an OpenAI-compatible API.
- You can interact with it using any OpenAI client library (like openai for Python) or tools like curl.
- Endpoint for chat completions: `http://localhost:8000/v1/chat/completions`
- Model name in requests: Use `textgeflecht/Devstral-Small-2505-FP8-llmcompressor`
Refer to the Python requests example in the original https://huggingface.co/mistralai/Devstral-Small-2505 model card for client-side interaction, adjusting the URL and model name as needed.
### Original Model Card (mistralai/Devstral-Small-2505)
For more details on the base model, please refer to the original model card: https://huggingface.co/mistralai/Devstral-Small-2505 |