textgeflecht committed on
Commit 4221bd9 · verified · 1 Parent(s): 5b17ebd

Update README.md

Files changed (1):
  1. README.md +52 -62

README.md CHANGED
@@ -209,72 +209,62 @@ print("\nGenerated Output:\n", decoded_output)
  ```
 
  ### Using vLLM (for optimized FP8 inference)
- > [!NOTE]
- > The following vLLM inference example is provided as a general guideline based on `llm-compressor`'s intended compatibility with vLLM for FP8 models. However, at the time of writing, this specific quantized model (`textgeflecht/Devstral-Small-2505-FP8-llmcompressor`) with its `MistralTokenizer` (using `tekken.json`) **has not been explicitly tested with vLLM by the author.**
- >
- > Successfully running this model with vLLM might require:
- > * The latest version of vLLM with robust support for `compressed-tensors` format and custom tokenizers.
- > * Potential adjustments to tokenizer loading or prompt formatting to align with vLLM's expectations for models using `MistralTokenizer`.
- > * Consulting the [vLLM documentation](https://docs.vllm.ai/en/latest/) for the most up-to-date instructions on running quantized models and handling custom tokenizers.
- >
- > Users are encouraged to experiment and consult vLLM resources if they encounter issues.
- Ensure you have vllm installed (pip install vllm).
- This requires a compatible GPU (NVIDIA Ada/Hopper/Blackwell or newer architecture).
-
- ```python
- from vllm import LLM, SamplingParams
-
- # Define your model ID
- MODEL_REPO_ID = "textgeflecht/Devstral-Small-2505-FP8-llmcompressor"
-
- # Initialize vLLM with the model
- # vLLM should automatically detect the FP8 quantization if the format is compatible.
- # The 'quantization' parameter might be needed if auto-detection fails,
- # often models from llm-compressor are saved in 'compressed-tensors' format
- # which vLLM aims to support. You might need to specify tokenizer_mode='custom'
- # and point to the tekken.json or use a HuggingFace compatible tokenizer wrapper if needed.
-
- # For MistralTokenizer, vLLM might require manual handling or a specific setup.
- # If vLLM doesn't directly support MistralTokenizer from tekken.json,
- # you might need to use a HF AutoTokenizer equivalent IF one can be made compatible,
- # or wait for broader custom tokenizer support in vLLM.
-
- # Simplest case (assuming vLLM handles the compressed-tensors format and tokenizer):
- llm = LLM(model=MODEL_REPO_ID, trust_remote_code=True)
- # If tekken.json is present, vLLM might use it if it knows the tokenizer type.
- # Or, if a compatible tokenizer_config.json can be created for MistralTokenizer.
-
- # For now, a more robust way with vLLM and custom tokenizers like Mistral's
- # might involve more direct specification or ensuring the saved format is fully
- # vLLM auto-detectable. If issues arise, consulting vLLM documentation for custom/FP8 models is key.
-
- # --- Placeholder for vLLM with MistralTokenizer ---
- # Direct vLLM usage with MistralTokenizer from a tekken.json might be tricky
- # as vLLM typically relies on Hugging Face AutoTokenizer.
- # One approach could be to use the HF AutoModelForCausalLM loading mechanism
- # and then pass it to vLLM if supported, or ensure the saved format is directly consumable.
-
- # Let's assume for now a standard prompt for demonstration if vLLM picks up a compatible tokenizer.
- # If not, you'd use MistralTokenizer to prepare the prompt as in the transformers example.
-
- prompts = [
-     "[INST] Write a python function that calculates the factorial of a number. [/INST]",
-     # Add more prompts if needed
- ]
-
- sampling_params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=200)
-
- outputs = llm.generate(prompts, sampling_params)
-
- # Print the outputs
- for output in outputs:
-     prompt = output.prompt
-     generated_text = output.outputs[0].text
-     print(f"Prompt: {prompt!r}")
-     print(f"Generated text: {generated_text!r}\n---")
- ```
-
- Original Model Card (mistralai/Devstral-Small-2505)
+ This model, quantized to FP8 with `llm-compressor`, is designed for efficient inference with vLLM, especially on newer NVIDIA GPUs.
+
+ **Prerequisites:**
+ * A recent version of vLLM. The author's successful tests used a custom, very recent build of vLLM with specific patches for NVIDIA Blackwell FP8 support.
+ * A compatible NVIDIA GPU (Ada Lovelace, Hopper, Blackwell, or a newer architecture is recommended for FP8).
+ * Docker and the NVIDIA Container Toolkit installed.
+
+ **Running with Docker (Recommended & Tested by Author):**
+
+ The following Docker command starts a vLLM OpenAI-compatible server with this quantized model. This setup has been verified by the author to load the model successfully and serve requests.
+
+ ```bash
+ # 1. Set your Hugging Face token (optional, but recommended to avoid rate limits or for private models)
+ # export HUGGING_FACE_HUB_TOKEN="YOUR_HUGGINGFACE_ACCESS_TOKEN_HERE"
+
+ # 2. Run the vLLM Docker container.
+ #    Replace 'vllm/vllm-openai:latest' with your specific vLLM image if using a custom build.
+ #    The 'latest' tag should pull a recent official build from vLLM.
+ sudo docker run --gpus all \
+   -v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
+   -p 8000:8000 \
+   -e HUGGING_FACE_HUB_TOKEN="$HUGGING_FACE_HUB_TOKEN" \
+   vllm/vllm-openai:latest \
+   --model textgeflecht/Devstral-Small-2505-FP8-llmcompressor \
+   --tokenizer_mode mistral \
+   --load_format auto \
+   --max-model-len 2048  # Optional: limit VRAM usage
+
+ # Optional: add Mistral-specific tool usage flags if needed by your application:
+ #   --tool-call-parser mistral \
+ #   --enable-auto-tool-choice \
+ # Optional: explicitly set the tensor parallel size if you have multiple GPUs, e.g. on 2 GPUs:
+ #   --tensor-parallel-size 2
+ ```
+ **Key Command-Line Arguments Used:**
+
+ - `--model textgeflecht/Devstral-Small-2505-FP8-llmcompressor`: Specifies the quantized model from the Hugging Face Hub.
+ - `--tokenizer_mode mistral`: Essential for vLLM to correctly use the MistralTokenizer with the `tekken.json` file from the repository.
+ - `--load_format auto`: Allows vLLM to auto-detect the Hugging Face sharded safetensors format for the weights. With this, vLLM successfully reads the `config.json` (which includes a `quantization_config` with `quant_method: "compressed-tensors"`) and auto-detects the FP8 quantization scheme.
+ - `--max-model-len 2048`: Limits the maximum sequence length (input + output tokens combined) to manage VRAM. Adjust this value based on your needs and available GPU memory.
+ - The flags `--tool-call-parser mistral` and `--enable-auto-tool-choice` can be added if you intend to use Devstral's tool-calling capabilities.
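Once the container is up, a quick sanity check is to ask the server which models it is serving. The snippet below is a minimal sketch using only the Python standard library; it assumes the server from the Docker command above is reachable on `localhost:8000` (adjust the base URL for your deployment).

```python
# Health check for a vLLM OpenAI-compatible server: list the served model IDs.
# Assumes the server started above is reachable on localhost:8000.
import json
import urllib.request


def parse_model_ids(body):
    """Extract model IDs from a /v1/models response body (an OpenAI-style list)."""
    return [entry["id"] for entry in body.get("data", [])]


def list_served_models(base_url="http://localhost:8000/v1"):
    """Query the server's /v1/models endpoint and return the served model IDs."""
    with urllib.request.urlopen(f"{base_url}/models") as resp:
        return parse_model_ids(json.load(resp))


# Example (requires the running server); the quantized model should be listed:
# print(list_served_models())
```

If `textgeflecht/Devstral-Small-2505-FP8-llmcompressor` appears in the returned list, the model loaded successfully.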
 
 
 
+ **Note on FP8 Support (especially for newer architectures like Blackwell):**
+
+ - vLLM's support for FP8, particularly on the newest GPU architectures like NVIDIA Blackwell, is an area of active development.
+ - The successful tests for this model on Blackwell used a custom, very recent vLLM build with specific patches for Blackwell FP8 support.
+ - While standard `vllm/vllm-openai:latest` images are updated regularly, cutting-edge hardware support and specific quantization schemes might take time to be fully integrated and stabilized in official releases.
+ - If you encounter issues related to FP8 performance or compatibility on very new hardware with official vLLM builds, it's recommended to check the vLLM GitHub repository issues and discussions for the latest status, potential workarounds, or information on required builds.
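For intuition about what the FP8 (E4M3) scheme used here implies, the toy sketch below shows per-tensor scaling, the general idea behind FP8 weight quantization. This is not `llm-compressor`'s actual implementation; it only illustrates why a scale factor is stored alongside the FP8 weights (E4M3's largest finite magnitude is 448).

```python
# Toy illustration of per-tensor FP8 (E4M3) scaling. NOT llm-compressor's code:
# real FP8 also rounds mantissas to 3 bits; this sketch only scales and clamps.
FP8_E4M3_MAX = 448.0  # largest finite magnitude representable in E4M3


def per_tensor_scale(weights):
    """Scale that maps the largest |weight| onto the FP8 dynamic range."""
    amax = max(abs(w) for w in weights)
    return amax / FP8_E4M3_MAX if amax > 0 else 1.0


def quantize(weights):
    """Divide by the scale and clamp into the representable FP8 range."""
    scale = per_tensor_scale(weights)
    q = [max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, w / scale)) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Multiply back by the scale to recover approximate original values."""
    return [v * scale for v in q]
```

At inference time the kernel works on the small FP8 values and folds the stored scale back in, which is what makes FP8 both memory- and bandwidth-efficient.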
+
+ **Interacting with the Server:**
+
+ - Once the vLLM server is running, it exposes an OpenAI-compatible API.
+ - You can interact with it using any OpenAI client library (such as the `openai` package for Python) or tools like `curl`.
+ - Endpoint for chat completions: `http://localhost:8000/v1/chat/completions`
+ - Model name in requests: use `textgeflecht/Devstral-Small-2505-FP8-llmcompressor`
+
+ Refer to the Python requests example in the original model card at https://huggingface.co/mistralai/Devstral-Small-2505 for client-side interaction, adjusting the URL and model name as needed.
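As a concrete starting point, here is a minimal chat-completions client sketch using only the Python standard library. It assumes the vLLM server above is running on `localhost:8000`; adjust the endpoint and model name for your deployment.

```python
# Minimal OpenAI-compatible chat client for the vLLM server, stdlib only.
# Assumes the server from the Docker command above is running on localhost:8000.
import json
import urllib.request

MODEL_ID = "textgeflecht/Devstral-Small-2505-FP8-llmcompressor"
ENDPOINT = "http://localhost:8000/v1/chat/completions"


def build_payload(user_message, temperature=0.7, max_tokens=200):
    """Build an OpenAI-compatible chat-completions request body."""
    return {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": user_message}],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }


def chat(user_message):
    """POST a chat request and return the assistant's reply text."""
    req = urllib.request.Request(
        ENDPOINT,
        data=json.dumps(build_payload(user_message)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]


# Example (requires the running server):
# print(chat("Write a python function that calculates the factorial of a number."))
```

The `openai` client library works equally well; point its `base_url` at `http://localhost:8000/v1` and pass the model ID above.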
 
 
 
 
 
 
 
 
+
+ ### Original Model Card (mistralai/Devstral-Small-2505)
  For more details on the base model, please refer to the original model card: https://huggingface.co/mistralai/Devstral-Small-2505