---
datasets:
- HuggingFaceH4/ultrachat_200k
base_model:
- mistralai/Mistral-Small-24B-Instruct-2501
pipeline_tag: text-generation
tags:
- mistral
- quantization
- vllm
- nvfp4
license: apache-2.0
---
# Model Card for ealexeev/Mistral-Small-24B-NVFP4

## Model Description

This is a compressed version of Mistral Small 24B Instruct, quantized to **NVFP4** (NVIDIA's 4-bit floating-point format) using `llm-compressor`.

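The conversion itself is not part of this card. For reference, a one-shot NVFP4 recipe for `llm-compressor` looks roughly like the following — a sketch under assumed defaults (the `targets` and `ignore` values are assumptions, not the exact recipe used for this checkpoint):

```yaml
# Sketch of an llm-compressor one-shot quantization recipe (assumed values).
quant_stage:
  quant_modifiers:
    QuantizationModifier:
      targets: ["Linear"]   # quantize the linear layers
      scheme: "NVFP4"       # NVIDIA 4-bit floating-point format
      ignore: ["lm_head"]   # keep the output head in higher precision
```

Such a recipe is applied with `llm-compressor`'s `oneshot` entry point together with a calibration dataset (the frontmatter lists `HuggingFaceH4/ultrachat_200k`).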
**This model is optimized for NVIDIA Blackwell/Hopper GPUs (H100, B200) and vLLM.**

## Benchmarks (Run on DGX Spark)

| Metric | Base Model (FP16) | This Model (NVFP4) | Delta |
| :--- | :--- | :--- | :--- |
| **HellaSwag (Logic)** | 83.47% | 83.20% | -0.27% |
| **IFEval (Strict)** | 71.46% | 70.50% | -0.96% |
| **Throughput** | 712 tok/s | 1344 tok/s | **+1.88x** |
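The delta column follows directly from the raw numbers; a quick arithmetic sanity check (no model required):

```python
# Sanity-check the benchmark deltas reported in the table above.
fp16 = {"hellaswag": 83.47, "ifeval": 71.46, "throughput": 712}
nvfp4 = {"hellaswag": 83.20, "ifeval": 70.50, "throughput": 1344}

# Accuracy deltas are absolute percentage-point differences.
acc_deltas = {k: round(nvfp4[k] - fp16[k], 2) for k in ("hellaswag", "ifeval")}

# Throughput is reported as a speedup ratio (~1.888, shown truncated as +1.88x).
speedup = nvfp4["throughput"] / fp16["throughput"]

print(acc_deltas)  # {'hellaswag': -0.27, 'ifeval': -0.96}
```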
## Usage

**⚠️ This model requires vLLM. It will NOT work with GGUF/llama.cpp/Ollama.**

### vLLM Command

```bash
vllm serve ealexeev/Mistral-Small-24B-NVFP4 \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.8 \
  --enforce-eager
```
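
Once the server is up, any OpenAI-compatible client can talk to it. A minimal stdlib-only sketch — the host and port below are vLLM's defaults (assumptions; adjust if you pass `--host`/`--port`):

```python
import json
import urllib.request

# vLLM's OpenAI-compatible server defaults (assumptions; adjust as needed).
BASE_URL = "http://localhost:8000/v1/chat/completions"
MODEL = "ealexeev/Mistral-Small-24B-NVFP4"


def build_request(prompt: str, max_tokens: int = 128) -> dict:
    """Build an OpenAI-style chat-completion payload."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def query(prompt: str) -> str:
    """POST the prompt to the running vLLM server and return the reply text."""
    req = urllib.request.Request(
        BASE_URL,
        data=json.dumps(build_request(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]


# Example (requires the server started by the command above):
# print(query("Summarize NVFP4 quantization in one sentence."))
```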