---
datasets:
- HuggingFaceH4/ultrachat_200k
base_model:
- mistralai/Mistral-Small-24B-Instruct-2501
pipeline_tag: text-generation
tags:
- mistral
- quantization
- vllm
- nvfp4
license: apache-2.0
---

# Model Card for ealexeev/Mistral-Small-24B-NVFP4

## Model Description

This is a compressed version of Mistral Small 24B Instruct, quantized to **NVFP4** using `llm-compressor`.

**This model is optimized for NVIDIA Hopper/Blackwell GPUs (H100, B200) and vLLM.**

## Benchmarks (Run on DGX Spark)

| Metric | Base Model (FP16) | This Model (NVFP4) | Delta |
| :--- | :--- | :--- | :--- |
| **HellaSwag (commonsense reasoning)** | 83.47% | 83.20% | -0.27% |
| **IFEval (Strict)** | 71.46% | 70.50% | -0.96% |
| **Throughput** | 712 tok/s | 1,344 tok/s | **+1.88x** |

## Usage

**⚠️ This model requires vLLM. It will NOT work with GGUF/llama.cpp/Ollama.**

### vLLM Command

```bash
vllm serve ealexeev/Mistral-Small-24B-NVFP4 \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.8 \
  --enforce-eager
```
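
### Querying the Server

Once `vllm serve` is running, the model is reachable over vLLM's OpenAI-compatible HTTP API. The sketch below uses only the Python standard library; the port (`8000`) is vLLM's default, and the model name matches the identifier passed to `vllm serve` above. The helper names (`build_chat_request`, `chat`) and sampling parameters are illustrative, not part of any official client.

```python
import json
import urllib.request


def build_chat_request(prompt: str,
                       model: str = "ealexeev/Mistral-Small-24B-NVFP4") -> dict:
    """Build an OpenAI-style chat completion payload for the vLLM server."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        # Illustrative sampling settings; tune for your workload.
        "max_tokens": 256,
        "temperature": 0.7,
    }


def chat(prompt: str, base_url: str = "http://localhost:8000/v1") -> str:
    """POST the prompt to the running vLLM server and return the reply text."""
    payload = json.dumps(build_chat_request(prompt)).encode()
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Requires the `vllm serve` command above to be running locally:
# reply = chat("Summarize NVFP4 quantization in one sentence.")
```

The official `openai` Python package works the same way: point its `base_url` at `http://localhost:8000/v1` and pass any placeholder API key.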