---
datasets:
- HuggingFaceH4/ultrachat_200k
base_model:
- mistralai/Mistral-Small-24B-Instruct-2501
pipeline_tag: text-generation
tags:
- mistral
- quantization
- vllm
- nvfp4
license: apache-2.0
---
# Model Card for ealexeev/Mistral-Small-24B-NVFP4

## Model Description

This is a compressed version of [Mistral Small 24B Instruct](https://huggingface.co/mistralai/Mistral-Small-24B-Instruct-2501), quantized to NVFP4 using [llm-compressor](https://github.com/vllm-project/llm-compressor).
The model is optimized for NVIDIA Blackwell/Hopper GPUs (H100, B200) and vLLM.
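As a rough back-of-envelope estimate of what NVFP4 buys in weight memory (assuming NVFP4's 4-bit E2M1 values with one FP8 E4M3 scale per 16-element block, and ignoring any layers kept in higher precision, KV cache, and activations):

```python
PARAMS = 24e9  # approximate parameter count of the 24B model

# FP16 baseline: 2 bytes per weight
fp16_gb = PARAMS * 2 / 1e9

# NVFP4: 4-bit values plus one 8-bit (E4M3) scale per 16-element block,
# i.e. about 4.5 effective bits per weight (per-tensor scales are negligible)
bits_per_weight = 4 + 8 / 16
nvfp4_gb = PARAMS * bits_per_weight / 8 / 1e9

print(f"FP16: {fp16_gb:.0f} GB, NVFP4: {nvfp4_gb:.1f} GB")  # ~48 GB vs ~13.5 GB
```

This is why the quantized model fits comfortably on a single GPU at `--tensor-parallel-size 1`.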
## Benchmarks (Run on DGX Spark)
| Metric | Base Model (FP16) | This Model (NVFP4) | Delta |
|---|---|---|---|
| HellaSwag (commonsense) | 83.47% | 83.20% | -0.27 pts |
| IFEval (strict) | 71.46% | 70.50% | -0.96 pts |
| Throughput | 712 tok/s | 1344 tok/s | +1.88x |
## Usage

> ⚠️ **This model requires vLLM.** It will **not** work with GGUF, llama.cpp, or Ollama.
### vLLM Command

```bash
vllm serve ealexeev/Mistral-Small-24B-NVFP4 \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.8 \
  --enforce-eager
```
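Once the server is up, any OpenAI-compatible client can query it. A minimal sketch using only the Python standard library (assumes vLLM's default port 8000; adjust the endpoint if you pass `--port`):

```python
import json
from urllib.request import Request, urlopen

MODEL = "ealexeev/Mistral-Small-24B-NVFP4"
ENDPOINT = "http://localhost:8000/v1/chat/completions"  # vLLM's default port

def build_chat_request(prompt: str, max_tokens: int = 256) -> dict:
    """Assemble an OpenAI-compatible chat completion payload."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(prompt: str) -> str:
    """POST the payload to the running vLLM server and return the reply text."""
    data = json.dumps(build_chat_request(prompt)).encode()
    req = Request(ENDPOINT, data=data, headers={"Content-Type": "application/json"})
    with urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]

# Example (requires the server above to be running):
# print(chat("Explain NVFP4 quantization in one sentence."))
```

The official `openai` Python client works the same way: point `base_url` at `http://localhost:8000/v1` and use any placeholder API key.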