Devstral-24B-NVFP4-NVembed

NVFP4-quantized Devstral-Small-2-24B-Instruct: a 24B-parameter dense coding model that fits in 12.4 GiB of VRAM and runs a 31K-token context on a single RTX 5080.

What's special

  • Full NVFP4: all Linear layers, lm_head, and embed_tokens quantized to NVFP4
  • FP8 KV cache: GQA attention (32 Q / 8 KV heads), stored in FP8
  • 31,200-token context on 16 GB VRAM (vs ~20K with BF16 embed)
  • 12.4 GiB single-file model.safetensors
  • YARN rope scaling (factor=48) for long-context support
  • Standard MistralForCausalLM architecture, so no --trust-remote-code needed

Hardware

  • GPU: NVIDIA RTX 5080 (16 GB VRAM, SM 12.0 / Blackwell)
  • NVFP4 GEMM: Marlin backend (FlashInfer FP4 JIT fails on SM 12.0)

Should work on any Blackwell GPU (RTX 50-series, B-series datacenter). Context length scales with available VRAM.
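As a rough sketch, the Blackwell requirement can be gated in Python before launching. `supports_nvfp4` is a hypothetical helper, not part of vLLM; the `(major, minor)` tuple follows the convention of `torch.cuda.get_device_capability()`:

```python
def supports_nvfp4(capability):
    """True for Blackwell-class GPUs (SM 10.x datacenter, SM 12.x consumer).

    `capability` is a (major, minor) tuple in the style of
    torch.cuda.get_device_capability(). This is a rough heuristic for this
    model card, not an official support matrix.
    """
    major, _minor = capability
    return major >= 10

# The RTX 5080 reports SM 12.0, so supports_nvfp4((12, 0)) is True,
# while an Ada-generation RTX 4090 (SM 8.9) is not covered.
```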

Quick start

1. Clone and build vLLM

git clone https://github.com/vllm-project/vllm.git
cd vllm
git checkout 628302114  # v0.16.1rc1.dev34

# Apply patches (required for NVFP4 lm_head and embed_tokens)
git apply ../Devstral-24B-NVFP4-NVembed/vllm_patches.diff

# Build
pip install -e . --no-build-isolation

2. Install dependencies

pip install flashinfer==0.6.4  # or latest compatible

3. Serve

export VLLM_TEST_FORCE_FP8_MARLIN=1
export VLLM_NVFP4_GEMM_BACKEND=marlin
export PYTORCH_ALLOC_CONF=expandable_segments:True
export VLLM_FLASHINFER_WORKSPACE_BUFFER_SIZE=$((64 * 1024 * 1024))

vllm serve ./Devstral-24B-NVFP4-NVembed \
    --dtype bfloat16 \
    --quantization modelopt \
    --kv-cache-dtype fp8_e4m3 \
    --max-model-len 31200 \
    --no-enable-prefix-caching \
    --max-num-seqs 1 \
    --kv-cache-memory-bytes $((1950 * 1310720)) \
    --num-gpu-blocks-override 1950 \
    --tensor-parallel-size 1 \
    --enable-chunked-prefill \
    --max-num-batched-tokens 256 \
    --enforce-eager \
    --override-generation-config '{"temperature": 0.0, "max_tokens": 8000}'
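The KV-cache numbers in this command line are mutually consistent. A sketch of the arithmetic, assuming vLLM's default 16-token KV block size:

```python
# Back out the serve-command numbers from the model's shape.
LAYERS, KV_HEADS, HEAD_DIM = 40, 8, 128
BYTES_PER_ELEM = 1            # fp8_e4m3 KV cache
BLOCK_TOKENS = 16             # vLLM's default KV block size (assumption)
BLOCKS = 1950                 # --num-gpu-blocks-override

kv_per_token = LAYERS * KV_HEADS * HEAD_DIM * 2 * BYTES_PER_ELEM  # K and V
kv_per_block = kv_per_token * BLOCK_TOKENS

print(kv_per_token)                    # 81920 bytes = 80 KiB per token
print(kv_per_block)                    # 1310720 -> the kv-cache-memory-bytes multiplier
print(BLOCKS * BLOCK_TOKENS)           # 31200  -> matches --max-model-len
print(BLOCKS * kv_per_block / 2**30)   # ~2.38 GiB of KV cache
```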

4. Test

curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Devstral-24B-NVFP4-NVembed",
       "messages": [{"role": "user", "content": "Write a Python function to check if a number is prime."}]}' \
  | python3 -m json.tool
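The same request can be made from Python with only the standard library. `build_chat_request` and `ask` are hypothetical helpers; the endpoint and model name follow the serve command above:

```python
import json
import urllib.request

API_URL = "http://localhost:8000/v1/chat/completions"

def build_chat_request(prompt, model="Devstral-24B-NVFP4-NVembed"):
    """Payload for vLLM's OpenAI-compatible chat endpoint."""
    return {"model": model,
            "messages": [{"role": "user", "content": prompt}]}

def ask(prompt):
    """POST the request and return the assistant's reply (server must be running)."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(build_chat_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```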

Serve command flags explained

  • --quantization modelopt: NVFP4 weight format from NVIDIA Model Optimizer
  • --kv-cache-dtype fp8_e4m3: FP8 KV cache, halving KV memory vs BF16
  • --no-enable-prefix-caching: disabled for maximum VRAM headroom
  • --max-num-seqs 1: single sequence, to stay inside the 16 GB VRAM budget
  • --kv-cache-memory-bytes: precise KV allocation (bypasses the gpu_memory_utilization startup check)
  • --num-gpu-blocks-override 1950: binary-searched maximum stable block count on the RTX 5080
  • --enforce-eager: CUDA graphs disabled to save VRAM
  • --override-generation-config: temperature=0 for deterministic coding output

Performance

  • Model size: 12.4 GiB
  • Max context: 31,200 tokens
  • KV cache blocks: 1,950
  • KV cache size: ~2.4 GiB
  • KV per token: 80 KiB (40 layers x 8 KV heads x 128 dim x 2 x 1 byte)
  • Architecture: dense, 40 layers, GQA (32 Q / 8 KV heads)
  • Parameters: 24B

Quantization details

  • Method: NVIDIA Model Optimizer (nvidia-modelopt==0.41.0)
  • Format: NVFP4 (4-bit floating point, group_size=16)
  • Calibration: 512 samples across 4 task types (code, instruction, agentic, structured data from databricks-dolly-15k)
  • lm_head + embed_tokens: also NVFP4 (saves ~1.9 GB combined)
  • Source model: Dequantized from the original FP8 checkpoint to BF16, then quantized to NVFP4
  • Base model: mistralai/Devstral-Small-2-24B-Instruct
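For intuition, NVFP4 represents each weight on the 4-bit e2m1 grid (±{0, 0.5, 1, 1.5, 2, 3, 4, 6}) with one scale per 16-value group. The sketch below is illustrative fake-quantization only; the real format (via nvidia-modelopt) packs two 4-bit values per byte and stores FP8 group scales:

```python
# Illustrative fake-quantization: shows the numeric effect of the e2m1 grid
# with group_size=16, not the real packed NVFP4 layout.
E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
GRID = sorted({s * v for v in E2M1 for s in (1.0, -1.0)})

def fake_nvfp4(weights, group_size=16):
    out = []
    for i in range(0, len(weights), group_size):
        group = weights[i:i + group_size]
        # Map the group's max magnitude onto e2m1's max representable value (6.0).
        scale = (max(abs(x) for x in group) / 6.0) or 1.0
        out.extend(scale * min(GRID, key=lambda g, x=x: abs(g - x / scale))
                   for x in group)
    return out

weights = [0.01 * k for k in range(16)]
quantized = fake_nvfp4(weights)
```

On this toy group the round-trip error stays below half the widest grid gap times the group scale; real calibration additionally tunes scales against activation statistics.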

vLLM patches

The included vllm_patches.diff modifies 3 files based on vLLM commit 628302114:

  • modelopt.py: NVFP4 lm_head + embed_tokens support; PerTensorScaleParameter init fix (uninitialized slots caused inf in merged linears)
  • vocab_parallel_embedding.py: FP8 weight dtype preservation and BF16 cast in embedding()
  • linear.py: FP8 weight cast in the forward pass

Upstream PRs

Some of these patches have been submitted upstream and may become unnecessary in future vLLM releases:

  • #35576: MLA weight access crash fix for quantized layers
  • #35660: NVFP4-quantized lm_head support