Voxtral-Mini-4B-Realtime-2602 — FP8 (W8A8 dynamic)
FP8-quantized submission for the Audio-to-Text track, built on
mistralai/Voxtral-Mini-4B-Realtime-2602 (the base model remains the primary
model in inference; this is post-training quantization only, no finetuning).
Approach
The model is quantized to FP8 E4M3 (W8A8 dynamic) and stored in the
compressed-tensors format:
- Static, per-output-channel weight scales (scale = absmax_channel / 448).
- Dynamic, per-token activation quantization, computed by vLLM at runtime (no activation scales stored).
This targets the L4's native FP8 matmul (Ada, SM 8.9), giving real compute and memory-bandwidth energy savings — not just a smaller disk footprint.
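The two scale computations listed above can be sketched in plain Python. This is a minimal illustration, not the submission's quantize_voxtral_native_fp8.py: the function names, the eps guard against all-zero rows, and the example values are assumptions, and the final cast to an FP8 dtype (which rounds each scaled value to the E4M3 grid) is omitted.

```python
FP8_E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3


def absmax_scale(values, eps=1e-12):
    """Return the scale that maps the largest |value| onto the FP8 E4M3 maximum."""
    absmax = max(abs(v) for v in values)
    # eps is an assumed guard so an all-zero row does not produce a zero scale.
    return max(absmax, eps) / FP8_E4M3_MAX


def scale_rows(matrix):
    """Scale each row independently; returns (scaled_rows, per_row_scales)."""
    scales = [absmax_scale(row) for row in matrix]
    scaled = [[v / s for v in row] for row, s in zip(matrix, scales)]
    return scaled, scales


# Weights: one row per output channel; scales are computed once and stored.
weight = [[1.0, -2.0, 0.5], [0.25, 0.125, -0.0625]]
w_scaled, w_scales = scale_rows(weight)

# Activations: one row per token; scales are recomputed for each input at runtime.
activations = [[3.0, -1.5], [0.1, 0.4]]
a_scaled, a_scales = scale_rows(activations)
```

The same row-wise absmax rule serves both cases; what differs is when it runs. Weight scales are fixed at quantization time, while activation scales depend on the input and so cannot be stored in the checkpoint.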
Quantization is applied directly to the Mistral-native consolidated.safetensors,
preserving native key names so vLLM's purpose-built realtime Voxtral loader serves
it (--config-format mistral --load-format mistral). No HuggingFace
save_pretrained round-trip is used, which would rename submodules and break the
native loader.
What is quantized
- Decoder attention projections (wq/wk/wv/wo) and MLP (w1/w2/w3) for all 26 decoder layers.
- (If the encoder variant was used) the streaming Whisper encoder's transformer attention + MLP Linear weights. The encoder runs on every audio chunk, so this is the dominant inference-energy term for realtime ASR.
What stays BF16
- All norms (attention_norm, ffn_norm, top-level norm).
- The adaptive-norm conditioning MLP ada_rms_norm_t_cond.{0,2} (Linear-shaped, accuracy-sensitive, negligible compute).
- Conv stem and all biases.
- The audio↔language connector (audio_language_projection).
- Tied token embeddings / lm_head.
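One way to express the split between quantized and BF16 tensors is an allow-list over checkpoint key names. The sketch below is an assumption about how such a filter could look, not the actual script: the full key layouts (for example "layers.0.attention.wq.weight") and the "whisper_encoder" marker are hypothetical, and only the module names wq/wk/wv/wo and w1/w2/w3 come from the lists above.

```python
# Module names taken from the "What is quantized" list.
QUANTIZED_MODULES = {"wq", "wk", "wv", "wo", "w1", "w2", "w3"}


def should_quantize(key, quantize_encoder=False, encoder_marker="whisper_encoder"):
    """Decide whether a checkpoint tensor should be quantized to FP8.

    `encoder_marker` is a hypothetical substring identifying encoder tensors.
    """
    parts = key.split(".")
    # Only ".weight" tensors qualify, so biases are always left alone.
    if len(parts) < 2 or parts[-1] != "weight":
        return False
    # Encoder weights are quantized only when explicitly requested.
    if encoder_marker in key and not quantize_encoder:
        return False
    # Allow-list on the owning module: norms, the conditioning MLP, the
    # connector, embeddings and the conv stem never match these names.
    return parts[-2] in QUANTIZED_MODULES


# Hypothetical key names, for illustration only.
example_keys = [
    "layers.0.attention.wq.weight",
    "layers.0.attention_norm.weight",
    "layers.0.ada_rms_norm_t_cond.0.weight",
]
to_quantize = [k for k in example_keys if should_quantize(k)]
```

An allow-list fails safe: a tensor with an unrecognized name stays in BF16 instead of being quantized by accident.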
Serving config
See vllm_config.yaml. Key choices:
- tokenizer_mode / config_format / load_format: mistral. Required for the native Voxtral realtime path and the bundled tekken.json tokenizer.
- compilation_config: '{"cudagraph_mode":"PIECEWISE"}' and no enforce_eager. This enables torch.compile + CUDA graphs, a large energy win within the enforced time budget.
- max_model_len left high enough not to truncate eval clips (the FAQ notes too-low values hurt benchmark scores); lower only if clips are confirmed short.
- No infra-specific params (no tensor-parallel-size / swap_space / logging flags).
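For local testing, the same choices can be passed as command-line flags instead of through vllm_config.yaml. The sketch below only assembles and prints an argument list. It assumes the hyphenated vllm serve flags correspond to the YAML keys above, that ./Voxtral-FP8 is the model directory, and that leaving --max-model-len unset is acceptable; check all three against the vLLM build in use.

```python
import json
import shlex


def build_serve_args(model_dir):
    """Assemble a `vllm serve` command line mirroring the key choices above."""
    compilation_config = json.dumps(
        {"cudagraph_mode": "PIECEWISE"}, separators=(",", ":")
    )
    return [
        "vllm", "serve", model_dir,
        "--tokenizer-mode", "mistral",
        "--config-format", "mistral",
        "--load-format", "mistral",
        "--compilation-config", compilation_config,
        # --enforce-eager is deliberately absent so CUDA graphs stay enabled.
        # --max-model-len is not passed here (an assumption of this sketch).
    ]


args = build_serve_args("./Voxtral-FP8")  # hypothetical model directory
print(shlex.join(args))
```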
Reproduce
pip install torch safetensors
python quantize_voxtral_native_fp8.py \
--src-dir <dir with consolidated.safetensors> \
--save-dir ./Voxtral-FP8 \
[--quantize-encoder]
Known issues / notes for the eval team
- Built and validated against vLLM 0.23.0.
- If load fails on a per-channel scale shape, the weight scale is stored as [out, 1] float32; the build may expect [out], so squeeze the trailing dim.
- pip freeze for the build environment is included for debugging.
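If the scale-shape issue does come up, the squeeze could be applied to the checkpoint along these lines. This is a sketch under assumptions: the "weight_scale" key suffix and both function names are hypothetical, so inspect the real key names in consolidated.safetensors first. The helpers are duck-typed and work on any tensor exposing .shape and .squeeze, such as torch tensors.

```python
def squeeze_trailing_unit_dim(tensor):
    """Return `tensor` reshaped from [out, 1] to [out]; other shapes pass through."""
    shape = tuple(tensor.shape)
    if len(shape) == 2 and shape[1] == 1:
        return tensor.squeeze(-1)
    return tensor


def fix_scale_shapes(state_dict, scale_suffix="weight_scale"):
    """Squeeze every tensor whose key ends with `scale_suffix` (assumed naming)."""
    return {
        key: squeeze_trailing_unit_dim(t) if key.endswith(scale_suffix) else t
        for key, t in state_dict.items()
    }


# Possible use with safetensors (requires torch and safetensors installed):
#   from safetensors.torch import load_file, save_file
#   tensors = load_file("consolidated.safetensors")
#   save_file(fix_scale_shapes(tensors), "consolidated.fixed.safetensors")
```

Note that load_file returns tensors only, so any metadata in the original file would need to be carried over separately.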