You need to agree to share your contact information to access this model

This repository is publicly accessible, but you have to accept the conditions to access its files and content.

Voxtral-Mini-4B-Realtime-2602 — FP8 (W8A8 dynamic)

FP8-quantized submission for the Audio-to-Text track, built on mistralai/Voxtral-Mini-4B-Realtime-2602 (the base model remains the primary model in inference; this is post-training quantization only, no finetuning).

Approach

Weights are quantized to FP8 E4M3, W8A8 dynamic using the compressed-tensors format:

Static, per-output-channel weight scales (scale = absmax_channel / 448).
Dynamic, per-token activation quantization, computed by vLLM at runtime (no activation scales stored).

This targets the L4's native FP8 matmul (Ada, SM 8.9), giving real compute and memory-bandwidth energy savings — not just a smaller disk footprint.

Quantization is applied directly to the Mistral-native consolidated.safetensors, preserving native key names so vLLM's purpose-built realtime Voxtral loader serves it (--config-format mistral --load-format mistral). No HuggingFace save_pretrained round-trip is used, which would rename submodules and break the native loader.

What is quantized

Decoder attention projections (wq/wk/wv/wo) and MLP (w1/w2/w3) for all 26 decoder layers.
(If the encoder variant was used) the streaming Whisper encoder's transformer attention + MLP Linear weights — the encoder runs on every audio chunk, so this is the dominant inference-energy term for realtime ASR.

What stays BF16

All norms (attention_norm, ffn_norm, top-level norm).
The adaptive-norm conditioning MLP ada_rms_norm_t_cond.{0,2} (Linear-shaped, accuracy-sensitive, negligible compute).
Conv stem and all biases.
The audio↔language connector (audio_language_projection).
Tied token embeddings / lm_head.

Serving config

See vllm_config.yaml. Key choices:

tokenizer_mode/config_format/load_format: mistral — required for the native Voxtral realtime path and the bundled tekken.json tokenizer.
compilation_config: '{"cudagraph_mode":"PIECEWISE"}' and no enforce_eager — enables torch.compile + CUDA graphs, a large energy win within the enforced time budget.
max_model_len left high enough not to truncate eval clips (the FAQ notes too-low values hurt benchmark scores); lower only if clips are confirmed short.
No infra-specific params (no tensor-parallel-size / swap_space / logging flags).

Reproduce

pip install torch safetensors
python quantize_voxtral_native_fp8.py \
  --src-dir <dir with consolidated.safetensors> \
  --save-dir ./Voxtral-FP8\
  [--quantize-encoder]

Known issues / notes for the eval team

Built and validated against vLLM 0.23.0.
If load fails on a per-channel scale shape, the weight scale is stored as [out, 1] float32; the build may expect [out] — squeeze the trailing dim.
pip freeze for the build environment is included for debugging.

Downloads last month: 76

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support