Voxtral-Mini-4B-Realtime, FP8 dynamic (Round 2)

Compressed variant of mistralai/Voxtral-Mini-4B-Realtime-2602 submitted to the Resilient AI Challenge, Audio-to-Text category.

The compression is performed by vLLM at server startup through the engine's built-in FP8 pathway (quantization: fp8). There are no offline weight modifications: the engine loads the original Mistral consolidated checkpoint and quantizes the language-model Linear layers to FP8 in memory before serving.

Launch command used by the organizers (as-is):

vllm serve --config vllm_config.yaml

Method

Quantization scheme

  • FP8 (E4M3) dynamic quantization on the language decoder Linear layers.
  • Weights: per-tensor scales computed at load time (round-to-nearest from the bf16 master weights).
  • Activations: per-token, computed dynamically at each forward pass.
  • The audio encoder (audio_tower.*) and the multimodal projector (multi_modal_projector.*) are left in bf16.
  • lm_head is left in bf16 (vLLM default for quantization: fp8).
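The difference between the per-tensor weight scale and the per-token activation scales can be sketched in a few lines of plain Python. This is only an illustration of the scaling arithmetic, not vLLM's kernel: the helper names are hypothetical, 448 is the largest finite magnitude of the E4M3 format, and the final rounding onto the FP8 grid is left out.

```python
FP8_E4M3_MAX = 448.0  # largest finite magnitude in the E4M3 ("fn") format


def per_tensor_scale(weight):
    """One scale for the whole weight matrix (a list of rows)."""
    amax = max(abs(v) for row in weight for v in row)
    return amax / FP8_E4M3_MAX if amax > 0 else 1.0


def per_token_scales(activations):
    """One scale per row, where each row is one token's activation vector."""
    scales = []
    for row in activations:
        amax = max(abs(v) for v in row)
        scales.append(amax / FP8_E4M3_MAX if amax > 0 else 1.0)
    return scales


def scale_to_fp8_range(matrix, scales):
    """Divide each row by its scale and clamp to the E4M3 range.

    Rounding to the nearest representable FP8 value is omitted here.
    """
    return [
        [max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v / s)) for v in row]
        for row, s in zip(matrix, scales)
    ]


# Toy values, chosen only to make the scales easy to read.
weight = [[0.5, -2.0], [1.0, 0.25]]
acts = [[10.0, -4.0], [0.1, 0.05]]

w_scale = per_tensor_scale(weight)      # computed once, at load time
a_scales = per_token_scales(acts)       # recomputed at every forward pass
w_scaled = scale_to_fp8_range(weight, [w_scale] * len(weight))
a_scaled = scale_to_fp8_range(acts, a_scales)

print("weight scale:", w_scale)
print("activation scales:", a_scales)
```

The two activation rows differ by two orders of magnitude, which is the case where a per-token scale preserves more of the small row than a single shared scale would.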

What is not changed

  • Architecture: identical to the base model (same config, layers, heads, audio encoder, delay attention scheduling).
  • Tokenizer: identical (Mistral tekken, tokenizer_mode: mistral).
  • max_model_len: 45000, kept as in the baseline. The organizers warned that decreasing it affects benchmark performance.
  • No fine-tuning, no distillation, no architectural change.

Why FP8 dynamic and not a more aggressive scheme

I explored several compression paths for Round 2 before settling on FP8 dynamic. The full write-ups and logs are in exploration/. In short:

  • GPTQ W4A16: abandoned. llm-compressor's FX tracing of the Voxtral forward crashed inside the audio encoder before calibration could run.
  • AWQ W4A16: abandoned. Calibration completed, but the reassembled checkpoint failed to load on vLLM with a chain of missing encoder-config attributes that have no Voxtral equivalent.
  • KV cache FP8: blocked. It requires the FlashInfer backend, which is not available on the Voxtral attention path.
  • MXFP8: works and loads cleanly, but it is slower and ends up consuming more absolute energy than FP8 on the full suite (see the comparison below). Since the challenge ranks on absolute energy, FP8 is the better submission.

FP8 dynamic loads cleanly, passes the quality gate on all 26 language-by-corpus pairs, and gives the lowest absolute energy of the working methods.

Results (internal measurements)

Hardware and stack for these numbers: NVIDIA RTX 4090 (24 GB, compute capability 8.9), single GPU isolated via CUDA_VISIBLE_DEVICES=2, vLLM 0.22. Energy is measured with CodeCarbon in process tracking mode, with gpu_ids pinned to the visible device to isolate the run from other users on the shared host.

Quality is measured on all 13 Voxtral languages across 2 standard ASR corpora (FLEURS, Common Voice 17). For Japanese and Chinese I report CER, because WER is misleading on references with no inter-word spaces.
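A minimal sketch of why WER breaks down on unsegmented text, using a plain Levenshtein distance. The function names and the example sentences are illustrative assumptions, and a real evaluation pipeline would normally apply text normalization before scoring.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitute, insert, delete)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def wer(ref, hyp):
    """Word error rate: tokens are whitespace-separated words."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / len(ref_words)


def cer(ref, hyp):
    """Character error rate: tokens are characters, spaces ignored."""
    ref_chars = list(ref.replace(" ", ""))
    return edit_distance(ref_chars, list(hyp.replace(" ", ""))) / len(ref_chars)


# A Japanese reference has no spaces, so the whole sentence is one "word".
ref_ja = "今日は天気がいいです"
hyp_ja = "今日は天気がよいです"  # one character differs

print("ja WER:", wer(ref_ja, hyp_ja))  # a single wrong character fails the whole "word"
print("ja CER:", cer(ref_ja, hyp_ja))  # one error out of ten characters
print("en WER:", wer("the cat sat", "the cat sit"))
```

With one wrong character out of ten, the Japanese example scores 100% WER but 10% CER, which is the distortion the CER choice avoids.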

Quality (per-language, FP8 vs bf16 baseline)

dataset metric base FP8 Δrel
fleurs/en WER 8.90% 8.31% -6.7%
fleurs/fr WER 9.50% 9.39% -1.2%
fleurs/de WER 5.98% 5.47% -8.6%
fleurs/es WER 3.00% 3.58% +19.2%
fleurs/it WER 4.60% 4.71% +2.5%
fleurs/pt WER 4.41% 4.25% -3.6%
fleurs/nl WER 8.59% 9.03% +5.2%
fleurs/hi WER 17.72% 17.13% -3.3%
fleurs/ja CER 11.66% 11.60% -0.5%
fleurs/ko WER 13.96% 13.89% -0.5%
fleurs/zh CER 51.24% 51.34% +0.2%
fleurs/ar WER 16.05% 16.99% +5.9%
fleurs/ru WER 5.45% 5.61% +2.9%
cv/en WER 15.56% 15.88% +2.1%
cv/fr WER 10.96% 10.40% -5.2%
cv/de WER 8.64% 9.97% +15.4%
cv/es WER 6.51% 6.72% +3.2%
cv/it WER 6.47% 6.89% +6.3%
cv/pt WER 10.06% 10.73% +6.6%
cv/nl WER 10.94% 11.16% +2.0%
cv/hi WER 19.38% 20.04% +3.4%
cv/ja CER 25.37% 21.50% -15.2%
cv/ko WER 29.23% 29.23% +0.0%
cv/zh CER 26.97% 30.11% +11.6%
cv/ar WER 58.47% 60.49% +3.5%
cv/ru WER 8.29% 8.06% -2.8%

Quality verdict: 26 of 26 language-by-corpus pairs pass both readings of the 80% threshold.

  • Strict reading (variant_err <= 1.25 * base_err): 26 / 26 PASS.
  • Permissive reading ((1 - variant_err) >= 0.80 * (1 - base_err)): 26 / 26 PASS.
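The two readings can be written as small predicates and checked against the table. The sketch below uses the three pairs with the largest relative regressions, with error rates expressed as fractions; the function names are illustrative, not part of the challenge tooling.

```python
def passes_strict(base_err, variant_err):
    """Strict reading: the variant's error may grow by at most 25% relative."""
    return variant_err <= 1.25 * base_err


def passes_permissive(base_err, variant_err):
    """Permissive reading: the variant keeps at least 80% of the base accuracy."""
    return (1.0 - variant_err) >= 0.80 * (1.0 - base_err)


def relative_delta(base_err, variant_err):
    """Relative change in error rate, as reported in the last table column."""
    return (variant_err - base_err) / base_err


# (base, FP8) error rates copied from the quality table above.
worst_pairs = {
    "fleurs/es": (0.0300, 0.0358),
    "cv/de": (0.0864, 0.0997),
    "cv/zh": (0.2697, 0.3011),
}

for name, (base, fp8) in worst_pairs.items():
    print(
        f"{name}: delta={relative_delta(base, fp8):+.1%} "
        f"strict={passes_strict(base, fp8)} "
        f"permissive={passes_permissive(base, fp8)}"
    )
```

The strict reading is the binding one here: the permissive reading only fails when accuracy itself drops by a fifth, which no pair comes close to.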

The Hindi pair (the only Voxtral-supported Indic language, my proxy for the Indic benchmarks added by the organizers) is stable on both corpora: -3.3% relative on FLEURS and +3.4% on Common Voice.

Energy (full benchmark suite, 26 language by corpus pairs)

metric | baseline (bf16) | FP8 | ratio (FP8 / baseline)
Total energy (CodeCarbon) | 570.9 Wh | 515.0 Wh | 0.902x
GPU energy only (NVML) | 530.3 Wh | 469.6 Wh | 0.885x
Generation tokens (server-side) | 291,991 | 304,137 | 1.042x
Wall-clock duration | 13,813 s | 15,834 s | 1.146x

Energy verdict: 9.8% total energy savings, 11.5% GPU-only savings, with the variant generating 4.2% more tokens than the baseline.
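Because the FP8 run generated more tokens, it can be useful to normalize energy by token count as well. The sketch below does that from the figures in the table; the per-token view is an extra normalization added for illustration only, since the challenge ranks on absolute energy.

```python
# Figures copied from the energy table above.
runs = {
    "bf16": {"total_wh": 570.9, "tokens": 291_991},
    "fp8": {"total_wh": 515.0, "tokens": 304_137},
}


def mwh_per_token(run):
    """Total energy per generated token, in milliwatt-hours."""
    return 1000.0 * run["total_wh"] / run["tokens"]


total_ratio = runs["fp8"]["total_wh"] / runs["bf16"]["total_wh"]
token_ratio = runs["fp8"]["tokens"] / runs["bf16"]["tokens"]
per_token_ratio = mwh_per_token(runs["fp8"]) / mwh_per_token(runs["bf16"])

print(f"total energy ratio:     {total_ratio:.3f}x")
print(f"token count ratio:      {token_ratio:.3f}x")
print(f"energy per token ratio: {per_token_ratio:.3f}x")
```

The per-token ratio is simply the total ratio divided by the token ratio, so it comes out lower than the absolute figure whenever the variant generates more tokens.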

The GPU-only ratio (0.885x) is the more relevant figure for the L4 evaluation hardware: the L4 host CPU is much smaller than the Threadripper PRO 5995WX of my dev machine, so the CPU share of total energy will be smaller on the L4 and the ratio should land closer to the GPU-only number.
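The argument can be made concrete by splitting each total into its GPU part and the remainder, then shrinking the remainder. This is a rough sketch: treating "total minus NVML" as the host (CPU and RAM) share is an assumption, and the `host_weight` parameter is a hypothetical knob, not a measured property of the L4 host.

```python
# Figures copied from the energy table above.
baseline = {"total_wh": 570.9, "gpu_wh": 530.3}
fp8 = {"total_wh": 515.0, "gpu_wh": 469.6}


def non_gpu_wh(run):
    """Energy not attributed to the GPU (assumed to be host CPU and RAM)."""
    return run["total_wh"] - run["gpu_wh"]


def blended_ratio(host_weight):
    """FP8 / baseline energy ratio if host energy were scaled by host_weight.

    host_weight = 1.0 reproduces the dev machine; 0.0 is the GPU-only limit.
    """
    fp8_wh = fp8["gpu_wh"] + host_weight * non_gpu_wh(fp8)
    base_wh = baseline["gpu_wh"] + host_weight * non_gpu_wh(baseline)
    return fp8_wh / base_wh


print(f"host share, baseline: {non_gpu_wh(baseline):.1f} Wh")
print(f"host share, FP8:      {non_gpu_wh(fp8):.1f} Wh")
for w in (1.0, 0.5, 0.25, 0.0):
    print(f"host_weight={w:.2f} -> ratio {blended_ratio(w):.3f}x")
```

The host share is larger for the FP8 run because that run takes longer, which is why the total ratio is less favorable than the GPU-only one and why a smaller host moves the ratio toward the GPU-only figure.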

Comparison with MXFP8 (why it was not submitted)

I also ran a full MXFP8 evaluation on the same stack (vLLM 0.22, same suite, same protocol). MXFP8 quantizes more aggressively and has a smaller loaded footprint (5.26 GiB versus 8.43 GiB for bf16), and it passes the quality gate on all 26 pairs as well. But it loses on the criterion that decides the ranking, absolute energy:

metric (26 pairs) | baseline | FP8 | MXFP8
Total energy | 570.9 Wh | 515.0 Wh | 548.7 Wh
Savings vs baseline | reference | 9.8% | 3.9%
Wall-clock duration | 3.84 h | 4.40 h | 5.83 h

MXFP8 consumes 6.5% more absolute energy than FP8 and is markedly slower (5.83 h versus 4.40 h). The Marlin MXFP8 GEMM kernel is slower than the native FP8 path, and the extra GPU time outweighs the more aggressive weight compression in terms of total energy. The smaller memory footprint does not affect the ranking, which is based on energy. For these reasons I submit FP8 and keep MXFP8 documented as an explored alternative in exploration/.

(The MXFP8 total above sums two measured runs on the same stack: the main run covering 24 language by corpus pairs plus a follow-up run covering Russian, which the main run had not yet evaluated.)
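Since energy is average power multiplied by duration, the comparison table can be decomposed into those two factors. The sketch below does this from the table figures; it assumes the reported durations cover the same runs as the energy totals, and the resulting power is an average over the whole run, host included.

```python
# (total energy in Wh, wall-clock duration in hours), from the comparison table.
runs = {
    "bf16": (570.9, 3.84),
    "fp8": (515.0, 4.40),
    "mxfp8": (548.7, 5.83),
}

# Average power in watts: energy divided by duration.
avg_power_w = {name: wh / hours for name, (wh, hours) in runs.items()}

for name, (wh, hours) in runs.items():
    print(f"{name:6s} {wh:6.1f} Wh = {avg_power_w[name]:6.1f} W x {hours:.2f} h")

mxfp8_vs_fp8 = runs["mxfp8"][0] / runs["fp8"][0]
print(f"MXFP8 / FP8 total energy: {mxfp8_vs_fp8:.3f}x")
```

On these figures MXFP8 draws the lowest average power of the three, but its longer duration more than cancels that advantage against FP8.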

How to serve this model

The submission contains the original Mistral weights, vllm_config.yaml, and this README. The config:

model: mistralai/Voxtral-Mini-4B-Realtime-2602
tokenizer_mode: mistral
config_format: mistral
load_format: mistral
trust_remote_code: true
quantization: fp8
max_model_len: 45000
gpu_memory_utilization: 0.80
compilation_config:
  cudagraph_mode: PIECEWISE

vLLM reads quantization: fp8, loads the language decoder Linear weights and quantizes them to FP8 (E4M3) in memory, keeps the audio encoder and lm_head in bf16, then serves the OpenAI-compatible Realtime WebSocket endpoint.
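For readers who prefer explicit flags, the config can be translated into a command line. The sketch below assumes that vLLM maps each YAML key to the long command-line flag of the same name with underscores replaced by dashes, and that nested options are passed as JSON; the `to_cli` helper is hypothetical, and the organizers' launch command remains the `--config` form.

```python
import json
import shlex

# Same keys and values as vllm_config.yaml above.
config = {
    "model": "mistralai/Voxtral-Mini-4B-Realtime-2602",
    "tokenizer_mode": "mistral",
    "config_format": "mistral",
    "load_format": "mistral",
    "trust_remote_code": True,
    "quantization": "fp8",
    "max_model_len": 45000,
    "gpu_memory_utilization": 0.80,
    "compilation_config": {"cudagraph_mode": "PIECEWISE"},
}


def to_cli(cfg):
    """Build a `vllm serve` command from a config dict (assumed key-to-flag mapping)."""
    args = ["vllm", "serve", cfg["model"]]
    for key, value in cfg.items():
        if key == "model":
            continue  # the model is passed as a positional argument
        flag = "--" + key.replace("_", "-")
        if isinstance(value, bool):
            if value:
                args.append(flag)  # boolean switches take no value
        elif isinstance(value, dict):
            args += [flag, json.dumps(value)]  # nested options as JSON
        else:
            args += [flag, str(value)]
    return shlex.join(args)


command = to_cli(config)
print(command)
```

Printing the command rather than running it keeps the sketch safe to execute on a machine without vLLM installed.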

Known issues at launch

Two patches must be applied in the virtual environment before serving Voxtral Realtime on this stack. They are also required for the bf16 baseline, so they are not specific to FP8. Both are in patches_required/ with shell scripts that apply them:

  1. transformers/tokenization_mistral_common.py: widen the kwargs whitelist.
  2. vllm/transformers_utils/processors/voxtral.py: add a fetch_audio method to MistralCommonFeatureExtractor.

Hardware compatibility

FP8 (E4M3) tensor operations require compute capability 8.9 (Ada Lovelace) or higher. The NVIDIA L4 used for evaluation is compute capability 8.9 and supports the required kernels natively.
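A quick way to check a machine against this requirement is to compare its compute capability with (8, 9). The sketch below is an illustration: the helper names are hypothetical, the threshold is the one stated above, and the PyTorch query is optional and skipped when PyTorch or a CUDA device is unavailable.

```python
FP8_MIN_CAPABILITY = (8, 9)  # Ada Lovelace, per the requirement stated above


def supports_native_fp8(major, minor):
    """True if the compute capability meets the FP8 (E4M3) requirement."""
    return (major, minor) >= FP8_MIN_CAPABILITY


def local_gpu_capability():
    """Return (major, minor) of GPU 0, or None if it cannot be queried."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability(0)


# Known capabilities: RTX 4090 and L4 are 8.9, A100 is 8.0, H100 is 9.0.
for name, cap in {"L4": (8, 9), "A100": (8, 0), "H100": (9, 0)}.items():
    print(f"{name}: native FP8 = {supports_native_fp8(*cap)}")

cap = local_gpu_capability()
if cap is not None:
    print(f"local GPU {cap}: native FP8 = {supports_native_fp8(*cap)}")
```

Tuple comparison handles the major and minor versions together, so 9.0 passes while 8.6 and 8.0 do not.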

License

Apache 2.0, identical to the base model.

Acknowledgements

  • Mistral AI for releasing Voxtral-Mini-4B-Realtime-2602.
  • The vLLM project for the FP8 dynamic quantization pathway and the Mistral checkpoint loader.