Voxtral-Mini-4B-Realtime — FP8 Dynamic
Compressed variant of mistralai/Voxtral-Mini-4B-Realtime-2602
submitted to the Resilient AI Challenge (Audio-to-Text category, Mistral AI).
The compression is performed by vLLM at server startup, using the
engine's built-in `quantization: fp8` pathway. There are no offline
weight modifications: the inference engine loads the original Mistral
consolidated checkpoint published by Mistral AI and quantizes the
language-model Linear layers to FP8 in memory before serving.
Method
Quantization scheme
- FP8 (E4M3) dynamic quantization on the language model decoder Linear layers
- Weights: symmetric, per-tensor scales computed at load time (round-to-nearest from the bf16 master weights)
- Activations: symmetric, per-token, computed dynamically at every forward pass
- The audio encoder (`audio_tower.*`) and the multimodal projector (`multi_modal_projector.*`) are left in bf16. This is the same protection RedHat applies in their own FP8 variant of Voxtral-Mini-3B-2507. `lm_head` is also left in bf16 (the vLLM default for `quantization: fp8`).
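Numerically, the scheme above amounts to the following. This is a minimal sketch in plain PyTorch, not vLLM's actual kernels; it only assumes `torch >= 2.1` for the `float8_e4m3fn` dtype.

```python
# Minimal sketch of the FP8-dynamic scheme described above (plain PyTorch).
import torch

FP8_MAX = 448.0  # largest finite value of float8 E4M3 (torch.float8_e4m3fn)

def quantize_weight_per_tensor(w_bf16: torch.Tensor):
    """Symmetric per-tensor round-to-nearest: one scale per weight matrix,
    computed once at load time from the bf16 master weights."""
    scale = w_bf16.abs().max().float() / FP8_MAX
    w_fp8 = (w_bf16.float() / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return w_fp8, scale

def quantize_activation_per_token(x_bf16: torch.Tensor):
    """Symmetric per-token scales, recomputed dynamically at every forward pass."""
    scale = x_bf16.abs().amax(dim=-1, keepdim=True).float() / FP8_MAX
    x_fp8 = (x_bf16.float() / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return x_fp8, scale

# A Linear layer then computes roughly y = (x_fp8 @ w_fp8.T) * (x_scale * w_scale),
# with the matmul on FP8 tensor cores and the rescale in higher precision.
```

Per-tensor weight scales keep the load-time pass cheap, and the per-token activation scales are what makes the scheme "dynamic": no calibration dataset is needed.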
Why vLLM-native FP8 rather than offline llm-compressor
We tested both pathways. The offline route (llm-compressor 0.10 + transformers
nightly + HF safetensors with compressed-tensors quantization config) works
in our development environment but requires multiple monkey-patches to load on
vLLM 0.17.1 + transformers 4.57.6, the stable stack used for evaluation.
The vLLM-native path produces a mathematically equivalent FP8 dynamic quantization (RTN weights + dynamic activations) but loads cleanly on the unmodified evaluation stack. For Round 1 we prioritise robustness; if a more aggressive scheme is needed for Round 2 (AWQ W4A16, INT8 calibrated, KV-cache FP8), we will revisit this trade-off.
What is not changed
- Architecture: identical to `mistralai/Voxtral-Mini-4B-Realtime-2602` (same config, same number of layers, same heads, same audio encoder, same delay attention scheduling)
- Tokenizer: identical (Mistral tekken JSON, `tokenizer_mode: mistral`)
- `max_model_len`: 45 000 (kept as in the baseline; the organizers warned that decreasing it affects benchmark performance)
- No fine-tuning, no distillation, no architectural change
Results (internal measurements)
Hardware used for these numbers: NVIDIA RTX 4090 (24 GB), Ada Lovelace,
compute capability 8.9, vLLM 0.17.1, transformers 4.57.6,
CodeCarbon 2.8.4 in process tracking mode with
gpu_ids=[CUDA_VISIBLE_DEVICES] to isolate the run from other GPU users.
Quality is measured on all 13 Voxtral languages × 2 standard ASR corpora (FLEURS, Common Voice 17), with Whisper-style text normalization before WER/CER. Energy is measured over the full benchmark suite as a single continuous window.
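The scoring itself is the standard normalize-then-score pipeline. The sketch below illustrates it using the `openai-whisper` normalizers and `jiwer`; our actual harness may differ in plumbing details.

```python
# Sketch of the scoring: Whisper-style normalization, then WER
# (or CER for Japanese/Chinese). Assumes the openai-whisper and jiwer packages.
import jiwer
from whisper.normalizers import BasicTextNormalizer, EnglishTextNormalizer

def score(references: list[str], hypotheses: list[str], lang: str) -> float:
    normalize = EnglishTextNormalizer() if lang == "en" else BasicTextNormalizer()
    refs = [normalize(r) for r in references]
    hyps = [normalize(h) for h in hypotheses]
    if lang in ("ja", "zh"):          # references have no inter-word spaces
        return jiwer.cer(refs, hyps)
    return jiwer.wer(refs, hyps)
```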
Quality (per-language)
For Japanese and Chinese we report CER (WER is misleading because the references have no inter-word spaces).
| dataset | metric | base | variant | Δ |
|---|---|---|---|---|
| fleurs/en | WER | 9.20% | 8.46% | -8.1% |
| fleurs/fr | WER | 9.09% | 9.35% | +2.9% |
| fleurs/de | WER | 5.85% | 5.81% | -0.7% |
| fleurs/es | WER | 3.00% | 3.74% | +24.4% |
| fleurs/it | WER | 4.56% | 4.63% | +1.7% |
| fleurs/pt | WER | 4.46% | 4.50% | +0.9% |
| fleurs/nl | WER | 8.76% | 8.58% | -2.0% |
| fleurs/hi | WER | 17.74% | 18.31% | +3.2% |
| fleurs/ja | CER | 11.66% | 11.68% | +0.2% |
| fleurs/ko | WER | 13.96% | 14.03% | +0.5% |
| fleurs/zh | CER | 51.20% | 51.38% | +0.4% |
| fleurs/ar | WER | 16.05% | 17.09% | +6.5% |
| fleurs/ru | WER | 5.56% | 5.83% | +4.8% |
| cv/en | WER | 15.88% | 16.53% | +4.1% |
| cv/fr | WER | 10.96% | 10.87% | -0.9% |
| cv/de | WER | 9.63% | 9.19% | -4.6% |
| cv/es | WER | 6.72% | 6.72% | 0.0% |
| cv/it | WER | 6.06% | 6.27% | +3.4% |
| cv/pt | WER | 12.60% | 12.44% | -1.2% |
| cv/nl | WER | 10.60% | 10.60% | 0.0% |
| cv/hi | WER | 20.75% | 20.75% | 0.0% |
| cv/ja | CER | 22.60% | 20.72% | -8.3% |
| cv/ko | WER | 28.85% | 29.10% | +0.9% |
| cv/zh | CER | 27.50% | 28.70% | +4.4% |
| cv/ar | WER | 67.29% | 67.87% | +0.9% |
| cv/ru | WER | 8.18% | 7.94% | -2.9% |
Quality verdict: 26 of 26 language × corpus pairs pass both readings of the 80% threshold:
- Strict reading (`variant_err ≤ 1.25 × base_err`): 26 / 26 PASS
- Permissive reading (`(1 − variant_err) ≥ 0.80 × (1 − base_err)`): 26 / 26 PASS
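Both readings, spelled out as code, with the tightest pair from the table above as a sanity check:

```python
def passes_strict(base_err: float, variant_err: float) -> bool:
    # variant error may grow by at most 25% relative to the baseline error
    return variant_err <= 1.25 * base_err

def passes_permissive(base_err: float, variant_err: float) -> bool:
    # variant accuracy (1 - err) must keep at least 80% of baseline accuracy
    return (1 - variant_err) >= 0.80 * (1 - base_err)

# Tightest pair in the table above: fleurs/es, 3.00% -> 3.74%
assert passes_strict(0.0300, 0.0374)       # 0.0374 <= 0.0375
assert passes_permissive(0.0300, 0.0374)   # 0.9626 >= 0.7760
```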
The Hindi canary (the only Voxtral-supported Indic language, and our proxy for the Indic benchmarks added by the organizers) is stable on both corpora.
Verbosity (per-language)
We track the ratio of hypothesis characters to reference characters per language as an indicator of hallucinations and run-on transcriptions. Since the challenge ranks on absolute energy and a more verbose model costs more tokens to generate, this metric is part of our health check.
The variant's verbosity ratio stays within ±5% of baseline on every language × corpus pair. No drift, no hallucination.
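Concretely, the ratio is total hypothesis characters over total reference characters for each language × corpus pair (a trivial sketch of the health-check metric):

```python
def verbosity_ratio(hypotheses: list[str], references: list[str]) -> float:
    """Hypothesis-to-reference character ratio for one language x corpus pair.
    A ratio drifting upward versus the baseline model flags run-on or
    hallucinated transcriptions (and extra generated tokens, hence energy)."""
    return sum(len(h) for h in hypotheses) / sum(len(r) for r in references)
```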
Energy (full benchmark suite)
| metric | baseline (bf16) | variant (FP8) | ratio |
|---|---|---|---|
| Total energy (CodeCarbon) | 1074.2 Wh | 878.5 Wh | 0.818× |
| GPU energy only (NVML) | 473 Wh | 370 Wh | 0.782× |
| Total tokens generated (server-side, vLLM `generation_tokens_total`) | 292 039 | 296 617 | 1.016× |
| Wall-clock duration | 15 349 s | 12 976 s | 0.846× |
Tracked GPU: physical GPU 1 only (`CUDA_VISIBLE_DEVICES=1`), isolated via
CodeCarbon's `gpu_ids` parameter to avoid contamination from neighbouring
jobs on the same multi-GPU host.
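For reference, the tracker setup looks roughly like this (a sketch using CodeCarbon's `EmissionsTracker`; the benchmark driver name is a hypothetical stand-in):

```python
# Sketch of the energy-tracking setup described above.
from codecarbon import EmissionsTracker

tracker = EmissionsTracker(
    tracking_mode="process",   # attribute CPU/RAM energy to this process only
    gpu_ids=[1],               # same physical device as CUDA_VISIBLE_DEVICES=1
)
tracker.start()
run_full_benchmark_suite()     # hypothetical: the 26 language x corpus runs
tracker.stop()                 # totals land in CodeCarbon's emissions.csv
```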
Energy verdict: 18.2% total energy savings, 21.8% GPU-only savings, with the variant generating only 1.6% more tokens than baseline (well under the 10% threshold we set as a "verbosity drift" red flag).
The GPU-only ratio (0.782×) is the more relevant figure for the L4 evaluation hardware the organizers use: the L4 host CPU is much smaller than our Threadripper PRO 5995WX, so its share of total energy will be correspondingly smaller. We expect the L4 ratio to land closer to the GPU-only number than to the full-stack number.
How to serve this model
The submission contains only this README.md and vllm_config.yaml. The
weights themselves are pulled from the upstream Mistral repository at
load time.
Reproduce the evaluation
```bash
# Using the exact stack the organizers run:
# vllm 0.17.1, no transformers patches needed
vllm serve \
  mistralai/Voxtral-Mini-4B-Realtime-2602 \
  --config vllm_config.yaml
```
vLLM will:
- Download the original Mistral consolidated checkpoint (~8 GB) if not already cached.
- Read the `quantization: fp8` flag from the config.
- Load the language-model decoder's Linear weights, quantize them to FP8 (E4M3) in memory, and keep the audio encoder and `lm_head` in bf16.
- Compile the CUDA graphs and serve the OpenAI-compatible Realtime WebSocket endpoint at `/v1/realtime`.
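Once the server is up, a quick smoke test (assuming vLLM's default `http://localhost:8000`; adjust if `vllm_config.yaml` overrides host or port):

```python
# Check the served model name and the server-side token counter
# referenced in the energy table above.
import requests

BASE = "http://localhost:8000"  # assumption: default vLLM host/port

models = requests.get(f"{BASE}/v1/models").json()
print([m["id"] for m in models["data"]])

for line in requests.get(f"{BASE}/metrics").text.splitlines():
    if line.startswith("vllm:generation_tokens_total"):
        print(line)
```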
Total VRAM footprint at runtime: approximately 4.5 GB for the FP8
language-model decoder, plus the bf16 audio encoder and projector,
plus the KV cache up to `max_model_len=45000`. This fits comfortably
on a 24 GB L4 at `gpu_memory_utilization=0.80`.
Hardware compatibility
FP8 (E4M3) tensor operations require a GPU with compute capability ≥ 8.9 (Ada Lovelace) or ≥ 9.0 (Hopper). The NVIDIA L4 used for evaluation has compute capability 8.9 and supports the required kernels natively.
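A quick way to verify this on the target machine (plain PyTorch):

```python
import torch

# FP8 (E4M3) kernels need compute capability >= 8.9 (Ada) or >= 9.0 (Hopper).
major, minor = torch.cuda.get_device_capability(0)
print(f"compute capability {major}.{minor}, native FP8:", (major, minor) >= (8, 9))
```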
License
Apache 2.0, identical to the base model mistralai/Voxtral-Mini-4B-Realtime-2602.
Acknowledgements
- Mistral AI for releasing Voxtral-Mini-4B-Realtime-2602
- The vLLM project for the FP8 dynamic quantization pathway and the Mistral checkpoint loader
- RedHat for publishing the FP8 recipe of Voxtral-Mini-3B-2507, which validated the choice of leaving the audio encoder in bf16