
Voxtral-Mini-4B-Realtime — FP8 Dynamic

Compressed variant of mistralai/Voxtral-Mini-4B-Realtime-2602 submitted to the Resilient AI Challenge (Audio-to-Text category, Mistral AI).

The compression is performed by vLLM at server startup, using the engine's built-in FP8 pathway (the quantization: fp8 config flag). There are no offline weight modifications: the inference engine loads the original consolidated checkpoint published by Mistral AI and quantizes the language-model Linear layers to FP8 in memory before serving.

Method

Quantization scheme

  • FP8 (E4M3) dynamic quantization on the language model decoder Linear layers
  • Weights: symmetric, per-tensor scales computed at load time (round-to-nearest from the bf16 master weights)
  • Activations: symmetric, per-token, computed dynamically at every forward pass
  • The audio encoder (audio_tower.*) and the multimodal projector (multi_modal_projector.*) are left in bf16. This is the same protection RedHat applies in their own FP8 variant of Voxtral-Mini-3B-2507.
  • lm_head is also left in bf16 (vLLM default for quantization: fp8).
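A minimal numeric sketch of this scheme (illustrative only, not vLLM's kernels; rounding to integers here stands in for rounding to the E4M3 grid, and the function names are our own):

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite E4M3 magnitude

def quantize_per_tensor(w):
    # Weights: symmetric RTN with a single scale for the whole tensor,
    # computed once at load time from the bf16 master weights.
    scale = np.abs(w).max() / FP8_E4M3_MAX
    return np.round(w / scale), scale

def quantize_per_token(x):
    # Activations: symmetric, one scale per row (token), recomputed
    # dynamically at every forward pass.
    scale = np.abs(x).max(axis=-1, keepdims=True) / FP8_E4M3_MAX
    return np.round(x / scale), scale

w = np.random.randn(16, 16)
q, s = quantize_per_tensor(w)
dequant_err = np.abs(w - q * s).max()  # bounded by half a quantization step
```

Real E4M3 values lie on a non-uniform floating-point grid rather than on integers, but the scale bookkeeping is the same.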

Why vLLM-native FP8 rather than offline llm-compressor

We tested both pathways. The offline route (llm-compressor 0.10 + transformers nightly + HF safetensors with compressed-tensors quantization config) works in our development environment but requires multiple monkey-patches to load on vLLM 0.17.1 + transformers 4.57.6, the stable stack used for evaluation.

The vLLM-native path produces a mathematically equivalent FP8 dynamic quantization (RTN weights + dynamic activations) but loads cleanly on the unmodified evaluation stack. For Round 1 we prioritise robustness; if a more aggressive scheme is needed for Round 2 (AWQ W4A16, INT8 calibrated, KV-cache FP8), we will revisit this trade-off.

What is not changed

  • Architecture: identical to mistralai/Voxtral-Mini-4B-Realtime-2602 (same config, same number of layers, same heads, same audio encoder, same delay attention scheduling)
  • Tokenizer: identical (Mistral tekken JSON, tokenizer_mode: mistral)
  • max_model_len: 45 000 (kept as in the baseline; the organizers warned that decreasing it affects benchmark performance)
  • No fine-tuning, no distillation, no architectural change

Results (internal measurements)

Hardware used for these numbers: NVIDIA RTX 4090 (24 GB), Ada Lovelace, compute capability 8.9, vLLM 0.17.1, transformers 4.57.6, CodeCarbon 2.8.4 in process tracking mode with gpu_ids=[CUDA_VISIBLE_DEVICES] to isolate the run from other GPU users.

Quality is measured on all 13 Voxtral languages × 2 standard ASR corpora (FLEURS, Common Voice 17), with Whisper-style text normalization before WER/CER. Energy is measured over the full benchmark suite as a single continuous window.

Quality (per-language)

For Japanese and Chinese we report CER (WER is misleading because the references have no inter-word spaces).
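For reference, both metrics reduce to a Levenshtein distance over words (WER) or characters (CER), divided by reference length; the Whisper-style normalization applied beforehand is omitted from this sketch:

```python
def edit_distance(ref, hyp):
    # Single-row Levenshtein DP over two sequences (words or characters).
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[len(hyp)]

def wer(ref, hyp):
    ref_w, hyp_w = ref.split(), hyp.split()
    return edit_distance(ref_w, hyp_w) / len(ref_w)

def cer(ref, hyp):
    return edit_distance(ref, hyp) / len(ref)
```

CER sidesteps the tokenization question entirely, which is why it is the honest metric for unspaced scripts.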

| dataset | metric | base | variant | Δ |
|---|---|---|---|---|
| fleurs/en | WER | 9.20% | 8.46% | -8.1% |
| fleurs/fr | WER | 9.09% | 9.35% | +2.9% |
| fleurs/de | WER | 5.85% | 5.81% | -0.7% |
| fleurs/es | WER | 3.00% | 3.74% | +24.4% |
| fleurs/it | WER | 4.56% | 4.63% | +1.7% |
| fleurs/pt | WER | 4.46% | 4.50% | +0.9% |
| fleurs/nl | WER | 8.76% | 8.58% | -2.0% |
| fleurs/hi | WER | 17.74% | 18.31% | +3.2% |
| fleurs/ja | CER | 11.66% | 11.68% | +0.2% |
| fleurs/ko | WER | 13.96% | 14.03% | +0.5% |
| fleurs/zh | CER | 51.20% | 51.38% | +0.4% |
| fleurs/ar | WER | 16.05% | 17.09% | +6.5% |
| fleurs/ru | WER | 5.56% | 5.83% | +4.8% |
| cv/en | WER | 15.88% | 16.53% | +4.1% |
| cv/fr | WER | 10.96% | 10.87% | -0.9% |
| cv/de | WER | 9.63% | 9.19% | -4.6% |
| cv/es | WER | 6.72% | 6.72% | 0.0% |
| cv/it | WER | 6.06% | 6.27% | +3.4% |
| cv/pt | WER | 12.60% | 12.44% | -1.2% |
| cv/nl | WER | 10.60% | 10.60% | 0.0% |
| cv/hi | WER | 20.75% | 20.75% | 0.0% |
| cv/ja | CER | 22.60% | 20.72% | -8.3% |
| cv/ko | WER | 28.85% | 29.10% | +0.9% |
| cv/zh | CER | 27.50% | 28.70% | +4.4% |
| cv/ar | WER | 67.29% | 67.87% | +0.9% |
| cv/ru | WER | 8.18% | 7.94% | -2.9% |

Quality verdict: 26 of 26 language × corpus pairs pass both readings of the 80% threshold:

  • Strict reading (variant_err ≤ 1.25 × base_err): 26 / 26 PASS
  • Permissive reading ((1 − variant_err) ≥ 0.80 × (1 − base_err)): 26 / 26 PASS
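The two readings, spelled out (error rates as fractions; fleurs/es is the worst relative regression in the table and the tightest call):

```python
def strict_pass(base_err, var_err):
    # variant error may grow by at most 25% relative to baseline
    return var_err <= 1.25 * base_err

def permissive_pass(base_err, var_err):
    # variant accuracy must keep at least 80% of baseline accuracy
    return (1 - var_err) >= 0.80 * (1 - base_err)

# fleurs/es: base 3.00% -> variant 3.74% (the strict bound is 3.75%)
print(strict_pass(0.0300, 0.0374), permissive_pass(0.0300, 0.0374))  # → True True
```

At these error rates the strict reading is the binding one by a wide margin, since the permissive reading only compares accuracies that are both close to 1.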

The Hindi canary (the only Indic language Voxtral supports, and our proxy for the Indic benchmarks the organizers added) is stable on both corpora.

Verbosity (per-language)

We track the ratio of hypothesis characters to reference characters per language as an indicator of hallucinations and run-on transcriptions. Since the challenge ranks on absolute energy, a more verbose model pays for every extra token it generates, so this metric is part of our health check.

The variant's verbosity ratio stays within ±5% of baseline on every language × corpus pair. No drift, no hallucination.
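The health check itself is simple; a sketch (function names are ours):

```python
def verbosity_ratio(hyps, refs):
    # hypothesis characters per reference character, pooled over a corpus
    return sum(len(h) for h in hyps) / sum(len(r) for r in refs)

def within_drift(base_ratio, variant_ratio, tol=0.05):
    # flag the pair if the variant drifts more than ±5% from baseline
    return abs(variant_ratio / base_ratio - 1.0) <= tol
```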

Energy (full benchmark suite)

| metric | baseline (bf16) | variant (FP8) | ratio |
|---|---|---|---|
| Total energy (CodeCarbon) | 1074.2 Wh | 878.5 Wh | 0.818× |
| GPU energy only (NVML) | 473 Wh | 370 Wh | 0.782× |
| Total tokens generated (server-side, vLLM generation_tokens_total) | 292 039 | 296 617 | 1.016× |
| Wall-clock duration | 15 349 s | 12 976 s | 0.846× |

Tracked GPU: physical GPU 1 only (CUDA_VISIBLE_DEVICES=1), isolated via CodeCarbon's gpu_ids parameter to avoid contamination from neighbouring jobs on the same multi-GPU host.

Energy verdict: 18.2% total energy savings, 21.8% GPU-only savings, with the variant generating only 1.6% more tokens than baseline (well under the 10% threshold we set as a "verbosity drift" red flag).
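The verdict figures follow directly from the table; a quick check of the arithmetic:

```python
total_ratio = 878.5 / 1074.2      # 0.818x -> 18.2% total savings
gpu_ratio   = 370.0 / 473.0       # 0.782x -> 21.8% GPU-only savings
token_ratio = 296_617 / 292_039   # 1.016x -> +1.6% tokens, well under 10%
print(round((1 - total_ratio) * 100, 1),
      round((1 - gpu_ratio) * 100, 1),
      round((token_ratio - 1) * 100, 1))  # → 18.2 21.8 1.6
```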

The GPU-only ratio (0.782×) is the more relevant figure for the L4 evaluation hardware the organizers use: the L4 host CPU is much smaller than our Threadripper PRO 5995WX, so its share of total energy will be correspondingly smaller. We expect the L4 ratio to land closer to the GPU-only number than to the full-stack number.

How to serve this model

The submission contains only this README.md and vllm_config.yaml. The weights themselves are pulled from the upstream Mistral repository at load time.
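The config file is not inlined in this card; the sketch below is assembled only from the settings quoted elsewhere in it (quantization flag, tokenizer mode, context length, memory utilization). The vllm_config.yaml distributed with the submission is authoritative.

```yaml
# Reconstructed sketch of vllm_config.yaml — settings as quoted in this card
quantization: fp8
tokenizer_mode: mistral
max_model_len: 45000
gpu_memory_utilization: 0.80
```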

Reproduce the evaluation

```shell
# Using the exact stack the organizers run:
#   vLLM 0.17.1, no transformers patches needed
vllm serve \
    mistralai/Voxtral-Mini-4B-Realtime-2602 \
    --config vllm_config.yaml
```

vLLM will:

  1. Download the original Mistral consolidated checkpoint (~8 GB) if not already cached.
  2. Read the quantization: fp8 flag from the config.
  3. Load the language-model decoder's Linear weights, quantize them to FP8 (E4M3) in memory, and keep the audio encoder and lm_head in bf16.
  4. Compile the CUDA graphs and serve the OpenAI-compatible Realtime WebSocket endpoint at /v1/realtime.

Total VRAM footprint at runtime: approximately 4.5 GB for the FP8 language model decoder, plus the bf16 audio encoder and projector, plus the KV cache up to max_model_len=45000. This fits comfortably on an L4 16 GB at gpu_memory_utilization=0.80.
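A back-of-envelope check that this fits the stated budget (numbers are taken from this paragraph; the exact split between encoder, projector, and KV cache is not broken out in the card):

```python
l4_vram_gb     = 16.0
budget_gb      = 0.80 * l4_vram_gb   # vLLM caps itself at gpu_memory_utilization
fp8_decoder_gb = 4.5                 # FP8 decoder weights, as stated above
headroom_gb    = round(budget_gb - fp8_decoder_gb, 1)
print(headroom_gb)  # → 8.3  (left for bf16 encoder/projector + KV cache)
```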

Hardware compatibility

FP8 (E4M3) tensor operations require a GPU with compute capability ≥ 8.9 (Ada Lovelace) or ≥ 9.0 (Hopper). The NVIDIA L4 used for evaluation has compute capability 8.9 and supports the required kernels natively.

License

Apache 2.0, identical to the base model mistralai/Voxtral-Mini-4B-Realtime-2602.

Acknowledgements

  • Mistral AI for releasing Voxtral-Mini-4B-Realtime-2602
  • The vLLM project for the FP8 dynamic quantization pathway and the Mistral checkpoint loader
  • RedHat for publishing the FP8 recipe of Voxtral-Mini-3B-2507, which validated the choice of leaving the audio encoder in bf16