Voxtral-Mini-4B-Realtime, FP8 dynamic (Round 2)

Compressed variant of mistralai/Voxtral-Mini-4B-Realtime-2602 submitted to the Resilient AI Challenge, Audio-to-Text category.

The compression is performed by vLLM at server startup through the engine's built-in FP8 pathway, enabled with quantization: fp8 in the config. No weights are modified offline: the engine loads the original Mistral consolidated checkpoint and quantizes the language-model Linear layers to FP8 in memory before serving.

Launch command used by the organizers (as-is):

vllm serve --config vllm_config.yaml

Method

Quantization scheme

FP8 (E4M3) dynamic quantization on the language decoder Linear layers.
Weights: per-tensor scales computed at load time (round-to-nearest from the bf16 master weights).
Activations: per-token, computed dynamically at each forward pass.
The audio encoder (audio_tower.*) and the multimodal projector (multi_modal_projector.*) are left in bf16.
lm_head is left in bf16 (vLLM default for quantization: fp8).
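
As a rough numerical illustration of the scheme listed above, the sketch below simulates E4M3 rounding in NumPy with a single scale for the whole weight tensor and one scale per activation row (token). It is not vLLM's implementation, which runs in fused GPU kernels; the helper names (fake_e4m3, fp8_linear, and so on) and the simulated rounding are assumptions made for illustration only.

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3


def fake_e4m3(x):
    """Simulate round-to-nearest onto the E4M3 grid (illustration only)."""
    x = np.clip(np.asarray(x, dtype=np.float64), -E4M3_MAX, E4M3_MAX)
    # Binade exponent of each value, floored at -6 where E4M3 turns subnormal.
    exp = np.floor(np.log2(np.maximum(np.abs(x), 2.0 ** -6)))
    step = 2.0 ** (exp - 3)  # 3 mantissa bits give 8 steps per binade
    return np.round(x / step) * step


def quantize_weight_per_tensor(w):
    """One scale for the whole tensor, computed once (as at load time)."""
    scale = max(np.abs(w).max(), 1e-12) / E4M3_MAX
    return fake_e4m3(w / scale), scale


def quantize_activation_per_token(x):
    """One scale per row (token), recomputed on every forward pass."""
    scale = np.maximum(np.abs(x).max(axis=-1, keepdims=True), 1e-12) / E4M3_MAX
    return fake_e4m3(x / scale), scale


def fp8_linear(x, w):
    """Quantize both operands, multiply, then undo both scales."""
    wq, w_scale = quantize_weight_per_tensor(w)
    xq, x_scale = quantize_activation_per_token(x)
    return (xq @ wq.T) * x_scale * w_scale


rng = np.random.default_rng(0)
x = rng.standard_normal((4, 32))   # 4 tokens, hidden size 32 (toy shapes)
w = rng.standard_normal((16, 32))  # toy Linear weight
reference = x @ w.T
approx = fp8_linear(x, w)
rel_err = np.linalg.norm(approx - reference) / np.linalg.norm(reference)
print(f"relative error of the simulated FP8 matmul: {rel_err:.4f}")
```

The per-token activation scale is what makes the scheme "dynamic": it depends on the input and cannot be precomputed, whereas the weight scale is fixed once the checkpoint is loaded.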

What is not changed

Architecture: identical to the base model (same config, layers, heads, audio encoder, delay attention scheduling).
Tokenizer: identical (Mistral tekken, tokenizer_mode: mistral).
max_model_len: 45000, kept as in the baseline. The organizers warned that decreasing it affects benchmark performance.
No fine-tuning, no distillation, no architectural change.

Why FP8 dynamic and not a more aggressive scheme

I explored several compression paths for Round 2 before settling on FP8 dynamic. The full write-ups and logs are in exploration/. In short:

GPTQ W4A16: abandoned. llm-compressor's FX tracing of the Voxtral forward pass crashed inside the audio encoder before calibration could run.
AWQ W4A16: abandoned. Calibration completed, but the reassembled checkpoint failed to load on vLLM with a chain of missing encoder-config attributes that have no Voxtral equivalent.
KV cache FP8: blocked. It requires the FlashInfer backend, which is not available on the Voxtral attention path.
MXFP8: works and loads cleanly, but it is slower and ends up consuming more absolute energy than FP8 on the full suite (see the comparison below). Since the challenge ranks on absolute energy, FP8 is the better submission.

FP8 dynamic loads cleanly, passes the quality gate on all 26 language by corpus pairs, and gives the lowest absolute energy of the working methods.

Results (internal measurements)

Hardware and stack for these numbers: NVIDIA RTX 4090 (24 GB, compute capability 8.9), single GPU isolated via CUDA_VISIBLE_DEVICES=2, vLLM 0.22, CodeCarbon in process tracking mode with gpu_ids pinned to the visible device to isolate the run from other users on the shared host.

Quality is measured on all 13 Voxtral languages across 2 standard ASR corpora (FLEURS, Common Voice 17). For Japanese and Chinese I report CER, because WER is misleading on references with no inter-word spaces.
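
To make the WER versus CER point concrete, here is a minimal sketch of both metrics built on a plain Levenshtein distance. The function names and the toy sentences are illustrative assumptions, and the text normalization used in the actual evaluation is not shown here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (single-row DP)."""
    row = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        prev_diag, row[0] = row[0], i
        for j, h in enumerate(hyp, start=1):
            cur = min(
                row[j] + 1,             # deletion
                row[j - 1] + 1,         # insertion
                prev_diag + (r != h),   # substitution or match
            )
            prev_diag, row[j] = row[j], cur
    return row[len(hyp)]


def wer(reference, hypothesis):
    """Word error rate: edit distance over whitespace-separated tokens."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: edit distance over characters, spaces removed."""
    ref_chars = list(reference.replace(" ", ""))
    hyp_chars = list(hypothesis.replace(" ", ""))
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


# Toy Japanese example: no spaces, so the whole sentence is one "word".
ja_ref = "今日は晴れです"
ja_hyp = "今日は雨です"
print("WER:", wer(ja_ref, ja_hyp))
print("CER:", cer(ja_ref, ja_hyp))
```

With no inter-word spaces, any single mistake turns the whole sentence into one wrong "word", so WER saturates while CER still reflects how much of the transcript is correct.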

Quality (per-language, FP8 vs bf16 baseline)

dataset	metric	base	FP8	Δrel
fleurs/en	WER	8.90%	8.31%	-6.7%
fleurs/fr	WER	9.50%	9.39%	-1.2%
fleurs/de	WER	5.98%	5.47%	-8.6%
fleurs/es	WER	3.00%	3.58%	+19.2%
fleurs/it	WER	4.60%	4.71%	+2.5%
fleurs/pt	WER	4.41%	4.25%	-3.6%
fleurs/nl	WER	8.59%	9.03%	+5.2%
fleurs/hi	WER	17.72%	17.13%	-3.3%
fleurs/ja	CER	11.66%	11.60%	-0.5%
fleurs/ko	WER	13.96%	13.89%	-0.5%
fleurs/zh	CER	51.24%	51.34%	+0.2%
fleurs/ar	WER	16.05%	16.99%	+5.9%
fleurs/ru	WER	5.45%	5.61%	+2.9%
cv/en	WER	15.56%	15.88%	+2.1%
cv/fr	WER	10.96%	10.40%	-5.2%
cv/de	WER	8.64%	9.97%	+15.4%
cv/es	WER	6.51%	6.72%	+3.2%
cv/it	WER	6.47%	6.89%	+6.3%
cv/pt	WER	10.06%	10.73%	+6.6%
cv/nl	WER	10.94%	11.16%	+2.0%
cv/hi	WER	19.38%	20.04%	+3.4%
cv/ja	CER	25.37%	21.50%	-15.2%
cv/ko	WER	29.23%	29.23%	+0.0%
cv/zh	CER	26.97%	30.11%	+11.6%
cv/ar	WER	58.47%	60.49%	+3.5%
cv/ru	WER	8.29%	8.06%	-2.8%

Quality verdict: 26 of 26 language by corpus pairs pass both readings of the 80% threshold.

Strict reading (variant_err <= 1.25 * base_err): 26 / 26 PASS.
Permissive reading ((1 - variant_err) >= 0.80 * (1 - base_err)): 26 / 26 PASS.
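
The two readings can be written as a short check. The sketch below applies them to the three largest relative regressions in the table and to cv/ar, the pair with the highest error rate. The function names are illustrative, and the inputs are the rounded percentages from the table rather than the raw measurements.

```python
def passes_strict(base_err_pct, variant_err_pct):
    """Strict reading: variant error at most 1.25x the baseline error."""
    return variant_err_pct <= 1.25 * base_err_pct


def passes_permissive(base_err_pct, variant_err_pct):
    """Permissive reading: variant accuracy at least 80% of baseline accuracy."""
    base_acc = 1.0 - base_err_pct / 100.0
    variant_acc = 1.0 - variant_err_pct / 100.0
    return variant_acc >= 0.80 * base_acc


# (baseline error %, FP8 error %) copied from the quality table.
pairs = {
    "fleurs/es": (3.00, 3.58),
    "cv/de": (8.64, 9.97),
    "cv/zh": (26.97, 30.11),
    "cv/ar": (58.47, 60.49),
}

for name, (base, variant) in pairs.items():
    print(name, passes_strict(base, variant), passes_permissive(base, variant))
```

The strict reading is the binding one for low-error pairs: at 3.00% baseline error, the variant may not exceed 3.75%.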

Hindi, the only Voxtral-supported Indic language and my proxy for the Indic benchmarks added by the organizers, stays close to the baseline on both corpora (-3.3% relative on FLEURS, +3.4% on Common Voice).

Energy (full benchmark suite, 26 language by corpus pairs)

metric	baseline (bf16)	FP8	ratio
Total energy (CodeCarbon)	570.9 Wh	515.0 Wh	0.902x
GPU energy only (NVML)	530.3 Wh	469.6 Wh	0.885x
Generation tokens (server-side)	291,991	304,137	1.042x
Wall-clock duration	13,813 s	15,834 s	1.146x

Energy verdict: 9.8% total energy savings, 11.5% GPU-only savings, with the variant generating 4.2% more tokens than the baseline.
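
The ratios in the table can be recomputed from the rounded figures, and the energy can also be normalized by the number of generated tokens. The sketch below does both. The per-token normalization is an added illustration, not a metric used by the challenge, and because the inputs are rounded, the last digit of a recomputed ratio may differ slightly from the table.

```python
baseline = {"total_wh": 570.9, "gpu_wh": 530.3, "tokens": 291_991, "seconds": 13_813}
fp8 = {"total_wh": 515.0, "gpu_wh": 469.6, "tokens": 304_137, "seconds": 15_834}

# FP8 / baseline ratio for every row of the energy table.
ratios = {key: fp8[key] / baseline[key] for key in baseline}


def mwh_per_token(run):
    """Total energy per generated token, in milliwatt-hours."""
    return 1000.0 * run["total_wh"] / run["tokens"]


for key, value in ratios.items():
    print(f"{key:>9}: {value:.3f}x")
print(f"baseline: {mwh_per_token(baseline):.3f} mWh/token")
print(f"FP8:      {mwh_per_token(fp8):.3f} mWh/token")
```

Because FP8 generates slightly more tokens while using less energy, the per-token saving is larger than the absolute saving that the ranking uses.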

The GPU-only ratio (0.885x) is the more relevant figure for the L4 evaluation hardware: the L4 host CPU is much smaller than the Threadripper PRO 5995WX of my dev machine, so the CPU share of total energy will be smaller on the L4 and the ratio should land closer to the GPU-only number.

Comparison with MXFP8 (why it was not submitted)

I also ran a full MXFP8 evaluation on the same stack (vLLM 0.22, same suite, same protocol). MXFP8 quantizes more aggressively and has a smaller loaded footprint (5.26 GiB versus 8.43 GiB for bf16), and it passes the quality gate on all 26 pairs as well. But it loses on the criterion that decides the ranking, absolute energy:

metric (26 pairs)	baseline	FP8	MXFP8
Total energy	570.9 Wh	515.0 Wh	548.7 Wh
Savings vs baseline	reference	9.8%	3.9%
Wall-clock duration	3.84 h	4.40 h	5.83 h

MXFP8 consumes 6.5% more absolute energy than FP8 and is markedly slower (5.83 h versus 4.40 h). The Marlin MXFP8 GEMM kernel is slower than the native FP8 path, and the extra GPU time outweighs the more aggressive weight compression in terms of total energy. The smaller memory footprint does not affect the ranking, which is based on energy. For these reasons I submit FP8 and keep MXFP8 documented as an explored alternative in exploration/.

(The MXFP8 total above sums two measured runs on the same stack: the main run covering 24 language by corpus pairs plus a follow-up run covering Russian, which the main run had not yet evaluated.)
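
Since energy is average power multiplied by duration, the trade-off can be seen by deriving the average draw of each run from the table. This is a back-of-the-envelope sketch on the rounded figures, not an additional measurement.

```python
runs = {
    "baseline": {"energy_wh": 570.9, "hours": 3.84},
    "FP8": {"energy_wh": 515.0, "hours": 4.40},
    "MXFP8": {"energy_wh": 548.7, "hours": 5.83},
}

base_energy = runs["baseline"]["energy_wh"]
summary = {}
for name, run in runs.items():
    summary[name] = {
        # Average power over the run, in watts (Wh divided by h).
        "avg_power_w": run["energy_wh"] / run["hours"],
        # Fraction of the baseline energy that was saved.
        "savings": 1.0 - run["energy_wh"] / base_energy,
    }
    print(
        f"{name:>8}: {summary[name]['avg_power_w']:6.1f} W average, "
        f"{100 * summary[name]['savings']:4.1f}% saved"
    )

# Extra energy MXFP8 needs relative to FP8.
mxfp8_vs_fp8 = runs["MXFP8"]["energy_wh"] / runs["FP8"]["energy_wh"] - 1.0
print(f"MXFP8 vs FP8: +{100 * mxfp8_vs_fp8:.1f}% energy")
```

On these figures MXFP8 has the lowest average draw of the three runs, but its longer duration more than cancels that advantage.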

How to serve this model

The submission contains the original Mistral weights, vllm_config.yaml, and this README. The config:

model: mistralai/Voxtral-Mini-4B-Realtime-2602
tokenizer_mode: mistral
config_format: mistral
load_format: mistral
trust_remote_code: true
quantization: fp8
max_model_len: 45000
gpu_memory_utilization: 0.80
compilation_config:
  cudagraph_mode: PIECEWISE

vLLM reads quantization: fp8, loads the language decoder Linear weights and quantizes them to FP8 (E4M3) in memory, keeps the audio encoder and lm_head in bf16, then serves the OpenAI-compatible Realtime WebSocket endpoint.
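
For local testing, a minimal launcher sketch: it starts the same command the organizers use and waits for the server to report healthy. It assumes vLLM's default address (http://localhost:8000) and its /health route. The function names and timeouts are illustrative and not part of the submission.

```python
import subprocess
import time
import urllib.error
import urllib.request


def build_command(config_path="vllm_config.yaml"):
    """Same launch command as the one quoted in this README."""
    return ["vllm", "serve", "--config", config_path]


def wait_until_healthy(base_url="http://localhost:8000", timeout_s=600.0, poll_s=5.0):
    """Poll GET /health until it returns 200 or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not accepting connections yet
        time.sleep(poll_s)
    return False


def launch(config_path="vllm_config.yaml"):
    """Start the server and return the process once it is healthy."""
    server = subprocess.Popen(build_command(config_path))
    if not wait_until_healthy():
        server.terminate()
        raise RuntimeError("vLLM did not become healthy in time")
    return server
```

Calling launch() returns the server process once it is ready. The generous default timeout allows for weight loading and the in-memory FP8 quantization at startup.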

Known issues at launch

Two patches must be applied in the venv before serving Voxtral Realtime on this stack. They are not specific to FP8: the bf16 baseline needs them too. Both are in patches_required/, with shell scripts that apply them:

transformers/tokenization_mistral_common.py: widen the kwargs whitelist.
vllm/transformers_utils/processors/voxtral.py: add a fetch_audio method to MistralCommonFeatureExtractor.

Hardware compatibility

FP8 (E4M3) tensor operations require compute capability 8.9 (Ada Lovelace) or higher. The NVIDIA L4 used for evaluation is compute capability 8.9 and supports the required kernels natively.
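
A quick pre-flight check for this requirement might look like the sketch below. The pure comparison needs no GPU; the second helper assumes PyTorch is installed and uses torch.cuda.get_device_capability. The function names are illustrative.

```python
def supports_native_fp8(capability):
    """True if a (major, minor) compute capability is at least 8.9."""
    return tuple(capability) >= (8, 9)


def current_gpu_supports_fp8(device_index=0):
    """Query the local GPU; assumes PyTorch with CUDA is available."""
    import torch  # imported lazily so the pure check works without PyTorch

    return supports_native_fp8(torch.cuda.get_device_capability(device_index))


# Capabilities quoted in this README, plus an 8.0 device for contrast.
examples = {"RTX 4090": (8, 9), "L4": (8, 9), "8.0 device": (8, 0)}
for name, cap in examples.items():
    print(f"{name}: native FP8 = {supports_native_fp8(cap)}")
```

Comparing tuples handles both fields at once, so a 9.0 device passes while an 8.6 device does not.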

License

Apache 2.0, identical to the base model.

Acknowledgements

Mistral AI for releasing Voxtral-Mini-4B-Realtime-2602.
The vLLM project for the FP8 dynamic quantization pathway and the Mistral checkpoint loader.
