Voxtral Mini 4B Realtime: first optimization by Atos for the AI Resilient Challenge
This repository contains our first optimization pass for Voxtral Mini 4B Realtime. The objective was to reduce serving cost and energy use while keeping the model in a standard Hugging Face format that still works for vLLM deployment.
What changed
- We compressed only the language model branch, `model.language_model`, because that is where most decoding compute and memory traffic happens during transcription.
- We left the audio encoder, `model.audio_tower`, untouched to preserve the realtime audio path and avoid degrading streaming speech features.
- We also left the multimodal projector untouched so the audio-to-text bridge remains identical to the original model.
- We applied Pruna AWQ with `W4A16` on the language branch. This is the main compression step; it reduces weight precision to lower memory movement and serving energy during generation.
- We calibrated AWQ with the WikiText dataset using 64 samples and a sequence length of 512 so quantization stays grounded in real text activations.
- We added conservative Pruna `torch_unstructured` pruning with method `l1` at 5.00% sparsity on the same language branch. This is a secondary optimization kept intentionally small to avoid destabilizing the model.
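The `W4A16` scheme above keeps activations at 16-bit precision but stores weights as 4-bit integers with per-group scales and zero points. A minimal sketch of this kind of group quantization (illustrative only; Pruna's AWQ additionally rescales salient channels using calibration activations before quantizing):

```python
# Illustrative per-group asymmetric 4-bit weight quantization (the "W4" in
# W4A16). Not Pruna's actual implementation.

def quantize_group(weights, n_bits=4):
    """Quantize one group of float weights to n_bits unsigned integers."""
    qmax = 2 ** n_bits - 1                      # 15 for 4-bit
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / qmax or 1.0       # avoid zero scale
    zero = round(-w_min / scale)                # zero point maps 0.0 into range
    q = [max(0, min(qmax, round(w / scale) + zero)) for w in weights]
    return q, scale, zero

def dequantize_group(q, scale, zero):
    """Recover approximate float weights for compute at 16-bit ("A16")."""
    return [(qi - zero) * scale for qi in q]

weights = [0.31, -0.12, 0.05, -0.44, 0.27, 0.02, -0.08, 0.19]
q, scale, zero = quantize_group(weights)
w_hat = dequantize_group(q, scale, zero)
```

Each weight now needs 4 bits plus a small amortized overhead for the group's scale and zero point, instead of 16 bits.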
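The pruning step above can be pictured as zeroing the smallest-magnitude weights until the target sparsity is reached. A self-contained sketch of L1-magnitude unstructured pruning (illustrative; not Pruna's actual implementation, which operates on PyTorch tensors):

```python
# Sketch of L1-magnitude unstructured pruning at a fixed sparsity level,
# mirroring the torch_unstructured / l1 step described above.

def l1_unstructured_prune(weights, sparsity):
    """Zero out the `sparsity` fraction of weights with smallest |w|."""
    n_prune = int(len(weights) * sparsity)
    if n_prune == 0:
        return list(weights)
    # Threshold = magnitude of the n_prune-th smallest weight.
    threshold = sorted(abs(w) for w in weights)[n_prune - 1]
    pruned, removed = [], 0
    for w in weights:
        if abs(w) <= threshold and removed < n_prune:
            pruned.append(0.0)   # prune: smallest-magnitude weight
            removed += 1
        else:
            pruned.append(w)     # keep
    return pruned

weights = [0.9, -0.01, 0.5, 0.003, -0.7, 0.02, 0.4, -0.6, 0.1, -0.05,
           0.8, 0.007, -0.3, 0.6, -0.02, 0.2, 0.05, -0.9, 0.7, 0.04]
pruned = l1_unstructured_prune(weights, 0.05)   # 5% of 20 weights -> 1 zeroed
```

At 5.00% sparsity the effect is deliberately mild, which is why it is a safe secondary step on top of quantization.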
Energy rationale
The main energy saving comes from quantizing the language model weights, reducing memory footprint and bandwidth pressure during autoregressive decoding. That usually matters more than touching the audio stack, so we concentrated the compression where the serving cost is highest and kept the speech front-end unchanged for reliability (for now).
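To make this concrete, here is back-of-the-envelope arithmetic for weight bytes read per decoded token, assuming roughly 4B language-model parameters and that each weight is read once per token (real figures depend on batch size, KV cache, and kernel behavior):

```python
# Rough decode-time weight traffic per token for a ~4B-parameter LM branch.
# Illustrative arithmetic only; actual bandwidth depends on batching and kernels.

params = 4e9                 # approximate language-model parameter count
gb = 1e9

fp16_bytes = params * 2      # 16-bit weights: 2 bytes each
w4_bytes = params * 0.5      # 4-bit weights: 0.5 bytes each (ignoring the
                             # small per-group scale/zero-point overhead)

print(f"fp16 weights read per token: {fp16_bytes / gb:.1f} GB")
print(f"W4 weights read per token:   {w4_bytes / gb:.1f} GB")
print(f"reduction factor:            {fp16_bytes / w4_bytes:.0f}x")
```

Since autoregressive decoding is typically memory-bandwidth bound, a roughly 4x cut in weight traffic translates directly into lower energy per generated token.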
Usage
```shell
vllm serve --config vllm_config.yaml
```
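The repository's `vllm_config.yaml` is not reproduced here. As a hypothetical sketch, a vLLM config file expresses CLI flags as YAML keys, for example (values are illustrative, not the actual shipped config):

```yaml
# Hypothetical vLLM config sketch; the repository ships its own vllm_config.yaml,
# which may differ. Keys mirror vLLM CLI flags.
model: pavanperi/voxtral-optimized-atos
max-model-len: 4096
gpu-memory-utilization: 0.90
```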
Model tree for pavanperi/voxtral-optimized-atos
- Base model: mistralai/Ministral-3-3B-Base-2512
- Finetuned: mistralai/Voxtral-Mini-4B-Realtime-2602