Voxtral Mini 4B Realtime: first optimization by Atos for the AI Resilient Challenge
This repository contains our first optimization pass for Voxtral Mini 4B Realtime. The objective was to reduce serving cost and energy use while keeping the model in a standard Hugging Face format that still works for vLLM deployment.
What changed
- We compressed only the language model branch, `model.language_model`, because that is where most decoding compute and memory traffic happens during transcription.
- We left the audio encoder, `model.audio_tower`, untouched to preserve the realtime audio path and avoid degrading streaming speech features.
- We also left the multimodal projector untouched so the audio-to-text bridge remains identical to the original model.
- We applied Pruna AWQ with `W4A16` on the language branch. This is the main compression step; it reduces weight precision to lower memory movement and serving energy during generation.
- We calibrated AWQ with the WikiText dataset using 64 samples and a sequence length of 512 so quantization stays grounded in real text activations.
- We added conservative Pruna `torch_unstructured` pruning with method `l1` at 5.00% sparsity on the same language branch. This is a secondary optimization kept intentionally small to avoid destabilizing the model.
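The `W4A16` scheme above keeps activations at 16-bit precision but stores weights as 4-bit integers with per-group scales and zero points. A minimal sketch of this kind of group quantization (illustrative only; Pruna's AWQ additionally rescales salient channels using calibration activations before quantizing):

```python
# Illustrative per-group asymmetric 4-bit weight quantization (the "W4" in
# W4A16). Not Pruna's actual implementation.

def quantize_group(weights, n_bits=4):
    """Quantize one group of float weights to n_bits unsigned integers."""
    qmax = 2 ** n_bits - 1                      # 15 for 4-bit
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / qmax or 1.0       # avoid zero scale
    zero = round(-w_min / scale)                # zero point maps 0.0 into range
    q = [max(0, min(qmax, round(w / scale) + zero)) for w in weights]
    return q, scale, zero

def dequantize_group(q, scale, zero):
    """Recover approximate float weights for compute at 16-bit ("A16")."""
    return [(qi - zero) * scale for qi in q]

weights = [0.31, -0.12, 0.05, -0.44, 0.27, 0.02, -0.08, 0.19]
q, scale, zero = quantize_group(weights)
w_hat = dequantize_group(q, scale, zero)
```

Each weight now needs 4 bits plus a small amortized overhead for the group's scale and zero point, instead of 16 bits.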
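The pruning step above can be pictured as zeroing the smallest-magnitude weights until the target sparsity is reached. A self-contained sketch of L1-magnitude unstructured pruning (illustrative; not Pruna's actual implementation, which operates on PyTorch tensors):

```python
# Sketch of L1-magnitude unstructured pruning at a fixed sparsity level,
# mirroring the torch_unstructured / l1 step described above.

def l1_unstructured_prune(weights, sparsity):
    """Zero out the `sparsity` fraction of weights with smallest |w|."""
    n_prune = int(len(weights) * sparsity)
    if n_prune == 0:
        return list(weights)
    # Threshold = magnitude of the n_prune-th smallest weight.
    threshold = sorted(abs(w) for w in weights)[n_prune - 1]
    pruned, removed = [], 0
    for w in weights:
        if abs(w) <= threshold and removed < n_prune:
            pruned.append(0.0)   # prune: smallest-magnitude weight
            removed += 1
        else:
            pruned.append(w)     # keep
    return pruned

weights = [0.9, -0.01, 0.5, 0.003, -0.7, 0.02, 0.4, -0.6, 0.1, -0.05,
           0.8, 0.007, -0.3, 0.6, -0.02, 0.2, 0.05, -0.9, 0.7, 0.04]
pruned = l1_unstructured_prune(weights, 0.05)   # 5% of 20 weights -> 1 zeroed
```

At 5.00% sparsity the effect is deliberately mild, which is why it is a safe secondary step on top of quantization.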
Energy rationale
The main energy saving comes from quantizing the language model weights, reducing memory footprint and bandwidth pressure during autoregressive decoding. That usually matters more than touching the audio stack, so we concentrated the compression where the serving cost is highest and kept the speech front-end unchanged for reliability (for now).
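To make this concrete, here is back-of-the-envelope arithmetic for weight bytes read per decoded token, assuming roughly 4B language-model parameters and that each weight is read once per token (real figures depend on batch size, KV cache, and kernel behavior):

```python
# Rough decode-time weight traffic per token for a ~4B-parameter LM branch.
# Illustrative arithmetic only; actual bandwidth depends on batching and kernels.

params = 4e9                 # approximate language-model parameter count
gb = 1e9

fp16_bytes = params * 2      # 16-bit weights: 2 bytes each
w4_bytes = params * 0.5      # 4-bit weights: 0.5 bytes each (ignoring the
                             # small per-group scale/zero-point overhead)

print(f"fp16 weights read per token: {fp16_bytes / gb:.1f} GB")
print(f"W4 weights read per token:   {w4_bytes / gb:.1f} GB")
print(f"reduction factor:            {fp16_bytes / w4_bytes:.0f}x")
```

Since autoregressive decoding is typically memory-bandwidth bound, a roughly 4x cut in weight traffic translates directly into lower energy per generated token.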
Usage
```shell
vllm serve --config vllm_config.yaml
```
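The repository's `vllm_config.yaml` is not reproduced here. As a hypothetical sketch, a vLLM config file expresses CLI flags as YAML keys, for example (values are illustrative, not the actual shipped config):

```yaml
# Hypothetical vLLM config sketch; the repository ships its own vllm_config.yaml,
# which may differ. Keys mirror vLLM CLI flags.
model: pavanperi/voxtral-optimized-atos
max-model-len: 4096
gpu-memory-utilization: 0.90
```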
Model tree for pavanperi/voxtral-optimized-atos
- Base model: mistralai/Ministral-3-3B-Base-2512
- Finetuned: mistralai/Voxtral-Mini-4B-Realtime-2602