Abstract
We introduce Voxtral Realtime, a natively streaming automatic speech recognition model that matches offline transcription quality at sub-second latency. Unlike approaches that adapt offline models through chunking or sliding windows, Voxtral Realtime is trained end-to-end for streaming, with explicit alignment between audio and text streams. Our architecture builds on the Delayed Streams Modeling framework, introducing a new causal audio encoder and Ada RMS-Norm for improved delay conditioning. We scale pretraining to a large-scale dataset spanning 13 languages. At a delay of 480ms, Voxtral Realtime achieves performance on par with Whisper, the most widely deployed offline transcription system. We release the model weights under the Apache 2.0 license.
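The abstract mentions Ada RMS-Norm as the mechanism for conditioning the model on the chosen transcription delay. The paper excerpt does not give the formulation, but adaptive normalization layers typically predict a per-channel gain from a conditioning vector, in the spirit of adaptive LayerNorm. A minimal sketch under that assumption (class name, shapes, and the delay-embedding input are all hypothetical):

```python
import numpy as np

class AdaRMSNorm:
    """Hypothetical sketch of Ada RMS-Norm: RMS normalization whose
    per-channel scale is predicted from a conditioning vector (here,
    a delay embedding). The exact formulation in Voxtral Realtime is
    not specified in the abstract; this follows the common adaptive-
    normalization pattern."""

    def __init__(self, dim, cond_dim, eps=1e-6, seed=0):
        rng = np.random.default_rng(seed)
        self.eps = eps
        # Linear map from the conditioning vector to a per-channel gain.
        self.w = rng.normal(0.0, 0.02, size=(cond_dim, dim))
        # Bias initialized to ones so the gain starts near identity.
        self.b = np.ones(dim)

    def __call__(self, x, cond):
        # x: (..., dim) activations; cond: (cond_dim,) delay embedding.
        rms = np.sqrt(np.mean(x**2, axis=-1, keepdims=True) + self.eps)
        scale = cond @ self.w + self.b  # conditioning-dependent gain
        return (x / rms) * scale

# Usage: normalize two frames of hidden states under a delay code.
norm = AdaRMSNorm(dim=8, cond_dim=4)
x = np.random.default_rng(1).normal(size=(2, 8))
delay_embedding = np.array([1.0, 0.0, 0.0, 0.0])  # hypothetical one-hot delay bucket
y = norm(x, delay_embedding)
```

With a zero conditioning vector the layer reduces to plain RMSNorm with unit gain, so the delay embedding acts purely as a learned modulation on top of the standard normalization.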
Community
This is an automated message from Librarian Bot. The following papers, similar to this one, were recommended by the Semantic Scholar API:
- Streaming Speech Recognition with Decoder-Only Large Language Models and Latency Optimization (2026)
- MOSS-Audio-Tokenizer: Scaling Audio Tokenizers for Future Audio Foundation Models (2026)
- Beyond Fixed Frames: Dynamic Character-Aligned Speech Tokenization (2026)
- Qwen3-TTS Technical Report (2026)
- Stream-Voice-Anon: Enhancing Utility of Real-Time Speaker Anonymization via Neural Audio Codec and Language Models (2026)
- FastSLM: Hierarchical Frame Q-Former for Effective Speech Modality Adaptation (2026)
- dLLM-ASR: A Faster Diffusion LLM-based Framework for Speech Recognition (2026)
Models citing this paper: 1
Datasets citing this paper: 0