Voxtral-Mini-4B-Realtime-2602 (ExecuTorch CUDA)

Pre-exported ExecuTorch .pte files for Voxtral-Mini-4B-Realtime-2602, Mistral's ~4B-parameter streaming speech-to-text model.

This repository contains the CUDA (NVIDIA GPU) variant. See also: Metal (Apple GPU) variant.

Files

File                 Description
model.pte            Exported model (audio encoder + text decoder + token embedding)
aoti_cuda_blob.ptd   AOTInductor CUDA delegate data (compiled kernels + weights)
preprocessor.pte     Mel spectrogram preprocessor for audio-to-mel conversion

Export Configuration

Backend:           cuda (AOTInductor)
Dtype:             bf16
Mode:              streaming
Encoder quant:     4w (group_size=32, tile_packed_to_4d)
Decoder quant:     4w (group_size=32, tile_packed_to_4d)
Embedding quant:   8w
Max seq len:       4096
Delay tokens:      6 (480ms)
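The delay-token setting above maps directly to added streaming latency: "6 (480ms)" implies an 80 ms audio stride per decoded token (480 / 6). A minimal sketch of that arithmetic, assuming the per-token stride inferred from the config (the helper name is ours, not part of the Voxtral tooling):

```python
# Relate the streaming delay-token setting to added transcription latency.
# TOKEN_STRIDE_MS is inferred from the export config: 480 ms / 6 tokens.
TOKEN_STRIDE_MS = 80

def delay_latency_ms(delay_tokens: int, stride_ms: int = TOKEN_STRIDE_MS) -> int:
    """Extra latency introduced by the delay-token lookahead."""
    return delay_tokens * stride_ms

print(delay_latency_ms(6))  # 480, matching the value quoted in the config
```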

Usage

Prerequisites

Build the runner

make voxtral_realtime-cuda     # Linux

Windows (PowerShell):

cmake --workflow --preset llm-release-cuda
Push-Location examples/models/voxtral_realtime
cmake --workflow --preset voxtral-realtime-cuda
Pop-Location

Run inference

cmake-out/examples/models/voxtral_realtime/voxtral_realtime_runner \
    --model_path model.pte \
    --data_path aoti_cuda_blob.ptd \
    --tokenizer_path tekken.json \
    --preprocessor_path preprocessor.pte \
    --audio_path input.wav \
    --streaming

Windows (PowerShell):

.\cmake-out\examples\models\voxtral_realtime\Release\voxtral_realtime_runner.exe `
    --model_path model.pte `
    --data_path aoti_cuda_blob.ptd `
    --tokenizer_path tekken.json `
    --preprocessor_path preprocessor.pte `
    --audio_path input.wav `
    --streaming

Live microphone input

ffmpeg -f alsa -i default -ar 16000 -ac 1 -f f32le -nostats -loglevel error pipe:1 | \
  cmake-out/examples/models/voxtral_realtime/voxtral_realtime_runner \
    --model_path model.pte \
    --data_path aoti_cuda_blob.ptd \
    --tokenizer_path tekken.json \
    --preprocessor_path preprocessor.pte \
    --mic
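The ffmpeg pipeline above feeds the runner raw 32-bit float little-endian PCM at 16 kHz mono via stdin. A sketch for smoke-testing --mic without a microphone, generating a synthetic tone in that same format (the stdin format is inferred from the ffmpeg flags `-ar 16000 -ac 1 -f f32le`; the script name and tone parameters are our own):

```python
# Emit one second of a 440 Hz sine wave as raw f32le PCM at 16 kHz mono,
# matching the format the ffmpeg pipeline above produces, and write it
# to stdout so it can be piped into the runner's --mic mode.
import math
import struct
import sys

SAMPLE_RATE = 16_000  # matches ffmpeg's -ar 16000

def sine_f32le(freq_hz: float, seconds: float, rate: int = SAMPLE_RATE) -> bytes:
    """Return little-endian float32 samples of a sine tone at half amplitude."""
    n = int(seconds * rate)
    samples = (0.5 * math.sin(2 * math.pi * freq_hz * t / rate) for t in range(n))
    return struct.pack(f"<{n}f", *samples)

if __name__ == "__main__":
    sys.stdout.buffer.write(sine_f32le(440.0, 1.0))
```

Pipe its output into the runner in place of the ffmpeg command, e.g. `python gen_tone.py | cmake-out/examples/models/voxtral_realtime/voxtral_realtime_runner ... --mic` (`gen_tone.py` is a hypothetical filename).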

Notes

  • Input audio must be 16kHz mono WAV. Convert with: ffmpeg -i input.mp3 -ar 16000 -ac 1 output.wav
  • The --data_path flag pointing to aoti_cuda_blob.ptd is required when using the CUDA backend.
  • The model was trained with --delay-tokens 6. Other values may degrade accuracy.
  • For more details, see the Voxtral Realtime README.