# Voxtral-Mini-4B-Realtime-2602 – ExecuTorch CUDA

Pre-exported ExecuTorch `.pte` files for Voxtral-Mini-4B-Realtime-2602, Mistral's ~4B-parameter streaming speech-to-text model.

This repository contains the CUDA (NVIDIA GPU) variant. See also the Metal (Apple GPU) variant.
## Files

| File | Description |
|---|---|
| `model.pte` | Exported model (audio encoder + text decoder + token embedding) |
| `aoti_cuda_blob.ptd` | AOTInductor CUDA delegate data (compiled kernels + weights) |
| `preprocessor.pte` | Mel spectrogram preprocessor for audio-to-mel conversion |
## Export Configuration

- Backend: `cuda` (AOTInductor)
- Dtype: `bf16`
- Mode: streaming
- Encoder quant: `4w` (4-bit weight-only; `group_size=32`, `tile_packed_to_4d`)
- Decoder quant: `4w` (4-bit weight-only; `group_size=32`, `tile_packed_to_4d`)
- Embedding quant: `8w` (8-bit weight-only)
- Max seq len: 4096
- Delay tokens: 6 (480 ms)
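The 480 ms figure follows directly from the delay-token count: each streaming token corresponds to 80 ms of audio (the per-token duration implied by 6 tokens = 480 ms), so the added latency is a simple product:

```shell
# Latency implied by the delay-token setting, assuming 80 ms of audio
# per streaming token (derived from the 6 tokens = 480 ms figure above):
delay_tokens=6
ms_per_token=80
echo "$((delay_tokens * ms_per_token)) ms"   # prints "480 ms"
```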
## Usage

### Prerequisites

- ExecuTorch installed from source with CUDA support (see building from source)
- NVIDIA GPU with CUDA support
- `tekken.json` tokenizer from the original model
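Before building, it can save a confusing runner error to confirm that the exported artifacts and tokenizer are all in place. A minimal sketch (the filenames assume everything sits in the current working directory):

```shell
# Report which of the required files are present before running the
# commands below (paths are an assumption for a flat checkout):
for f in model.pte aoti_cuda_blob.ptd preprocessor.pte tekken.json; do
  if [ -f "$f" ]; then
    echo "ok: $f"
  else
    echo "missing: $f"
  fi
done
```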
### Build the runner

Linux:

```bash
make voxtral_realtime-cuda
```

Windows (PowerShell):

```powershell
cmake --workflow --preset llm-release-cuda
Push-Location examples/models/voxtral_realtime
cmake --workflow --preset voxtral-realtime-cuda
Pop-Location
```
### Run inference

```bash
cmake-out/examples/models/voxtral_realtime/voxtral_realtime_runner \
  --model_path model.pte \
  --data_path aoti_cuda_blob.ptd \
  --tokenizer_path tekken.json \
  --preprocessor_path preprocessor.pte \
  --audio_path input.wav \
  --streaming
```
Windows (PowerShell):

```powershell
.\cmake-out\examples\models\voxtral_realtime\Release\voxtral_realtime_runner.exe `
  --model_path model.pte `
  --data_path aoti_cuda_blob.ptd `
  --tokenizer_path tekken.json `
  --preprocessor_path preprocessor.pte `
  --audio_path input.wav `
  --streaming
```
### Live microphone input

```bash
ffmpeg -f alsa -i default -ar 16000 -ac 1 -f f32le -nostats -loglevel error pipe:1 | \
  cmake-out/examples/models/voxtral_realtime/voxtral_realtime_runner \
  --model_path model.pte \
  --data_path aoti_cuda_blob.ptd \
  --tokenizer_path tekken.json \
  --preprocessor_path preprocessor.pte \
  --mic
```
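The `-f alsa` capture input is Linux-specific; ffmpeg exposes different capture backends on other platforms. The device names below are assumptions and depend on your system configuration:

```shell
# macOS (AVFoundation; ":0" selects the default audio device; list
# available devices with: ffmpeg -f avfoundation -list_devices true -i ""):
ffmpeg -f avfoundation -i ":0" -ar 16000 -ac 1 -f f32le -nostats -loglevel error pipe:1

# Linux with PulseAudio instead of raw ALSA:
ffmpeg -f pulse -i default -ar 16000 -ac 1 -f f32le -nostats -loglevel error pipe:1
```

Pipe either command into the runner invocation shown above in place of the ALSA line.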
## Notes

- Input audio must be 16 kHz mono WAV. Convert with:
  `ffmpeg -i input.mp3 -ar 16000 -ac 1 output.wav`
- The `--data_path` flag pointing to `aoti_cuda_blob.ptd` is required for CUDA backends.
- The model was trained with `--delay-tokens 6`; other values may degrade accuracy.
- For more details, see the Voxtral Realtime README.
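To confirm that a converted file actually matches the expected format, ffprobe can report the relevant stream parameters (assumes ffmpeg/ffprobe is installed and `output.wav` exists):

```shell
# Should print sample_rate=16000 and channels=1 for a valid input file:
ffprobe -v error -select_streams a:0 \
  -show_entries stream=sample_rate,channels \
  -of default=noprint_wrappers=1 output.wav
```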
## Model tree for younghan-meta/Voxtral-Mini-4B-Realtime-2602-ExecuTorch-CUDA

- Base model: mistralai/Ministral-3-3B-Base-2512
- Finetuned: mistralai/Voxtral-Mini-4B-Realtime-2602