Voxtral-Mini-4B-Realtime-2602 (ExecuTorch CUDA)

Pre-exported ExecuTorch .pte files for Voxtral-Mini-4B-Realtime-2602, Mistral's ~4B-parameter streaming speech-to-text model.

This repository contains the CUDA (NVIDIA GPU) variant. See also: Metal (Apple GPU) variant.

Files

File                 Description
model.pte            Exported model (audio encoder + text decoder + token embedding)
aoti_cuda_blob.ptd   AOTInductor CUDA delegate data (compiled kernels + weights)
preprocessor.pte     Mel spectrogram preprocessor for audio-to-mel conversion

Export Configuration

Backend:           cuda (AOTInductor)
Dtype:             bf16
Mode:              streaming
Encoder quant:     4w (group_size=32, tile_packed_to_4d)
Decoder quant:     4w (group_size=32, tile_packed_to_4d)
Embedding quant:   8w
Max seq len:       4096
Delay tokens:      6 (480ms)
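The delay-token setting above maps directly to added streaming latency: "6 (480ms)" implies an 80 ms audio stride per decoded token (480 / 6). A minimal sketch of that arithmetic, assuming the per-token stride inferred from the config (the helper name is ours, not part of the Voxtral tooling):

```python
# Relate the streaming delay-token setting to added transcription latency.
# TOKEN_STRIDE_MS is inferred from the export config: 480 ms / 6 tokens.
TOKEN_STRIDE_MS = 80

def delay_latency_ms(delay_tokens: int, stride_ms: int = TOKEN_STRIDE_MS) -> int:
    """Extra latency introduced by the delay-token lookahead."""
    return delay_tokens * stride_ms

print(delay_latency_ms(6))  # 480, matching the value quoted in the config
```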

Usage

Prerequisites

Build the runner

make voxtral_realtime-cuda     # Linux

Windows (PowerShell):

cmake --workflow --preset llm-release-cuda
Push-Location examples/models/voxtral_realtime
cmake --workflow --preset voxtral-realtime-cuda
Pop-Location

Run inference

cmake-out/examples/models/voxtral_realtime/voxtral_realtime_runner \
    --model_path model.pte \
    --data_path aoti_cuda_blob.ptd \
    --tokenizer_path tekken.json \
    --preprocessor_path preprocessor.pte \
    --audio_path input.wav \
    --streaming

Windows (PowerShell):

.\cmake-out\examples\models\voxtral_realtime\Release\voxtral_realtime_runner.exe `
    --model_path model.pte `
    --data_path aoti_cuda_blob.ptd `
    --tokenizer_path tekken.json `
    --preprocessor_path preprocessor.pte `
    --audio_path input.wav `
    --streaming

Live microphone input

ffmpeg -f alsa -i default -ar 16000 -ac 1 -f f32le -nostats -loglevel error pipe:1 | \
  cmake-out/examples/models/voxtral_realtime/voxtral_realtime_runner \
    --model_path model.pte \
    --data_path aoti_cuda_blob.ptd \
    --tokenizer_path tekken.json \
    --preprocessor_path preprocessor.pte \
    --mic
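The ffmpeg pipeline above feeds the runner raw 32-bit float little-endian PCM at 16 kHz mono via stdin. A sketch for smoke-testing --mic without a microphone, generating a synthetic tone in that same format (the stdin format is inferred from the ffmpeg flags `-ar 16000 -ac 1 -f f32le`; the script name and tone parameters are our own):

```python
# Emit one second of a 440 Hz sine wave as raw f32le PCM at 16 kHz mono,
# matching the format the ffmpeg pipeline above produces, and write it
# to stdout so it can be piped into the runner's --mic mode.
import math
import struct
import sys

SAMPLE_RATE = 16_000  # matches ffmpeg's -ar 16000

def sine_f32le(freq_hz: float, seconds: float, rate: int = SAMPLE_RATE) -> bytes:
    """Return little-endian float32 samples of a sine tone at half amplitude."""
    n = int(seconds * rate)
    samples = (0.5 * math.sin(2 * math.pi * freq_hz * t / rate) for t in range(n))
    return struct.pack(f"<{n}f", *samples)

if __name__ == "__main__":
    sys.stdout.buffer.write(sine_f32le(440.0, 1.0))
```

Pipe its output into the runner in place of the ffmpeg command, e.g. `python gen_tone.py | cmake-out/examples/models/voxtral_realtime/voxtral_realtime_runner ... --mic` (`gen_tone.py` is a hypothetical filename).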

Notes

  • Input audio must be 16kHz mono WAV. Convert with: ffmpeg -i input.mp3 -ar 16000 -ac 1 output.wav
  • The --data_path flag pointing to aoti_cuda_blob.ptd is required when using the CUDA backend.
  • The model was trained with --delay-tokens 6. Other values may degrade accuracy.
  • For more details, see the Voxtral Realtime README.