# Voxtral-Mini-3B (ExecuTorch, XNNPACK, 8da4w)

This folder contains an ExecuTorch `.pte` export of [mistralai/Voxtral-Mini-3B-2507](https://huggingface.co/mistralai/Voxtral-Mini-3B-2507) for CPU inference via the XNNPACK backend, with post-training quantization enabled. Voxtral is a multimodal speech-language model that accepts audio and text inputs.

## Contents

- `model.pte`: ExecuTorch program for the quantized Voxtral model
- `voxtral_preprocessor.pte`: audio preprocessor (mel spectrogram extractor)

## Quantization

- `--qlinear 8da4w`: text decoder linear layers use 8-bit dynamic activations + 4-bit weights
- `--qlinear_encoder 8da4w`: audio encoder linear layers use 8-bit dynamic activations + 4-bit weights
- `--qembedding 4w`: embeddings use 4-bit weights

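To make the storage effect of 4-bit weights concrete, here is a rough back-of-the-envelope sketch in shell arithmetic. The helper name `weight_bytes` and the one-million-weight layer are illustrative assumptions, not values taken from this export, and the estimate ignores the scales and zero points that 4-bit schemes also store.

```shell
# Hypothetical helper (illustration only): bytes needed to store
# $1 weights at $2 bits each, ignoring quantization scales/zero points.
weight_bytes() {
  echo $(( $1 * $2 / 8 ))
}

# Example layer size is an assumption, not taken from Voxtral.
n_weights=1000000
echo "16-bit: $(weight_bytes "$n_weights" 16) bytes"
echo " 4-bit: $(weight_bytes "$n_weights" 4) bytes"
```

Under these assumptions the 4-bit layout takes a quarter of the 16-bit size. The 8-bit activations are quantized dynamically at run time, so they do not add to the size of the `.pte` file.
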
## Export model
```
pip install mistral_common

optimum-cli export executorch \
  --model "mistralai/Voxtral-Mini-3B-2507" \
  --task "multimodal-text-to-text" \
  --recipe "xnnpack" \
  --use_custom_sdpa \
  --use_custom_kv_cache \
  --max_seq_len 2048 \
  --qlinear 8da4w \
  --qlinear_encoder 8da4w \
  --qembedding 4w \
  --output_dir="voxtral"
```
## Export audio preprocessor (supports up to 5 min / 300s audio)
```
python -m executorch.extension.audio.mel_spectrogram \
  --feature_size 128 \
  --stack_output \
  --max_audio_len 300 \
  --output_file voxtral_preprocessor.pte
```
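The 300-second limit can be turned into a sample budget for checking a clip before it is passed to the preprocessor. The sketch below assumes 16 kHz input audio, which this README does not state; `max_samples` and `fits_preprocessor` are hypothetical helper names.

```shell
MAX_AUDIO_LEN_S=300     # matches --max_audio_len in the export command
SAMPLE_RATE_HZ=16000    # assumption: 16 kHz input, not stated in this README

# Hypothetical helper: number of samples in $1 seconds at $2 Hz.
max_samples() {
  echo $(( $1 * $2 ))
}

# Hypothetical helper: succeeds if a clip of $1 samples fits the limit.
fits_preprocessor() {
  [ "$1" -le "$(max_samples "$MAX_AUDIO_LEN_S" "$SAMPLE_RATE_HZ")" ]
}

echo "sample budget: $(max_samples "$MAX_AUDIO_LEN_S" "$SAMPLE_RATE_HZ")"
```

If the clip uses a different sample rate, change `SAMPLE_RATE_HZ` accordingly; the 300-second bound itself comes from the export command.
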
## Run
Download the tokenizer:
```
curl -L https://huggingface.co/mistralai/Voxtral-Mini-3B-2507/resolve/main/tekken.json --output tekken.json
```
Build the runner from the ExecuTorch repo root:
```
make voxtral-cpu
```
Run the model:
```
./cmake-out/examples/models/voxtral/voxtral_runner \
  --model_path "model.pte" \
  --tokenizer_path "tekken.json" \
  --prompt "What can you tell me about this audio?" \
  --audio_path "audio.wav" \
  --processor_path "voxtral_preprocessor.pte" \
  --temperature 0
```
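The run command reads four files, so a missing tokenizer or preprocessor only surfaces once the runner starts. Here is a minimal sketch of a preflight check, assuming the file names used in this README; `check_inputs` is a hypothetical helper, not part of ExecuTorch.

```shell
# Hypothetical helper (not part of ExecuTorch): report every argument
# that is not an existing, non-empty file; return 1 if any is missing.
check_inputs() {
  missing=0
  for f in "$@"; do
    if [ ! -s "$f" ]; then
      echo "missing or empty: $f" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

One way to use it is to call `check_inputs model.pte voxtral_preprocessor.pte tekken.json audio.wav` and chain the runner invocation after it with `&&`, so the runner starts only when all four files are present.
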