Minivoxtral-3-14B-Reasoning-2512_ASR
A tri-modal (text + vision + audio) model built by grafting a Whisper audio encoder onto Mistral's Ministral-3-14B-Reasoning vision-language model.
This is an experimental architecture fusion that extends a vision-language reasoning model with audio understanding capabilities. The audio encoder is extracted from Voxtral-Mini-3B-2507, and a new audio projector was trained to map Whisper encoder outputs into the 14B model's representation space; the projector's weights were initialized from the Voxtral-Small-24B projector rather than randomly.
This release contains the W4A16 GPTQ 4-bit quantized weights only. The full bf16 weights may be released in the future depending on community interest and compatibility support.
Note: This model requires a custom vLLM plugin to serve, as the tri-modal architecture is not natively supported by any existing framework.
Model Details
| Property | Value |
|---|---|
| Base model | mistralai/Ministral-3-14B-Reasoning-2512 |
| Architecture | MinivoxtralForConditionalGeneration (custom) |
| Parameters | ≈14.6B total (13.9B LM + 637M audio encoder + 52.4M audio projector) |
| Modalities | Text, Vision (Pixtral), Audio (Whisper) |
| Precision | W4A16 GPTQ 4-bit (bf16 vision/audio components) |
| Context length | 262,144 tokens (text) |
| Audio input | 16kHz mono, up to 30s chunks |
| License | Apache 2.0 |
Architecture
Minivoxtral is a weight-level fusion of three model families into a single tri-modal architecture:
Text + Vision Backbone
The base model is Ministral-3-14B-Reasoning-2512, a Mistral-family model with an integrated Pixtral vision tower. This provides the language modeling backbone (40 transformer layers, 5120 hidden dim, 32 attention heads) and vision capabilities (24-layer ViT with 1024 hidden dim, patch size 14) out of the box.
Audio Encoder (Grafted)
The Whisper-style audio encoder is extracted from Voxtral-Mini-3B-2507. It has 32 transformer layers, 1280 hidden dim, 20 attention heads, and 128 mel bins. The encoder weights (≈637M parameters) are used as-is with no modification.
Audio Projector (Trained)
The audio projector is a 2-layer MLP (linear -> GELU -> linear, no bias, 5120 -> 5120, ≈52.4M parameters) that maps packed audio encoder outputs into the 14B model's hidden space. It was initialized from the Voxtral-Small-24B projector weights and then trained against the frozen 14B backbone on LibriSpeech.
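The projector described above can be sketched shape-for-shape in a few lines. This is an illustrative numpy stand-in, not the released weights: the GELU is the standard tanh approximation, the demo runs at a scaled-down width (d=64 in place of 5120) to keep memory trivial, and only the parameter count is computed at the real size.

```python
import numpy as np

def gelu(x):
    # Tanh approximation of GELU.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x ** 3)))

def audio_projector(packed, W1, W2):
    # linear -> GELU -> linear, both layers bias-free.
    return gelu(packed @ W1.T) @ W2.T

# Parameter count at the real size: two bias-free 5120x5120 linear layers.
HIDDEN = 5120
n_params = 2 * HIDDEN * HIDDEN
print(n_params)  # 52428800, i.e. the ~52.4M quoted above

# Scaled-down demo: d=64 stands in for 5120, weights are random.
rng = np.random.default_rng(0)
d = 64
W1, W2 = rng.normal(0, 0.02, (d, d)), rng.normal(0, 0.02, (d, d))
out = audio_projector(rng.normal(size=(375, d)), W1, W2)
print(out.shape)  # (375, 64)
```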
Audio Processing Chain
Audio (16kHz mono)
-> Whisper Feature Extractor (128 mel bins, 3000 frames)
-> Whisper Encoder (1280-dim output, 1500 frames)
-> 4x Frame Packing (reshape to 375 frames x 5120-dim)
-> Audio Projector MLP (5120 -> 5120)
-> LM Backbone (5120-dim hidden space)
The 4x frame packing is a fixed reshape operation (not learned) that concatenates 4 consecutive encoder frames to match the LM hidden dimension: 1280 * 4 = 5120.
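Because the packing is a pure reshape, it can be verified directly. A minimal sketch with a synthetic encoder output standing in for real Whisper features:

```python
import numpy as np

# Synthetic Whisper encoder output for a 30 s chunk: 1500 frames x 1280 dims.
enc = np.arange(1500 * 1280, dtype=np.float32).reshape(1500, 1280)

# 4x frame packing: concatenate every 4 consecutive frames (no learned
# weights), yielding 375 frames x 5120 dims to match the LM hidden size.
packed = enc.reshape(1500 // 4, 4 * 1280)

print(packed.shape)  # (375, 5120)
# The first packed frame is encoder frames 0..3 laid end to end:
print(np.array_equal(packed[0], enc[:4].ravel()))  # True
```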
Weight Organization
The released GPTQ model is organized into 4 safetensors shards (≈10.7 GB total). The GPTQ-quantized LM backbone occupies shards 1-3, while shard 4 contains all unquantized bf16 components: the Pixtral vision tower (218 tensors), vision projector (4 tensors), Whisper audio tower (487 tensors), and audio projector (2 tensors).
Key namespace prefixes: model.layers.* (LM), vision_tower.* (Pixtral), multi_modal_projector.* (vision projector), audio_tower.* (Whisper encoder), audio_multi_modal_projector.* (audio projector).
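A loader that consumes this mixed checkpoint has to dispatch each tensor by its prefix. The routing table below mirrors the namespaces listed above; the function itself is a hypothetical sketch of the dispatch logic, not the plugin's actual loader.

```python
# Prefixes are taken from the checkpoint layout above; module labels are
# illustrative names for where each tensor group would be loaded.
PREFIX_TO_MODULE = {
    "model.layers.": "language_model",
    "vision_tower.": "vision_tower",
    "multi_modal_projector.": "vision_projector",
    "audio_tower.": "audio_encoder",
    "audio_multi_modal_projector.": "audio_projector",
}

def route(tensor_name: str) -> str:
    # Return the destination module for a checkpoint tensor name.
    for prefix, module in PREFIX_TO_MODULE.items():
        if tensor_name.startswith(prefix):
            return module
    return "unmapped"

print(route("audio_multi_modal_projector.linear_1.weight"))  # audio_projector
print(route("model.layers.0.self_attn.q_proj.weight"))       # language_model
```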
Training
What Was Trained
Only the audio projector MLP was trained (≈52.4M parameters). The 14B language model backbone, Pixtral vision tower, vision projector, and Whisper audio encoder were all kept frozen.
Training Data
The projector was trained on LibriSpeech (≈280k utterances) for audio-to-text transcription alignment. The training objective was to teach the projector to map Whisper encoder outputs into the 14B's representation space so that the model produces correct transcriptions.
Training Configuration
| Setting | Value |
|---|---|
| Optimizer | AdamW |
| Learning rate | 1e-4 |
| LR schedule | Warmup (100 steps) + cosine decay |
| Weight decay | 0.01 |
| Batch size (per GPU) | 2 |
| Gradient accumulation | 4 steps |
| Effective batch size | 16 |
| Max gradient norm | 1.0 |
| GPUs | 2x RTX 3090 Ti (data parallel) |
| Backbone | GPTQ 4-bit (frozen, for memory efficiency) |
| Steps completed | 4,000 / 35,400 (≈11%) |
| Training time | ≈9 hours 40 minutes |
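The LR schedule in the table (100-step linear warmup, then cosine decay over the planned 35,400 steps) can be written as a small closed-form function. A sketch, assuming decay to zero at the final planned step:

```python
import math

def lr_at(step: int, base_lr: float = 1e-4, warmup: int = 100,
          total: int = 35_400) -> float:
    # Linear warmup for the first 100 steps, then cosine decay to 0,
    # matching the configuration table above.
    if step < warmup:
        return base_lr * step / warmup
    progress = (step - warmup) / (total - warmup)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

print(lr_at(50))   # 5e-05, halfway through warmup
print(lr_at(100))  # 0.0001, the peak learning rate
```

Note the effective batch size in the table is consistent: 2 per GPU x 4 accumulation steps x 2 GPUs = 16.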
Training Results
Training was stopped early at step 4,000 because validation loss had plateaued since around step 2,500 (hovering around 0.020-0.022). At the final checkpoint: training loss was 0.058 (down from 3.11 at step 1), validation loss was 0.021, and perplexity was 1.020.
The trained projector achieved 0.49% WER on LibriSpeech dev.clean (20-sample evaluation), with 18/20 perfect transcriptions. The two errors were minor proper noun issues.
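For reference, WER here is the word-level Levenshtein distance divided by the reference length. A minimal pure-Python implementation (no normalization or casing rules, which real ASR scoring pipelines usually add):

```python
def wer(reference: str, hypothesis: str) -> float:
    # Word error rate: word-level edit distance / number of reference words.
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[-1][-1] / len(ref)

print(wer("he went to the market", "he went to the market"))          # 0.0
print(wer("mister quilter is the apostle", "mister culter is the apostle"))  # 0.2
```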
Key Finding: Projector Scale Convergence
The trained projector's output standard deviation (≈2.52) naturally converged near the Voxtral-24B projector's native scale (≈2.77), which is roughly 500x larger than text embedding scale (≈0.005). This confirmed that the correct operating range for audio projectors in this architecture is at a much higher magnitude than text embeddings, and that naive scalar calibration to text embedding scale is counterproductive.
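The magnitude gap above is easy to illustrate numerically. The standard deviations below are synthetic stand-ins for the measured values (≈2.52 for projector output, ≈0.005 for text embeddings), not re-derived from the released weights:

```python
import numpy as np

rng = np.random.default_rng(0)
audio_feats = rng.normal(0.0, 2.52, size=(375, 5120))   # projector-output scale
text_embeds = rng.normal(0.0, 0.005, size=(128, 5120))  # text-embedding scale

ratio = audio_feats.std() / text_embeds.std()
print(round(float(ratio)))  # on the order of 500x, matching the finding above
```

Rescaling audio features down by this factor to "match" text embeddings is exactly the naive calibration the finding warns against.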
Evaluation
The evaluation suite is a practical quality snapshot rather than a benchmark-chasing exercise. All tests use frozen prompts, deterministic fixtures, or fixed dataset row indices for reproducibility.
Test Hardware
| Component | Value |
|---|---|
| Serving | vLLM v0.1.dev1 |
| Python | 3.14.3t (GIL disabled) |
| CPU | i9-10900X |
| GPU1 | RTX 3090 Ti |
| GPU2 | RTX 3090 Ti |
| GPU3 | RTX A4000 |
| KV Cache | 16,384 tokens, FP16 |
| VRAM usage | 14.1 GB (max_num_seqs=1) |
Results (3-Run Aggregate)
| Suite | Metrics | Score (mean, min-max) | Notes |
|---|---|---|---|
| Text | EM, parse | EM 0.783 (0.750-0.800), parse 1.000 | Good text baseline, stable formatting |
| Vision | EM, parse | EM 0.562 (0.562-0.562), parse 1.000 | Moderate visual grounding |
| Audio ASR | WER, perfect rate | WER 0.0279, perfect 0.750 | Strong transcription |
| Audio QA | EM, alias F1 | EM 0.000, F1 0.135 | Major weakness (see Limitations) |
| Multimodal | EM, parse | EM 0.583 (0.583-0.583), parse 1.000 | Moderate image+reasoning |
| Tools | strict order, all-pass | order 0.800, all-pass 0.800 | Generally capable |
| Perf | throughput, p95 | 25.63 rps, p95 0.176s | Stable serving |
Test Battery
Each run uses 263 total requests across: text (20), vision (16), audio ASR (48), audio QA (32), multimodal (12), tools (15), and perf (120).
Test sources: text uses frozen prompt lists with deterministic and JSON-reasoning formats; vision and multimodal use generated deterministic image fixtures (checkerboards, dots, text, color grids, geometric shapes); audio ASR uses LibriSpeech validation clean (frozen row indices); audio QA uses spoken TriviaQA and GSM8K speech (frozen row indices); tools use a deterministic mock tool server with strict scoring.
Limitations
Spoken QA is the largest quality gap. The model achieves 0% exact match on spoken question-answering tasks (TriviaQA/GSM8K speech). This is expected because the projector was only trained on transcription (LibriSpeech), not on instruction-following or reasoning over spoken content. The model tends to repeat input rather than producing answers on math reasoning tasks from speech.
Vision and multimodal reasoning are moderate, reflecting the base Ministral-3-14B's capabilities rather than limitations introduced by the fusion.
Audio projector training was narrow. Only ≈52M parameters were trained on transcription data. Fine-tuning on spoken QA datasets (e.g., GSM8K speech with proper answer supervision) would likely improve semantic understanding from audio.
No native framework support. The tri-modal architecture is not recognized by transformers or vLLM out of the box. A custom vLLM plugin is required for serving (see below).
Multilingual audio is untested. The projector was trained on English (LibriSpeech), but the underlying Whisper encoder and Mistral tokenizer both support multilingual content. In theory, if the projector's learned 4x frame packing mapping is accurate, multilingual audio could work, but this has not been validated.
Serving with vLLM
This model requires a custom vLLM plugin that registers the MinivoxtralForConditionalGeneration architecture. The plugin handles:
- Loading mixed checkpoint prefixes (vision/audio/text) into vLLM modules
- Whisper q/k/v mapping and audio MLP projection to text hidden space
- Expanding multimodal prompt placeholders into correct token spans
- Image and audio multimodal preprocessing for OpenAI-compatible API
- Reasoning tag parsing ([THINK]...[/THINK]) for HF tokenizer mode
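The reasoning-tag parsing in the last bullet can be sketched with a regex. The tag names come from the plugin description above; the function itself is a hypothetical illustration, not the plugin's parser:

```python
import re

THINK_RE = re.compile(r"\[THINK\](.*?)\[/THINK\]", re.DOTALL)

def split_reasoning(text: str) -> tuple[str, str]:
    # Returns (reasoning, visible answer) with the [THINK] span stripped out.
    m = THINK_RE.search(text)
    if not m:
        return "", text
    answer = (text[: m.start()] + text[m.end():]).strip()
    return m.group(1).strip(), answer

reasoning, answer = split_reasoning("[THINK]4 x 4 = 16[/THINK]The answer is 16.")
print(reasoning)  # 4 x 4 = 16
print(answer)     # The answer is 16.
```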
Special Token IDs
| Token | ID | Purpose |
|---|---|---|
| BOS | 1 | Beginning of sequence |
| EOS | 2 | End of sequence |
| PAD | 11 | Padding |
| IMAGE | 10 | Image placeholder |
| AUDIO | 24 | Audio placeholder |
| BEGIN_AUDIO | 25 | Audio sequence start marker |
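Placeholder expansion (the third plugin bullet above) can be illustrated with these IDs. This is a hypothetical sketch, assuming one AUDIO placeholder expands to BEGIN_AUDIO followed by one AUDIO token per projected frame (375 for a 30 s chunk); the actual plugin logic may differ:

```python
BOS, AUDIO, BEGIN_AUDIO = 1, 24, 25  # IDs from the table above

def expand_audio_placeholder(token_ids: list[int], num_frames: int) -> list[int]:
    # Replace each single AUDIO placeholder with a BEGIN_AUDIO marker plus
    # one AUDIO slot per projected audio frame.
    out = []
    for t in token_ids:
        if t == AUDIO:
            out.append(BEGIN_AUDIO)
            out.extend([AUDIO] * num_frames)
        else:
            out.append(t)
    return out

prompt = [BOS, 100, AUDIO, 200]  # 100 and 200 are stand-in text token IDs
expanded = expand_audio_placeholder(prompt, num_frames=375)
print(len(expanded))  # 379: BOS + text + BEGIN_AUDIO + 375 audio slots + text
```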
How This Was Built
Why Not Just Graft the 24B Projector?
The Voxtral-Small-24B projector is dimensionally compatible with the 14B model (both use 5120 hidden_size), but produces incoherent outputs when grafted directly. Despite matching dimensions, the 24B projector was trained to map audio to the 24B's internal representation manifold, not the 14B's. Multiple calibration strategies were tested (embed_tokens scale, post-RMSNorm scale, uncalibrated) and all failed to produce correct outputs. This led to training a new projector from scratch.
The 24B Weights Were Still Useful
Initializing the projector from Voxtral-24B weights gave a lower initial training loss (4.89 vs 9.45 with Xavier initialization). While the 24B mapping targets the wrong model's space, it encodes a general audio-to-5120-dim mapping that provides a useful starting point.
Weights & Quantization
This release provides W4A16 GPTQ 4-bit quantized weights only. The LM backbone is quantized with GPTQ (group_size=64, symmetric, 4-bit) while vision and audio components remain in bf16. Total model size is ≈10.7 GB.
The quantized model uses a flat weight namespace (model.layers.*) for the LM backbone, combined with the standard namespace for vision/audio components (vision_tower.*, audio_tower.*, etc.). The custom vLLM plugin handles this mixed namespace automatically.
The full bf16 weights (≈27 GB) may be released separately in the future.
Citation
If you use this model or find the tri-modal fusion approach useful, please cite this repository:
@misc{minivoxtral2026,
title={Minivoxtral-3-14B-Reasoning-2512_ASR: Tri-modal fusion of Ministral-3-14B with Whisper audio encoder},
author={hascrack},
year={2025},
url={https://huggingface.co/hascrack/Minivoxtral-3-14B-Reasoning-2512_ASR}
}
Acknowledgments
This model builds on the work of Mistral AI (Ministral-3-14B-Reasoning, Voxtral-Mini-3B, Voxtral-Small-24B) and OpenAI (Whisper architecture).