Minivoxtral-3-14B-Reasoning-2512_ASR
A tri-modal (text + vision + audio) model built by grafting a Whisper audio encoder onto Mistral's Ministral-3-14B-Reasoning vision-language model.
This is an experimental architecture fusion that extends a vision-language reasoning model with audio understanding capabilities. The audio encoder is extracted from Voxtral-Mini-3B-2507, and a new audio projector was trained to map Whisper encoder outputs into the 14B model's representation space; the projector's weights were initialized from the Voxtral-Small-24B projector rather than randomly.
This release contains the W4A16 GPTQ 4-bit quantized weights only. The full bf16 weights may be released in the future depending on community interest and compatibility support.
Note: This model requires a custom vLLM plugin to serve, as the tri-modal architecture is not natively supported by any existing framework.
Model Details
| Property | Value |
|---|---|
| Base model | mistralai/Ministral-3-14B-Reasoning-2512 |
| Architecture | MinivoxtralForConditionalGeneration (custom) |
| Parameters | ≈14.6B total (13.9B LM + 637M audio encoder + 52.4M audio projector) |
| Modalities | Text, Vision (Pixtral), Audio (Whisper) |
| Precision | W4A16 GPTQ 4-bit (bf16 vision/audio components) |
| Context length | 262,144 tokens (text) |
| Audio input | 16kHz mono, up to 30s chunks |
| License | Apache 2.0 |
Architecture
Minivoxtral is a weight-level fusion of three model families into a single tri-modal architecture:
Text + Vision Backbone
The base model is Ministral-3-14B-Reasoning-2512, a Mistral-family model with an integrated Pixtral vision tower. This provides the language modeling backbone (40 transformer layers, 5120 hidden dim, 32 attention heads) and vision capabilities (24-layer ViT with 1024 hidden dim, patch size 14) out of the box.
Audio Encoder (Grafted)
The Whisper-style audio encoder is extracted from Voxtral-Mini-3B-2507. It has 32 transformer layers, 1280 hidden dim, 20 attention heads, and 128 mel bins. The encoder weights (≈637M parameters) are used as-is with no modification.
Audio Projector (Trained)
The audio projector is a 2-layer MLP (linear -> GELU -> linear, no bias, 5120 -> 5120, ≈52.4M parameters) that maps packed audio encoder outputs into the 14B model's hidden space. It was initialized from the Voxtral-Small-24B projector weights and then trained against the frozen 14B backbone on LibriSpeech.
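The projector described above can be sketched shape-for-shape in a few lines. This is an illustrative numpy stand-in, not the released weights: the GELU is the standard tanh approximation, the demo runs at a scaled-down width (d=64 in place of 5120) to keep memory trivial, and only the parameter count is computed at the real size.

```python
import numpy as np

def gelu(x):
    # Tanh approximation of GELU.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x ** 3)))

def audio_projector(packed, W1, W2):
    # linear -> GELU -> linear, both layers bias-free.
    return gelu(packed @ W1.T) @ W2.T

# Parameter count at the real size: two bias-free 5120x5120 linear layers.
HIDDEN = 5120
n_params = 2 * HIDDEN * HIDDEN
print(n_params)  # 52428800, i.e. the ~52.4M quoted above

# Scaled-down demo: d=64 stands in for 5120, weights are random.
rng = np.random.default_rng(0)
d = 64
W1, W2 = rng.normal(0, 0.02, (d, d)), rng.normal(0, 0.02, (d, d))
out = audio_projector(rng.normal(size=(375, d)), W1, W2)
print(out.shape)  # (375, 64)
```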
Audio Processing Chain
Audio (16kHz mono)
-> Whisper Feature Extractor (128 mel bins, 3000 frames)
-> Whisper Encoder (1280-dim output, 1500 frames)
-> 4x Frame Packing (reshape to 375 frames x 5120-dim)
-> Audio Projector MLP (5120 -> 5120)
-> LM Backbone (5120-dim hidden space)
The 4x frame packing is a fixed reshape operation (not learned) that concatenates 4 consecutive encoder frames to match the LM hidden dimension: 1280 * 4 = 5120.
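Because the packing is a pure reshape, it can be verified directly. A minimal sketch with a synthetic encoder output standing in for real Whisper features:

```python
import numpy as np

# Synthetic Whisper encoder output for a 30 s chunk: 1500 frames x 1280 dims.
enc = np.arange(1500 * 1280, dtype=np.float32).reshape(1500, 1280)

# 4x frame packing: concatenate every 4 consecutive frames (no learned
# weights), yielding 375 frames x 5120 dims to match the LM hidden size.
packed = enc.reshape(1500 // 4, 4 * 1280)

print(packed.shape)  # (375, 5120)
# The first packed frame is encoder frames 0..3 laid end to end:
print(np.array_equal(packed[0], enc[:4].ravel()))  # True
```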
Weight Organization
The released GPTQ model is organized into 4 safetensors shards (≈10.7 GB total). The GPTQ-quantized LM backbone occupies shards 1-3, while shard 4 contains all unquantized bf16 components: the Pixtral vision tower (218 tensors), vision projector (4 tensors), Whisper audio tower (487 tensors), and audio projector (2 tensors).
Key namespace prefixes: model.layers.* (LM), vision_tower.* (Pixtral), multi_modal_projector.* (vision projector), audio_tower.* (Whisper encoder), audio_multi_modal_projector.* (audio projector).
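A loader that consumes this mixed checkpoint has to dispatch each tensor by its prefix. The routing table below mirrors the namespaces listed above; the function itself is a hypothetical sketch of the dispatch logic, not the plugin's actual loader.

```python
# Prefixes are taken from the checkpoint layout above; module labels are
# illustrative names for where each tensor group would be loaded.
PREFIX_TO_MODULE = {
    "model.layers.": "language_model",
    "vision_tower.": "vision_tower",
    "multi_modal_projector.": "vision_projector",
    "audio_tower.": "audio_encoder",
    "audio_multi_modal_projector.": "audio_projector",
}

def route(tensor_name: str) -> str:
    # Return the destination module for a checkpoint tensor name.
    for prefix, module in PREFIX_TO_MODULE.items():
        if tensor_name.startswith(prefix):
            return module
    return "unmapped"

print(route("audio_multi_modal_projector.linear_1.weight"))  # audio_projector
print(route("model.layers.0.self_attn.q_proj.weight"))       # language_model
```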
Training
What Was Trained
Only the audio projector MLP was trained (≈52.4M parameters). The 14B language model backbone, Pixtral vision tower, vision projector, and Whisper audio encoder were all kept frozen.
Training Data
The projector was trained on LibriSpeech (≈280k utterances) for audio-to-text transcription alignment. The training objective was to teach the projector to map Whisper encoder outputs into the 14B's representation space so that the model produces correct transcriptions.
Training Configuration
| Setting | Value |
|---|---|
| Optimizer | AdamW |
| Learning rate | 1e-4 |
| LR schedule | Warmup (100 steps) + cosine decay |
| Weight decay | 0.01 |
| Batch size (per GPU) | 2 |
| Gradient accumulation | 4 steps |
| Effective batch size | 16 |
| Max gradient norm | 1.0 |
| GPUs | 2x RTX 3090 Ti (data parallel) |
| Backbone | GPTQ 4-bit (frozen, for memory efficiency) |
| Steps completed | 4,000 / 35,400 (≈11%) |
| Training time | ≈9 hours 40 minutes |
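The LR schedule in the table (100-step linear warmup, then cosine decay over the planned 35,400 steps) can be written as a small closed-form function. A sketch, assuming decay to zero at the final planned step:

```python
import math

def lr_at(step: int, base_lr: float = 1e-4, warmup: int = 100,
          total: int = 35_400) -> float:
    # Linear warmup for the first 100 steps, then cosine decay to 0,
    # matching the configuration table above.
    if step < warmup:
        return base_lr * step / warmup
    progress = (step - warmup) / (total - warmup)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

print(lr_at(50))   # 5e-05, halfway through warmup
print(lr_at(100))  # 0.0001, the peak learning rate
```

Note the effective batch size in the table is consistent: 2 per GPU x 4 accumulation steps x 2 GPUs = 16.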
Training Results
Training was stopped early at step 4,000 because validation loss had plateaued since around step 2,500 (hovering around 0.020-0.022). At the final checkpoint: training loss was 0.058 (down from 3.11 at step 1), validation loss was 0.021, and perplexity was 1.020.
The trained projector achieved 0.49% WER on LibriSpeech dev.clean (20-sample evaluation), with 18/20 perfect transcriptions. The two errors were minor proper noun issues.
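For reference, WER here is the word-level Levenshtein distance divided by the reference length. A minimal pure-Python implementation (no normalization or casing rules, which real ASR scoring pipelines usually add):

```python
def wer(reference: str, hypothesis: str) -> float:
    # Word error rate: word-level edit distance / number of reference words.
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[-1][-1] / len(ref)

print(wer("he went to the market", "he went to the market"))          # 0.0
print(wer("mister quilter is the apostle", "mister culter is the apostle"))  # 0.2
```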
Key Finding: Projector Scale Convergence
The trained projector's output standard deviation (≈2.52) naturally converged near the Voxtral-24B projector's native scale (≈2.77), which is roughly 500x larger than text embedding scale (≈0.005). This confirmed that the correct operating range for audio projectors in this architecture is at a much higher magnitude than text embeddings, and that naive scalar calibration to text embedding scale is counterproductive.
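The magnitude gap above is easy to illustrate numerically. The standard deviations below are synthetic stand-ins for the measured values (≈2.52 for projector output, ≈0.005 for text embeddings), not re-derived from the released weights:

```python
import numpy as np

rng = np.random.default_rng(0)
audio_feats = rng.normal(0.0, 2.52, size=(375, 5120))   # projector-output scale
text_embeds = rng.normal(0.0, 0.005, size=(128, 5120))  # text-embedding scale

ratio = audio_feats.std() / text_embeds.std()
print(round(float(ratio)))  # on the order of 500x, matching the finding above
```

Rescaling audio features down by this factor to "match" text embeddings is exactly the naive calibration the finding warns against.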
Evaluation
The evaluation suite is a practical quality snapshot rather than a benchmark-chasing exercise. All tests use frozen prompts, deterministic fixtures, or fixed dataset row indices for reproducibility.
Test Hardware
| Component | Value |
|---|---|
| Serving | vLLM v0.1.dev1 |
| Python | 3.14.3t (GIL disabled) |
| CPU | i9-10900X |
| GPU1 | RTX 3090 Ti |
| GPU2 | RTX 3090 Ti |
| GPU3 | RTX A4000 |
| KV Cache | 16,384 tokens, FP16 |
| VRAM usage | 14.1 GB (max_num_seqs=1) |
Results (3-Run Aggregate)
| Suite | Metrics | Score (mean, min-max) | Notes |
|---|---|---|---|
| Text | EM, parse | EM 0.783 (0.750-0.800), parse 1.000 | Good text baseline, stable formatting |
| Vision | EM, parse | EM 0.562 (0.562-0.562), parse 1.000 | Moderate visual grounding |
| Audio ASR | WER, perfect rate | WER 0.0279, perfect 0.750 | Strong transcription |
| Audio QA | EM, alias F1 | EM 0.000, F1 0.135 | Major weakness (see Limitations) |
| Multimodal | EM, parse | EM 0.583 (0.583-0.583), parse 1.000 | Moderate image+reasoning |
| Tools | strict order, all-pass | order 0.800, all-pass 0.800 | Generally capable |
| Perf | throughput, p95 | 25.63 rps, p95 0.176s | Stable serving |
Test Battery
Each run uses 263 total requests across: text (20), vision (16), audio ASR (48), audio QA (32), multimodal (12), tools (15), and perf (120).
Test sources: text uses frozen prompt lists with deterministic and JSON-reasoning formats; vision and multimodal use generated deterministic image fixtures (checkerboards, dots, text, color grids, geometric shapes); audio ASR uses LibriSpeech validation clean (frozen row indices); audio QA uses spoken TriviaQA and GSM8K speech (frozen row indices); tools use a deterministic mock tool server with strict scoring.
Limitations
Spoken QA is the largest quality gap. The model achieves 0% exact match on spoken question-answering tasks (TriviaQA/GSM8K speech). This is expected because the projector was only trained on transcription (LibriSpeech), not on instruction-following or reasoning over spoken content. The model tends to repeat input rather than producing answers on math reasoning tasks from speech.
Vision and multimodal reasoning are moderate, reflecting the base Ministral-3-14B's capabilities rather than limitations introduced by the fusion.
Audio projector training was narrow. Only ≈52M parameters were trained on transcription data. Fine-tuning on spoken QA datasets (e.g., GSM8K speech with proper answer supervision) would likely improve semantic understanding from audio.
No native framework support. The tri-modal architecture is not recognized by transformers or vLLM out of the box. A custom vLLM plugin is required for serving (see below).
Multilingual audio is untested. The projector was trained on English (LibriSpeech), but the underlying Whisper encoder and Mistral tokenizer both support multilingual content. In theory, if the projector's learned 4x frame packing mapping is accurate, multilingual audio could work, but this has not been validated.
Serving with vLLM
This model requires a custom vLLM plugin that registers the MinivoxtralForConditionalGeneration architecture. The plugin handles:
- Loading mixed checkpoint prefixes (vision/audio/text) into vLLM modules
- Whisper q/k/v mapping and audio MLP projection to text hidden space
- Expanding multimodal prompt placeholders into correct token spans
- Image and audio multimodal preprocessing for OpenAI-compatible API
- Reasoning tag parsing ([THINK]...[/THINK]) for HF tokenizer mode
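The reasoning-tag parsing in the last bullet can be sketched with a regex. The tag names come from the plugin description above; the function itself is a hypothetical illustration, not the plugin's parser:

```python
import re

THINK_RE = re.compile(r"\[THINK\](.*?)\[/THINK\]", re.DOTALL)

def split_reasoning(text: str) -> tuple[str, str]:
    # Returns (reasoning, visible answer) with the [THINK] span stripped out.
    m = THINK_RE.search(text)
    if not m:
        return "", text
    answer = (text[: m.start()] + text[m.end():]).strip()
    return m.group(1).strip(), answer

reasoning, answer = split_reasoning("[THINK]4 x 4 = 16[/THINK]The answer is 16.")
print(reasoning)  # 4 x 4 = 16
print(answer)     # The answer is 16.
```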
Special Token IDs
| Token | ID | Purpose |
|---|---|---|
| BOS | 1 | Beginning of sequence |
| EOS | 2 | End of sequence |
| PAD | 11 | Padding |
| IMAGE | 10 | Image placeholder |
| AUDIO | 24 | Audio placeholder |
| BEGIN_AUDIO | 25 | Audio sequence start marker |
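Placeholder expansion (the third plugin bullet above) can be illustrated with these IDs. This is a hypothetical sketch, assuming one AUDIO placeholder expands to BEGIN_AUDIO followed by one AUDIO token per projected frame (375 for a 30 s chunk); the actual plugin logic may differ:

```python
BOS, AUDIO, BEGIN_AUDIO = 1, 24, 25  # IDs from the table above

def expand_audio_placeholder(token_ids: list[int], num_frames: int) -> list[int]:
    # Replace each single AUDIO placeholder with a BEGIN_AUDIO marker plus
    # one AUDIO slot per projected audio frame.
    out = []
    for t in token_ids:
        if t == AUDIO:
            out.append(BEGIN_AUDIO)
            out.extend([AUDIO] * num_frames)
        else:
            out.append(t)
    return out

prompt = [BOS, 100, AUDIO, 200]  # 100 and 200 are stand-in text token IDs
expanded = expand_audio_placeholder(prompt, num_frames=375)
print(len(expanded))  # 379: BOS + text + BEGIN_AUDIO + 375 audio slots + text
```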
How This Was Built
Why Not Just Graft the 24B Projector?
The Voxtral-Small-24B projector is dimensionally compatible with the 14B model (both use 5120 hidden_size), but produces incoherent outputs when grafted directly. Despite matching dimensions, the 24B projector was trained to map audio to the 24B's internal representation manifold, not the 14B's. Multiple calibration strategies were tested (embed_tokens scale, post-RMSNorm scale, uncalibrated) and all failed to produce correct outputs. This led to training a new projector from scratch.
The 24B Weights Were Still Useful
Initializing the projector from Voxtral-24B weights gave a lower initial training loss (4.89 vs 9.45 with Xavier initialization). While the 24B mapping targets the wrong model's space, it encodes a general audio-to-5120-dim mapping that provides a useful starting point.
Weights & Quantization
This release provides W4A16 GPTQ 4-bit quantized weights only. The LM backbone is quantized with GPTQ (group_size=64, symmetric, 4-bit) while vision and audio components remain in bf16. Total model size is ≈10.7 GB.
The quantized model uses a flat weight namespace (model.layers.*) for the LM backbone, combined with the standard namespace for vision/audio components (vision_tower.*, audio_tower.*, etc.). The custom vLLM plugin handles this mixed namespace automatically.
The full bf16 weights (≈27 GB) may be released separately in the future.
Citation
If you use this model or find the tri-modal fusion approach useful, please cite this repository:
@misc{minivoxtral2026,
title={Minivoxtral-3-14B-Reasoning-2512_ASR: Tri-modal fusion of Ministral-3-14B with Whisper audio encoder},
author={hascrack},
year={2025},
url={https://huggingface.co/hascrack/Minivoxtral-3-14B-Reasoning-2512_ASR}
}
Acknowledgments
This model builds on the work of Mistral AI (Ministral-3-14B-Reasoning, Voxtral-Mini-3B, Voxtral-Small-24B) and OpenAI (Whisper architecture).