Minivoxtral-3-14B-Reasoning-2512_ASR

A tri-modal (text + vision + audio) model built by grafting a Whisper audio encoder onto Mistral's Ministral-3-14B-Reasoning vision-language model.

This is an experimental architecture fusion that extends a vision-language reasoning model with audio understanding capabilities. The audio encoder is extracted from Voxtral-Mini-3B-2507, and a custom audio projector, initialized from Voxtral-Small-24B projector weights, was trained to map Whisper encoder outputs into the 14B model's representation space.

This release contains the W4A16 GPTQ 4-bit quantized weights only. The full bf16 weights may be released in the future depending on community interest and compatibility support.

Note: This model requires a custom vLLM plugin to serve, as the tri-modal architecture is not natively supported by any existing framework.

Model Details

| Property | Value |
| --- | --- |
| Base model | mistralai/Ministral-3-14B-Reasoning-2512 |
| Architecture | MinivoxtralForConditionalGeneration (custom) |
| Parameters | ≈14.6B total (13.9B LM + 637M audio encoder + 52.4M audio projector) |
| Modalities | Text, Vision (Pixtral), Audio (Whisper) |
| Precision | W4A16 GPTQ 4-bit (bf16 vision/audio components) |
| Context length | 262,144 tokens (text) |
| Audio input | 16 kHz mono, up to 30 s chunks |
| License | Apache 2.0 |

Architecture

Minivoxtral is a weight-level fusion of three model families into a single tri-modal architecture:

Text + Vision Backbone

The base model is Ministral-3-14B-Reasoning-2512, a Mistral-family model with an integrated Pixtral vision tower. This provides the language modeling backbone (40 transformer layers, 5120 hidden dim, 32 attention heads) and vision capabilities (24-layer ViT with 1024 hidden dim, patch size 14) out of the box.

Audio Encoder (Grafted)

The Whisper-style audio encoder is extracted from Voxtral-Mini-3B-2507. It has 32 transformer layers, 1280 hidden dim, 20 attention heads, and 128 mel bins. The encoder weights (≈637M parameters) are used as-is with no modification.

Audio Projector (Trained)

The audio projector is a 2-layer MLP (linear -> GELU -> linear, no bias, 5120 -> 5120, ≈52.4M parameters) that maps packed audio encoder outputs into the 14B model's hidden space. It was initialized from the Voxtral-Small-24B projector weights and then trained against the frozen 14B backbone on LibriSpeech.
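
For concreteness, here is a minimal PyTorch sketch of the projector shape described above; class and attribute names are illustrative, not the checkpoint's:

```python
import torch
import torch.nn as nn

class AudioProjector(nn.Module):
    """2-layer MLP: linear -> GELU -> linear, no bias (names illustrative)."""

    def __init__(self, hidden_size: int = 5120):
        super().__init__()
        self.linear_1 = nn.Linear(hidden_size, hidden_size, bias=False)
        self.act = nn.GELU()
        self.linear_2 = nn.Linear(hidden_size, hidden_size, bias=False)

    def forward(self, packed_audio: torch.Tensor) -> torch.Tensor:
        # packed_audio: (batch, 375, 5120) after 4x frame packing
        return self.linear_2(self.act(self.linear_1(packed_audio)))
```

Two bias-free 5120x5120 linears give 2 × 5120² ≈ 52.4M parameters, matching the count above.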

Audio Processing Chain

```text
Audio (16kHz mono)
  -> Whisper Feature Extractor (128 mel bins, 3000 frames)
  -> Whisper Encoder (1280-dim output, 1500 frames)
  -> 4x Frame Packing (reshape to 375 frames x 5120-dim)
  -> Audio Projector MLP (5120 -> 5120)
  -> LM Backbone (5120-dim hidden space)
```

The 4x frame packing is a fixed reshape operation (not learned) that concatenates 4 consecutive encoder frames to match the LM hidden dimension: 1280 * 4 = 5120.
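
Because the packing is just a reshape, it fits in a few lines. The helper below is our own illustration (not code from this repo) and assumes a row-major (batch, frames, features) tensor:

```python
import torch

def pack_frames(encoder_out: torch.Tensor, factor: int = 4) -> torch.Tensor:
    """Concatenate `factor` consecutive frames along the feature dimension.

    (batch, 1500, 1280) -> (batch, 375, 5120). Fixed reshape, no learned weights.
    """
    b, t, d = encoder_out.shape
    assert t % factor == 0, "frame count must be divisible by the packing factor"
    return encoder_out.reshape(b, t // factor, d * factor)
```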

Weight Organization

The released GPTQ model is organized into 4 safetensors shards (≈10.7 GB total). The GPTQ-quantized LM backbone occupies shards 1-3, while shard 4 contains all unquantized bf16 components: the Pixtral vision tower (218 tensors), vision projector (4 tensors), Whisper audio tower (487 tensors), and audio projector (2 tensors).

Key namespace prefixes: model.layers.* (LM), vision_tower.* (Pixtral), multi_modal_projector.* (vision projector), audio_tower.* (Whisper encoder), audio_multi_modal_projector.* (audio projector).
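
These prefixes can be checked directly against the checkpoint with safetensors; the shard filename below is illustrative (the real shard-to-tensor mapping lives in model.safetensors.index.json):

```python
from safetensors import safe_open

PREFIXES = ("model.layers.", "vision_tower.", "multi_modal_projector.",
            "audio_tower.", "audio_multi_modal_projector.")

# Count tensors per namespace in one shard (filename is illustrative).
with safe_open("model-00004-of-00004.safetensors", framework="pt") as f:
    keys = list(f.keys())
counts = {p: sum(k.startswith(p) for k in keys) for p in PREFIXES}
print(counts)
```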

Training

What Was Trained

Only the audio projector MLP was trained (≈52.4M parameters). The 14B language model backbone, Pixtral vision tower, vision projector, and Whisper audio encoder were all kept frozen.

Training Data

The projector was trained on LibriSpeech (≈280k utterances) for audio-to-text transcription alignment. The training objective was to teach the projector to map Whisper encoder outputs into the 14B's representation space so that the model produces correct transcriptions.
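
For reference, LibriSpeech is available through the HF datasets hub; the dataset id and split below are assumptions, since the exact composition of the ≈280k utterances is not specified here:

```python
from datasets import load_dataset

# Assumed source: English LibriSpeech via HF datasets (16 kHz audio + transcript).
ds = load_dataset("librispeech_asr", "clean", split="train.360")
example = ds[0]
print(example["audio"]["sampling_rate"], example["text"])
```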

Training Configuration

| Setting | Value |
| --- | --- |
| Optimizer | AdamW |
| Learning rate | 1e-4 |
| LR schedule | Warmup (100 steps) + cosine decay |
| Weight decay | 0.01 |
| Batch size (per GPU) | 2 |
| Gradient accumulation | 4 steps |
| Effective batch size | 16 |
| Max gradient norm | 1.0 |
| GPUs | 2x RTX 3090 Ti (data parallel) |
| Backbone | GPTQ 4-bit (frozen, for memory efficiency) |
| Steps completed | 4,000 / 35,400 (≈11%) |
| Training time | ≈9 hours 40 minutes |
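
A minimal training-step sketch consistent with the configuration above; the `projector`, `model`, and `loader` objects are assumed to exist, and this is not the author's actual script:

```python
import torch
from transformers import get_cosine_schedule_with_warmup

ACCUM = 4  # gradient accumulation steps

# Only the ~52.4M projector parameters receive gradients; the rest is frozen.
optimizer = torch.optim.AdamW(projector.parameters(), lr=1e-4, weight_decay=0.01)
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=100, num_training_steps=35_400)

for step, batch in enumerate(loader):
    loss = model(**batch).loss / ACCUM  # assumes an HF-style output with .loss
    loss.backward()
    if (step + 1) % ACCUM == 0:
        torch.nn.utils.clip_grad_norm_(projector.parameters(), max_norm=1.0)
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
```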

Training Results

Training was stopped early at step 4,000 because validation loss had plateaued since around step 2,500 (hovering around 0.020-0.022). At the final checkpoint: training loss was 0.058 (down from 3.11 at step 1), validation loss was 0.021, and perplexity was 1.020.

The trained projector achieved 0.49% WER on LibriSpeech dev.clean (20-sample evaluation), with 18/20 perfect transcriptions. The two errors were minor proper noun issues.
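
A WER number like the one above can be reproduced with the jiwer package (the tool choice is ours; the card does not say which scorer was used):

```python
import jiwer

# Replace with real (reference, hypothesis) transcript pairs.
references = ["he hoped there would be stew for dinner"]
hypotheses = ["he hoped there would be stew for dinner"]
print(f"WER: {jiwer.wer(references, hypotheses):.4f}")  # 0.0000 for a perfect match
```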

Key Finding: Projector Scale Convergence

The trained projector's output standard deviation (≈2.52) naturally converged near the Voxtral-24B projector's native scale (≈2.77), which is roughly 500x larger than text embedding scale (≈0.005). This confirmed that the correct operating range for audio projectors in this architecture is at a much higher magnitude than text embeddings, and that naive scalar calibration to text embedding scale is counterproductive.
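
A quick diagnostic for the scale claim, assuming you already have projector outputs and the LM's input embeddings in hand (function and argument names are hypothetical):

```python
import torch

@torch.no_grad()
def report_scales(audio_feats: torch.Tensor, text_embeds: torch.Tensor) -> None:
    """Compare the magnitudes discussed above (expected: ~2.5 vs ~0.005)."""
    print(f"audio projector output std: {audio_feats.float().std().item():.4f}")
    print(f"text embedding std:         {text_embeds.float().std().item():.4f}")
```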

Evaluation

The evaluation suite is a practical quality snapshot rather than a benchmark-chasing exercise. All tests use frozen prompts, deterministic fixtures, or fixed dataset row indices for reproducibility.

Test Hardware

| Component | Value |
| --- | --- |
| Serving | vLLM v0.1.dev1 |
| Python | 3.14.3t (GIL disabled) |
| CPU | i9-10900X |
| GPU 1 | RTX 3090 Ti |
| GPU 2 | RTX 3090 Ti |
| GPU 3 | RTX A4000 |
| KV cache | 16,384 tokens, FP16 |
| VRAM usage | 14.1 GB (max_num_seqs=1) |

Results (3-Run Aggregate)

| Suite | Metrics | Score (mean, min-max) | Notes |
| --- | --- | --- | --- |
| Text | EM, parse | EM 0.783 (0.750-0.800), parse 1.000 | Good text baseline, stable formatting |
| Vision | EM, parse | EM 0.562 (0.562-0.562), parse 1.000 | Moderate visual grounding |
| Audio ASR | WER, perfect rate | WER 0.0279, perfect 0.750 | Strong transcription |
| Audio QA | EM, alias F1 | EM 0.000, F1 0.135 | Major weakness (see Limitations) |
| Multimodal | EM, parse | EM 0.583 (0.583-0.583), parse 1.000 | Moderate image+reasoning |
| Tools | strict order, all-pass | order 0.800, all-pass 0.800 | Generally capable |
| Perf | throughput, p95 | 25.63 rps, p95 0.176 s | Stable serving |

Test Battery

Each run uses 263 total requests across: text (20), vision (16), audio ASR (48), audio QA (32), multimodal (12), tools (15), and perf (120).

Test sources:

  • Text: frozen prompt lists with deterministic and JSON-reasoning formats
  • Vision and multimodal: generated deterministic image fixtures (checkerboards, dots, text, color grids, geometric shapes)
  • Audio ASR: LibriSpeech validation clean (frozen row indices)
  • Audio QA: spoken TriviaQA and GSM8K speech (frozen row indices)
  • Tools: a deterministic mock tool server with strict scoring

Limitations

Spoken QA is the largest quality gap. The model achieves 0% exact match on spoken question-answering tasks (TriviaQA/GSM8K speech). This is expected: the projector was trained only on transcription (LibriSpeech), not on instruction-following or reasoning over spoken content. On spoken math reasoning tasks the model tends to repeat the input rather than produce an answer.

Vision and multimodal reasoning are moderate, reflecting the base Ministral-3-14B's capabilities rather than limitations introduced by the fusion.

Audio projector training was narrow. Only ≈52M parameters were trained on transcription data. Fine-tuning on spoken QA datasets (e.g., GSM8K speech with proper answer supervision) would likely improve semantic understanding from audio.

No native framework support. The tri-modal architecture is not recognized by transformers or vLLM out of the box. A custom vLLM plugin is required for serving (see below).

Multilingual audio is untested. The projector was trained on English (LibriSpeech), but the underlying Whisper encoder and Mistral tokenizer both support multilingual content. In theory, if the projector's learned mapping generalizes beyond English, multilingual audio could work, but this has not been validated.

Serving with vLLM

This model requires a custom vLLM plugin that registers the MinivoxtralForConditionalGeneration architecture. The plugin handles:

  • Loading mixed checkpoint prefixes (vision/audio/text) into vLLM modules
  • Whisper q/k/v mapping and audio MLP projection to text hidden space
  • Expanding multimodal prompt placeholders into correct token spans
  • Image and audio multimodal preprocessing for OpenAI-compatible API
  • Reasoning tag parsing ([THINK]...[/THINK]) for HF tokenizer mode
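
For orientation, out-of-tree architectures are usually exposed to vLLM through its plugin entry point, roughly as below; the module and class paths are illustrative, so consult the actual plugin for the real ones:

```python
# Registered under the "vllm.general_plugins" entry point in the plugin's pyproject.toml.
def register():
    from vllm import ModelRegistry

    # Illustrative module path; the real plugin defines its own.
    ModelRegistry.register_model(
        "MinivoxtralForConditionalGeneration",
        "minivoxtral_vllm.model:MinivoxtralForConditionalGeneration",
    )
```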

Special Token IDs

| Token | ID | Purpose |
| --- | --- | --- |
| BOS | 1 | Beginning of sequence |
| EOS | 2 | End of sequence |
| PAD | 11 | Padding |
| IMAGE | 10 | Image placeholder |
| AUDIO | 24 | Audio placeholder |
| BEGIN_AUDIO | 25 | Audio sequence start marker |
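
Once the plugin is installed and the server is running, a transcription request can go through the OpenAI-compatible API. This sketch assumes the plugin wires audio through the standard input_audio content type; the filename and prompt are placeholders:

```python
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

with open("sample.wav", "rb") as f:  # 16 kHz mono WAV (filename is illustrative)
    audio_b64 = base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="hascrack/Minivoxtral-3-14B-Reasoning-2512_ASR",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Transcribe this audio."},
            {"type": "input_audio",
             "input_audio": {"data": audio_b64, "format": "wav"}},
        ],
    }],
)
print(resp.choices[0].message.content)
```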

How This Was Built

Why Not Just Graft the 24B Projector?

The Voxtral-Small-24B projector is dimensionally compatible with the 14B model (both use 5120 hidden_size), but produces incoherent outputs when grafted directly. Despite matching dimensions, the 24B projector was trained to map audio to the 24B's internal representation manifold, not the 14B's. Multiple calibration strategies were tested (embed_tokens scale, post-RMSNorm scale, uncalibrated) and all failed to produce correct outputs. This led to training a new projector from scratch.

The 24B Weights Were Still Useful

Initializing the projector from Voxtral-24B weights gave a lower initial training loss (4.89 vs 9.45 with Xavier initialization). While the 24B mapping targets the wrong model's space, it encodes a general audio-to-5120-dim mapping that provides a useful starting point.

Weights & Quantization

This release provides W4A16 GPTQ 4-bit quantized weights only. The LM backbone is quantized with GPTQ (group_size=64, symmetric, 4-bit) while vision and audio components remain in bf16. Total model size is ≈10.7 GB.
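
A quantization config matching these settings, sketched in the AutoGPTQ style (the author's actual recipe and toolchain are not published here, so treat this as an assumption):

```python
from auto_gptq import BaseQuantizeConfig

# Matches the stated settings: 4-bit, group_size=64, symmetric quantization.
quantize_config = BaseQuantizeConfig(
    bits=4,
    group_size=64,
    sym=True,
    desc_act=False,  # assumption: activation ordering is not specified in the card
)
```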

The quantized model uses a flat weight namespace (model.layers.*) for the LM backbone, combined with the standard namespace for vision/audio components (vision_tower.*, audio_tower.*, etc.). The custom vLLM plugin handles this mixed namespace automatically.

The full bf16 weights (≈27 GB) may be released separately in the future.

Citation

If you use this model or find the tri-modal fusion approach useful, please cite this repository:

```bibtex
@misc{minivoxtral2026,
  title={Minivoxtral-3-14B-Reasoning-2512_ASR: Tri-modal fusion of Ministral-3-14B with Whisper audio encoder},
  author={hascrack},
  year={2025},
  url={https://huggingface.co/hascrack/Minivoxtral-3-14B-Reasoning-2512_ASR}
}
```

Acknowledgments

This model builds on the work of Mistral AI (Ministral-3-14B-Reasoning, Voxtral-Mini-3B, Voxtral-Small-24B) and OpenAI (Whisper architecture).
