| --- |
| base_model: mistralai/Voxtral-Mini-3B-2507 |
| library_name: transformers |
| license: apache-2.0 |
| pipeline_tag: automatic-speech-recognition |
| tags: |
| - voxtral |
| - audio |
| - speech |
| - speech-recognition |
| - transcription |
| - translation |
| - kv-cache |
| - rotorquant |
| - quantization |
| --- |
| |
| # Voxtral-Mini-3B-2507-RotorQuant |
|
|
| RotorQuant KV-cache for [`mistralai/Voxtral-Mini-3B-2507`](https://huggingface.co/mistralai/Voxtral-Mini-3B-2507). It uses a rotational online re-basis of the attention cache, intended to be robust to distributional drift across long, code-switched, or noisy audio streams. |
|
|
| This artifact ships **only the quantized KV-cache bundle**; model weights load from the upstream repo. |
|
|
| ## Overview |
|
|
| - **Base model:** `mistralai/Voxtral-Mini-3B-2507` |
| - **Capabilities:** transcription, speech translation, audio understanding |
| - **Quantization target:** attention KV-cache only |
| - **Method:** RotorQuant (orthogonal rotation + low-bit quantization, refreshed per session) |
|
|
| RotorQuant trades a small per-session calibration pass for better low-bit stability on streaming audio. It is the better fit when the audio domain shifts mid-stream (multi-speaker meetings, code-switching, noise bursts). |
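
To illustrate why an orthogonal rotation can help low-bit quantization, here is a minimal, generic sketch. It is not the RotorQuant implementation: the sign-flipped Hadamard rotation, the single per-vector int4 scale, and every function name below are assumptions chosen for illustration. The example builds one key vector with a single outlier channel and compares the int4 round-trip error with and without the rotation.

```python
import math
import random


def fwht(vec):
    """Normalized fast Walsh-Hadamard transform (orthogonal and self-inverse)."""
    out = list(vec)
    n = len(out)  # must be a power of two
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = out[j], out[j + h]
                out[j], out[j + h] = a + b, a - b
        h *= 2
    norm = math.sqrt(n)
    return [v / norm for v in out]


def rotate(vec, signs):
    # Random sign flips followed by a Hadamard transform: an orthogonal map.
    return fwht([v * s for v, s in zip(vec, signs)])


def unrotate(vec, signs):
    # Inverse map: Hadamard again (self-inverse), then undo the sign flips.
    return [v * s for v, s in zip(fwht(vec), signs)]


def int4_roundtrip(vec):
    # Symmetric int4 with one scale per vector; codes are clipped to [-8, 7].
    scale = max(abs(v) for v in vec) / 7.0 or 1.0
    codes = [max(-8, min(7, round(v / scale))) for v in vec]
    return [c * scale for c in codes]


def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


rng = random.Random(0)
dim = 128  # hypothetical head dimension
signs = [rng.choice((-1.0, 1.0)) for _ in range(dim)]

key = [rng.gauss(0.0, 1.0) for _ in range(dim)]
key[3] = 50.0  # a single outlier channel dominates the quantization scale

err_plain = mse(int4_roundtrip(key), key)
err_rotated = mse(unrotate(int4_roundtrip(rotate(key, signs)), signs), key)
print(f"plain int4 MSE:   {err_plain:.4f}")
print(f"rotated int4 MSE: {err_rotated:.4f}")
```

Because the rotation spreads the outlier's energy across all channels, the quantization step shrinks and the round-trip error drops. How RotorQuant chooses and refreshes its rotation per session is not specified here.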
|
|
| ## Quickstart |
|
|
| ```python |
| from transformers import VoxtralForConditionalGeneration, AutoProcessor |
| from majentik_quant import RotorQuantCache |
| |
| model_id = "mistralai/Voxtral-Mini-3B-2507" |
| processor = AutoProcessor.from_pretrained(model_id) |
| model = VoxtralForConditionalGeneration.from_pretrained(model_id, torch_dtype="auto") |
| |
| cache = RotorQuantCache.from_pretrained("majentik/Voxtral-Mini-3B-2507-RotorQuant") |
| |
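| # Build a transcription request; `language="en"` is an assumption, set it to match the audio. |
| inputs = processor.apply_transcription_request(language="en", audio="meeting.wav", model_id=model_id) |
| inputs = inputs.to(model.device, dtype=model.dtype) |
| out = model.generate(**inputs, past_key_values=cache, max_new_tokens=512) |
| # Decode only the newly generated tokens, not the prompt. |
| print(processor.batch_decode(out[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0]) |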
| ``` |
|
|
| ## Model specs |
|
|
| | Field | Value | |
| |---|---| |
| | Parameters | 3B | |
| | Modality | Audio-in, text-out | |
| | Languages | Multilingual (24+) | |
| | Cache quantization | RotorQuant (rotated int4) | |
| | License | Apache 2.0 | |
|
|
| ## RotorQuant vs TurboQuant |
|
|
| | | RotorQuant | TurboQuant | |
| |---|---|---| |
| | Strategy | Rotational online re-basis | Per-head static calibration | |
| | Memory reduction | ~4x on KV-cache | ~3.5x on KV-cache | |
| | Best for | Streaming, code-switching audio | Batch transcription, fixed domains | |
| | Calibration cost | Per-session light re-basis | One-shot, fast | |
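
As a rough sanity check on the memory-reduction row, the sketch below estimates KV-cache size at different bit widths. It assumes a 16-bit baseline cache, and the layer count, KV-head count, head dimension, token count, and group size are hypothetical placeholders rather than the actual Voxtral configuration or the RotorQuant storage layout.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, tokens,
                   bits_per_value, group_size=None, scale_bits=16):
    """Estimate KV-cache size in bytes for a given storage precision."""
    # Keys and values are both cached, hence the factor of 2.
    values = 2 * layers * kv_heads * head_dim * tokens
    bits = values * bits_per_value
    if group_size:
        # Optional overhead: one scale per group of quantized values.
        bits += (values // group_size) * scale_bits
    return bits / 8


# Hypothetical shape, for illustration only.
shape = dict(layers=30, kv_heads=8, head_dim=128, tokens=32_000)

fp16 = kv_cache_bytes(**shape, bits_per_value=16)
int4 = kv_cache_bytes(**shape, bits_per_value=4)
int4_grouped = kv_cache_bytes(**shape, bits_per_value=4, group_size=128)

print(f"16-bit cache: {fp16 / 2**30:.2f} GiB")
print(f"int4 cache:   {int4 / 2**30:.2f} GiB ({fp16 / int4:.2f}x smaller)")
print(f"int4 + scales: {int4_grouped / 2**30:.2f} GiB ({fp16 / int4_grouped:.2f}x smaller)")
```

Going from 16 bits to 4 bits per value gives exactly 4x before metadata; stored scales or rotation parameters pull the effective ratio slightly below that, which is why such figures are usually quoted as approximate.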
|
|
| ## See also |
|
|
| - [`majentik/Voxtral-Mini-3B-2507-TurboQuant`](https://huggingface.co/majentik/Voxtral-Mini-3B-2507-TurboQuant): static calibrated KV-cache variant |
| - [`majentik/Voxtral-Mini-3B-2507-RotorQuant-MLX-8bit`](https://huggingface.co/majentik/Voxtral-Mini-3B-2507-RotorQuant-MLX-8bit): MLX weight-quantized 8-bit |
| - [`majentik/Voxtral-Mini-3B-2507-RotorQuant-MLX-4bit`](https://huggingface.co/majentik/Voxtral-Mini-3B-2507-RotorQuant-MLX-4bit): MLX weight-quantized 4-bit |
| - [`majentik/Voxtral-Mini-3B-2507-RotorQuant-MLX-2bit`](https://huggingface.co/majentik/Voxtral-Mini-3B-2507-RotorQuant-MLX-2bit): MLX weight-quantized 2-bit |
| - [`mistralai/Voxtral-Mini-3B-2507`](https://huggingface.co/mistralai/Voxtral-Mini-3B-2507): upstream base model |
|
|