---
base_model: mistralai/Voxtral-Mini-3B-2507
library_name: transformers
license: apache-2.0
pipeline_tag: automatic-speech-recognition
tags:
- voxtral
- audio
- speech
- speech-recognition
- transcription
- translation
- kv-cache
- rotorquant
- quantization
---

# Voxtral-Mini-3B-2507-RotorQuant

RotorQuant KV-cache for [`mistralai/Voxtral-Mini-3B-2507`](https://huggingface.co/mistralai/Voxtral-Mini-3B-2507). Uses a rotational online re-basis of the attention cache that is robust to distributional drift across long, code-switched, or noisy audio streams.

This artifact ships **only the quantized KV-cache bundle**; model weights load from the upstream repo.

## Overview

- **Base model:** `mistralai/Voxtral-Mini-3B-2507`
- **Capabilities:** transcription, speech translation, audio understanding
- **Quantization target:** attention KV-cache only
- **Method:** RotorQuant: orthogonal rotation + low-bit quantization, refreshed per session

RotorQuant trades a tiny per-session calibration pass for better low-bit stability on streaming audio. Preferred when audio domains shift mid-stream (multi-speaker meetings, code-switching, noise bursts).

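
To illustrate why an orthogonal rotation can help low-bit quantization, the sketch below rotates a vector, quantizes it to int4, and rotates it back. It is not the RotorQuant implementation: the normalized Walsh-Hadamard transform, the per-vector symmetric scale, and the synthetic key vector with one outlier channel are all assumptions chosen for illustration, and every function name here is hypothetical.

```python
import math
import random


def hadamard_rotate(vec):
    """Normalized fast Walsh-Hadamard transform (orthogonal and self-inverse).

    Stand-in rotation for illustration only; the rotation RotorQuant
    actually uses is not specified in this card.
    """
    n = len(vec)
    assert n & (n - 1) == 0, "length must be a power of two"
    out = list(vec)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [x * scale for x in out]


def quantize_int4(vec):
    """Symmetric per-vector int4 quantization: codes in [-8, 7] plus one scale."""
    max_abs = max(abs(x) for x in vec)
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    codes = [max(-8, min(7, round(x / scale))) for x in vec]
    return codes, scale


def dequantize(codes, scale):
    """Map int4 codes back to floats."""
    return [c * scale for c in codes]


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


random.seed(0)
# Hypothetical key vector: small values plus one large outlier channel.
key = [random.gauss(0.0, 1.0) for _ in range(64)]
key[3] = 40.0

# Baseline: quantize directly. The outlier inflates the scale,
# so most small entries collapse to zero.
codes, scale = quantize_int4(key)
plain_err = mse(key, dequantize(codes, scale))

# Rotate first: the outlier's energy is spread across all channels,
# which shrinks the quantization step. Rotate back after dequantizing.
rotated = hadamard_rotate(key)
codes_r, scale_r = quantize_int4(rotated)
restored = hadamard_rotate(dequantize(codes_r, scale_r))
rotated_err = mse(key, restored)

print(f"plain int4 MSE:   {plain_err:.4f}")
print(f"rotated int4 MSE: {rotated_err:.4f}")
```

Because the rotation is orthogonal, the reconstruction error measured after rotating back equals the quantization error in the rotated basis, so any gain comes purely from the flatter value distribution.
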
## Quickstart

```python
from transformers import VoxtralForConditionalGeneration, AutoProcessor
from majentik_quant import RotorQuantCache

model_id = "mistralai/Voxtral-Mini-3B-2507"

# Load the processor and base model weights from the upstream repo.
processor = AutoProcessor.from_pretrained(model_id)
model = VoxtralForConditionalGeneration.from_pretrained(model_id, torch_dtype="auto")

# Load the RotorQuant KV-cache bundle shipped in this repo.
cache = RotorQuantCache.from_pretrained("majentik/Voxtral-Mini-3B-2507-RotorQuant")

# Prepare the audio and generate with the quantized cache.
inputs = processor(audio="meeting.wav", return_tensors="pt")
out = model.generate(**inputs, past_key_values=cache, max_new_tokens=512)
print(processor.batch_decode(out, skip_special_tokens=True)[0])
```

## Model specs

| Field | Value |
|---|---|
| Parameters | 3B |
| Modality | Audio-in, text-out |
| Languages | Multilingual (24+) |
| Cache quantization | RotorQuant (rotated int4) |
| License | Apache 2.0 |

## RotorQuant vs TurboQuant

| | RotorQuant | TurboQuant |
|---|---|---|
| Strategy | Rotational online re-basis | Per-head static calibration |
| Memory reduction | ~4x on KV-cache | ~3.5x on KV-cache |
| Best for | Streaming, code-switching audio | Batch transcription, fixed domains |
| Calibration cost | Per-session light re-basis | One-shot, fast |

## See also

- [`majentik/Voxtral-Mini-3B-2507-TurboQuant`](https://huggingface.co/majentik/Voxtral-Mini-3B-2507-TurboQuant): static calibrated KV-cache variant
- [`majentik/Voxtral-Mini-3B-2507-RotorQuant-MLX-8bit`](https://huggingface.co/majentik/Voxtral-Mini-3B-2507-RotorQuant-MLX-8bit): MLX weight-quantized 8-bit
- [`majentik/Voxtral-Mini-3B-2507-RotorQuant-MLX-4bit`](https://huggingface.co/majentik/Voxtral-Mini-3B-2507-RotorQuant-MLX-4bit): MLX weight-quantized 4-bit
- [`majentik/Voxtral-Mini-3B-2507-RotorQuant-MLX-2bit`](https://huggingface.co/majentik/Voxtral-Mini-3B-2507-RotorQuant-MLX-2bit): MLX weight-quantized 2-bit
- [`mistralai/Voxtral-Mini-3B-2507`](https://huggingface.co/mistralai/Voxtral-Mini-3B-2507): upstream base model