---
base_model: mistralai/Voxtral-Mini-3B-2507
library_name: transformers
license: apache-2.0
pipeline_tag: automatic-speech-recognition
tags:
- voxtral
- audio
- speech
- speech-recognition
- transcription
- translation
- kv-cache
- rotorquant
- quantization
---
# Voxtral-Mini-3B-2507-RotorQuant
RotorQuant KV-cache quantization for [`mistralai/Voxtral-Mini-3B-2507`](https://huggingface.co/mistralai/Voxtral-Mini-3B-2507). It applies a rotational online re-basis to the attention cache, designed to stay robust to distributional drift across long, code-switched, or noisy audio streams.
This artifact ships **only the quantized KV-cache bundle** — model weights load from the upstream repo.
## Overview
- **Base model:** `mistralai/Voxtral-Mini-3B-2507`
- **Capabilities:** transcription, speech translation, audio understanding
- **Quantization target:** attention KV-cache only
- **Method:** RotorQuant — orthogonal rotation + low-bit quantization, refreshed per session
RotorQuant trades a light per-session calibration pass for better low-bit stability on streaming audio. It is the preferred variant when the audio domain shifts mid-stream (multi-speaker meetings, code-switching, noise bursts).
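To make the "orthogonal rotation + low-bit quantization" idea concrete, here is a minimal NumPy sketch. It is not the RotorQuant implementation: a fixed random orthogonal matrix stands in for the per-session re-basis, whose actual procedure is not described in this card, and the helper names (`random_rotation`, `quantize_int4`, `dequantize`, `rotated_roundtrip`) are hypothetical. The toy keys contain one outlier channel, the kind of structure a rotation is meant to spread across dimensions before quantizing.

```python
import numpy as np

def random_rotation(dim, seed=0):
    """Random orthogonal matrix from the QR decomposition of a Gaussian matrix."""
    rng = np.random.default_rng(seed)
    q, r = np.linalg.qr(rng.standard_normal((dim, dim)))
    return q * np.sign(np.diag(r))  # flip column signs; q stays orthogonal

def quantize_int4(x):
    """Symmetric per-row quantization to integer codes in [-7, 7]."""
    scale = np.abs(x).max(axis=-1, keepdims=True) / 7.0
    scale = np.where(scale == 0, 1.0, scale)  # avoid division by zero on all-zero rows
    codes = np.clip(np.round(x / scale), -7, 7).astype(np.int8)
    return codes, scale

def dequantize(codes, scale):
    """Map integer codes back to floats."""
    return codes.astype(np.float32) * scale

def rotated_roundtrip(keys, rotation):
    """Rotate, quantize, dequantize, then undo the rotation."""
    codes, scale = quantize_int4(keys @ rotation)
    return dequantize(codes, scale) @ rotation.T

# Toy keys (assumption): 128 tokens, head_dim 64, one channel with large magnitude.
rng = np.random.default_rng(1)
keys = rng.standard_normal((128, 64)).astype(np.float32)
keys[:, 3] *= 20.0

rotation = random_rotation(64)
plain = dequantize(*quantize_int4(keys))
rotated = rotated_roundtrip(keys, rotation)

err_plain = float(np.mean((keys - plain) ** 2))
err_rotated = float(np.mean((keys - rotated) ** 2))
print(f"MSE without rotation: {err_plain:.4f}")
print(f"MSE with rotation:    {err_rotated:.4f}")
```

Because the rotation is orthogonal, its transpose undoes it exactly, so the only loss comes from the quantization step. A real cache would also need to store the rotation (or its seed) alongside the codes and scales.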
## Quickstart
```python
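from transformers import VoxtralForConditionalGeneration, AutoProcessor
from majentik_quant import RotorQuantCache

model_id = "mistralai/Voxtral-Mini-3B-2507"
processor = AutoProcessor.from_pretrained(model_id)
model = VoxtralForConditionalGeneration.from_pretrained(model_id, torch_dtype="auto")

# Only the quantized KV-cache bundle is loaded from this repo.
cache = RotorQuantCache.from_pretrained("majentik/Voxtral-Mini-3B-2507-RotorQuant")

# Build a transcription request; language="en" is an assumption, set it to match your audio.
inputs = processor.apply_transcription_request(language="en", audio="meeting.wav", model_id=model_id)
inputs = inputs.to(model.device, dtype=model.dtype)
out = model.generate(**inputs, past_key_values=cache, max_new_tokens=512)
# Decode only the newly generated tokens, not the prompt.
print(processor.batch_decode(out[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0])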
```
## Model specs
| Field | Value |
|---|---|
| Parameters | 3B |
| Modality | Audio-in, text-out |
| Languages | Multilingual (24+) |
| Cache quantization | RotorQuant (rotated int4) |
| License | Apache 2.0 |
## RotorQuant vs TurboQuant
| | RotorQuant | TurboQuant |
|---|---|---|
| Strategy | Rotational online re-basis | Per-head static calibration |
| Memory reduction | ~4x on KV-cache | ~3.5x on KV-cache |
| Best for | Streaming, code-switching audio | Batch transcription, fixed domains |
| Calibration cost | Per-session light re-basis | One-shot, fast |
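As a rough check on the memory-reduction row, the sketch below computes KV-cache size for a 16-bit baseline and for 4-bit codes, with and without per-group scales. The baseline precision, the tensor shape, the group size of 64, and the 16-bit scales are all assumptions chosen for illustration; none of them are read from the Voxtral config or from either quantization bundle, and `kv_cache_bytes` is a hypothetical helper.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bits,
                   group_size=None, scale_bits=16):
    """Bytes for keys + values; optionally adds one scale per group of elements."""
    elements = 2 * tokens * layers * kv_heads * head_dim  # 2 = keys and values
    total_bits = elements * bits
    if group_size is not None:
        total_bits += (elements // group_size) * scale_bits
    return total_bits / 8

# Hypothetical shape for illustration only.
shape = dict(tokens=8192, layers=30, kv_heads=8, head_dim=128)

fp16 = kv_cache_bytes(bits=16, **shape)
int4 = kv_cache_bytes(bits=4, **shape)
int4_scaled = kv_cache_bytes(bits=4, group_size=64, **shape)

print(f"16-bit baseline: {fp16 / 2**20:.0f} MiB")
print(f"4-bit, no scales: {fp16 / int4:.2f}x smaller")
print(f"4-bit, one 16-bit scale per 64 values: {fp16 / int4_scaled:.2f}x smaller")
```

The ratio does not depend on the shape: pure 4-bit codes give exactly 4x against a 16-bit baseline, and any stored metadata such as scales pulls the effective figure somewhat below that (about 3.76x under the assumptions above).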
## See also
- [`majentik/Voxtral-Mini-3B-2507-TurboQuant`](https://huggingface.co/majentik/Voxtral-Mini-3B-2507-TurboQuant) — static calibrated KV-cache variant
- [`majentik/Voxtral-Mini-3B-2507-RotorQuant-MLX-8bit`](https://huggingface.co/majentik/Voxtral-Mini-3B-2507-RotorQuant-MLX-8bit) — MLX weight-quantized 8-bit
- [`majentik/Voxtral-Mini-3B-2507-RotorQuant-MLX-4bit`](https://huggingface.co/majentik/Voxtral-Mini-3B-2507-RotorQuant-MLX-4bit) — MLX weight-quantized 4-bit
- [`majentik/Voxtral-Mini-3B-2507-RotorQuant-MLX-2bit`](https://huggingface.co/majentik/Voxtral-Mini-3B-2507-RotorQuant-MLX-2bit) — MLX weight-quantized 2-bit
- [`mistralai/Voxtral-Mini-3B-2507`](https://huggingface.co/mistralai/Voxtral-Mini-3B-2507) — upstream base model