---
base_model: mistralai/Voxtral-Mini-3B-2507
library_name: transformers
license: apache-2.0
pipeline_tag: automatic-speech-recognition
tags:
- voxtral
- audio
- speech
- speech-recognition
- transcription
- translation
- kv-cache
- rotorquant
- quantization
---
# Voxtral-Mini-3B-2507-RotorQuant
A RotorQuant KV-cache for [`mistralai/Voxtral-Mini-3B-2507`](https://huggingface.co/mistralai/Voxtral-Mini-3B-2507). It applies a rotational online re-basis to the attention cache, which is intended to stay robust to distributional drift across long, code-switched, or noisy audio streams.
This artifact ships **only the quantized KV-cache bundle**; model weights load from the upstream repo.
## Overview
- **Base model:** `mistralai/Voxtral-Mini-3B-2507`
- **Capabilities:** transcription, speech translation, audio understanding
- **Quantization target:** attention KV-cache only
- **Method:** RotorQuant (orthogonal rotation + low-bit quantization, refreshed per session)
RotorQuant trades a small per-session calibration pass for better low-bit stability on streaming audio. It is the preferred variant when audio domains shift mid-stream (multi-speaker meetings, code-switching, noise bursts).
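
The "orthogonal rotation + low-bit quantization" idea can be illustrated with a small, dependency-free sketch. This is not the RotorQuant implementation: the card does not say how the rotation is built, so the example swaps in a randomized Hadamard rotation as a stand-in, and every function name here is hypothetical. It quantizes one vector to int4 directly and again in a rotated basis, so the two reconstruction errors can be compared.

```python
import math
import random

def hadamard_rotate(vec, signs):
    """Orthogonal rotation: random sign flips, then a normalized Walsh-Hadamard transform."""
    n = len(vec)  # must be a power of two
    out = [v * s for v, s in zip(vec, signs)]
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = out[j], out[j + h]
                out[j], out[j + h] = a + b, a - b
        h *= 2
    return [v / math.sqrt(n) for v in out]

def hadamard_unrotate(vec, signs):
    """Inverse rotation: the normalized transform is its own inverse, then undo the signs."""
    plain = hadamard_rotate(vec, [1.0] * len(vec))
    return [v * s for v, s in zip(plain, signs)]

def quantize_int4(vec):
    """Symmetric int4: one scale per vector, codes clamped to [-8, 7]."""
    peak = max(abs(v) for v in vec)
    scale = peak / 7.0 if peak > 0 else 1.0
    return [max(-8, min(7, round(v / scale))) for v in vec], scale

def dequantize_int4(codes, scale):
    return [c * scale for c in codes]

def l2_error(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

rng = random.Random(0)
dim = 64
signs = [rng.choice((-1.0, 1.0)) for _ in range(dim)]  # stand-in for a per-session rotation
key = [rng.gauss(0.0, 0.1) for _ in range(dim)]        # toy "key" vector, not real model data
key[3] = 10.0                                          # a single outlier channel

# Quantize directly, then quantize in the rotated basis and rotate back.
direct = dequantize_int4(*quantize_int4(key))
codes, scale = quantize_int4(hadamard_rotate(key, signs))
rotated = hadamard_unrotate(dequantize_int4(codes, scale), signs)

print("direct int4 error :", l2_error(key, direct))
print("rotated int4 error:", l2_error(key, rotated))
```

Because the rotation is orthogonal, it preserves vector norms and can be undone exactly; its purpose is to spread an outlier's energy across coordinates so that a single scale fits the values more evenly. In this toy setting, drawing fresh `signs` at the start of each session would play the role of the per-session refresh.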
## Quickstart
```python
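from transformers import VoxtralForConditionalGeneration, AutoProcessor
from majentik_quant import RotorQuantCache

model_id = "mistralai/Voxtral-Mini-3B-2507"
processor = AutoProcessor.from_pretrained(model_id)
model = VoxtralForConditionalGeneration.from_pretrained(model_id, torch_dtype="auto")
# Only the quantized KV-cache bundle comes from this repo; weights come from model_id.
cache = RotorQuantCache.from_pretrained("majentik/Voxtral-Mini-3B-2507-RotorQuant")

# The Voxtral processor's __call__ is text-only, so audio goes through a transcription request.
# language="en" is an assumption about the recording; change it to match your audio.
inputs = processor.apply_transcription_request(language="en", audio="meeting.wav", model_id=model_id)
inputs = inputs.to(model.device, dtype=model.dtype)
out = model.generate(**inputs, past_key_values=cache, max_new_tokens=512)
# Decode only the newly generated tokens, skipping the prompt.
print(processor.batch_decode(out[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0])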
```
## Model specs
| Field | Value |
|---|---|
| Parameters | 3B |
| Modality | Audio-in, text-out |
| Languages | Multilingual (24+) |
| Cache quantization | RotorQuant (rotated int4) |
| License | Apache 2.0 |
## RotorQuant vs TurboQuant
| | RotorQuant | TurboQuant |
|---|---|---|
| Strategy | Rotational online re-basis | Per-head static calibration |
| Memory reduction | ~4x on KV-cache | ~3.5x on KV-cache |
| Best for | Streaming, code-switching audio | Batch transcription, fixed domains |
| Calibration cost | Per-session light re-basis | One-shot, fast |
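
To put the memory-reduction row in context, here is a rough back-of-the-envelope sketch of KV-cache size. It assumes a 16-bit baseline cache and per-group 16-bit scales, neither of which the card states, and the layer, head, and sequence numbers are hypothetical placeholders rather than the actual Voxtral configuration.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bits, group_size=None, scale_bits=16):
    """Bytes for keys plus values, with an optional per-group scale overhead."""
    values = 2 * layers * kv_heads * head_dim * seq_len  # 2 = keys and values
    total_bits = values * bits
    if group_size is not None:
        # One stored scale per group of quantized values (assumed layout).
        total_bits += (values // group_size) * scale_bits
    return total_bits / 8

# Hypothetical shape, chosen only to make the arithmetic concrete.
shape = dict(layers=30, kv_heads=8, head_dim=128, seq_len=32768)

fp16 = kv_cache_bytes(bits=16, **shape)
int4_ideal = kv_cache_bytes(bits=4, **shape)
int4_grouped = kv_cache_bytes(bits=4, group_size=64, **shape)

print(f"16-bit cache          : {fp16 / 2**30:.2f} GiB")
print(f"int4, no overhead     : {fp16 / int4_ideal:.2f}x smaller")
print(f"int4, scales per 64   : {fp16 / int4_grouped:.2f}x smaller")
```

Under these assumptions, 4x is the ceiling for a 4-bit cache against a 16-bit baseline, and any stored scales or rotation metadata pull the effective ratio slightly below it.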
## See also
- [`majentik/Voxtral-Mini-3B-2507-TurboQuant`](https://huggingface.co/majentik/Voxtral-Mini-3B-2507-TurboQuant): static calibrated KV-cache variant
- [`majentik/Voxtral-Mini-3B-2507-RotorQuant-MLX-8bit`](https://huggingface.co/majentik/Voxtral-Mini-3B-2507-RotorQuant-MLX-8bit): MLX weight-quantized 8-bit
- [`majentik/Voxtral-Mini-3B-2507-RotorQuant-MLX-4bit`](https://huggingface.co/majentik/Voxtral-Mini-3B-2507-RotorQuant-MLX-4bit): MLX weight-quantized 4-bit
- [`majentik/Voxtral-Mini-3B-2507-RotorQuant-MLX-2bit`](https://huggingface.co/majentik/Voxtral-Mini-3B-2507-RotorQuant-MLX-2bit): MLX weight-quantized 2-bit
- [`mistralai/Voxtral-Mini-3B-2507`](https://huggingface.co/mistralai/Voxtral-Mini-3B-2507): upstream base model