---
base_model: mistralai/Voxtral-Mini-4B-Realtime-2602
library_name: transformers
license: apache-2.0
pipeline_tag: automatic-speech-recognition
tags:
- voxtral
- audio
- speech
- speech-recognition
- realtime
- streaming
- asr
- kv-cache
- rotorquant
- quantization
---
# Voxtral-Mini-4B-Realtime-2602-RotorQuant
RotorQuant KV-cache bundle for [`mistralai/Voxtral-Mini-4B-Realtime-2602`](https://huggingface.co/mistralai/Voxtral-Mini-4B-Realtime-2602). It applies a rotational online re-basis to the attention cache and is the preferred variant for noisy, multi-speaker, or code-switching real-time streams.
This artifact ships **only the quantized KV-cache**; the model weights load from the upstream repository.
## Overview
- **Base model:** `mistralai/Voxtral-Mini-4B-Realtime-2602`
- **Capabilities:** real-time ASR, streaming speech-to-text
- **Quantization target:** attention KV-cache only
- **Method:** RotorQuant, an orthogonal rotation followed by low-bit quantization, refreshed per session
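
The method summarized in the list above can be illustrated with a toy example. This is a minimal sketch of the general idea (rotate with an orthogonal matrix, quantize to 4 bits, invert both steps on read), not RotorQuant's actual implementation. The random rotation, the per-row symmetric scaling, and every function name below are assumptions chosen for illustration.

```python
import numpy as np

def random_orthogonal(dim: int, seed: int = 0) -> np.ndarray:
    """Draw a random orthogonal matrix from the QR decomposition of a Gaussian matrix."""
    rng = np.random.default_rng(seed)
    q, r = np.linalg.qr(rng.standard_normal((dim, dim)))
    # Flip column signs using diag(r); q remains orthogonal.
    return q * np.sign(np.diag(r))

def quantize_int4(x: np.ndarray):
    """Symmetric per-row quantization to the signed 4-bit range [-8, 7]."""
    scale = np.abs(x).max(axis=-1, keepdims=True) / 7.0
    scale = np.where(scale == 0.0, 1.0, scale)  # guard against all-zero rows
    codes = np.clip(np.round(x / scale), -8, 7).astype(np.int8)
    return codes, scale

def dequantize(codes: np.ndarray, scale: np.ndarray, rotation: np.ndarray) -> np.ndarray:
    """Undo the scaling, then undo the rotation (the inverse of an orthogonal matrix is its transpose)."""
    return (codes.astype(np.float32) * scale) @ rotation.T

# Hypothetical shapes: 16 cached key vectors with head dimension 64.
head_dim = 64
rotation = random_orthogonal(head_dim)
keys = np.random.default_rng(1).standard_normal((16, head_dim)).astype(np.float32)

codes, scale = quantize_int4(keys @ rotation)   # rotate, then quantize
restored = dequantize(codes, scale, rotation)   # dequantize, then rotate back
max_abs_error = float(np.abs(keys - restored).max())
```

Because the rotation is orthogonal, it preserves vector norms and dot products, so any reconstruction error comes from the 4-bit rounding step alone.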
## Quickstart
```python
from transformers import VoxtralForConditionalGeneration, AutoProcessor
from majentik_quant import RotorQuantCache

model_id = "mistralai/Voxtral-Mini-4B-Realtime-2602"
processor = AutoProcessor.from_pretrained(model_id)
model = VoxtralForConditionalGeneration.from_pretrained(model_id, torch_dtype="auto")

# Load the quantized KV-cache from this repository; the weights come from upstream.
cache = RotorQuantCache.from_pretrained("majentik/Voxtral-Mini-4B-Realtime-2602-RotorQuant")

# audio_stream() and emit() are not defined here; supply your own chunk source and output sink.
for chunk in audio_stream():
    inputs = processor(audio=chunk, return_tensors="pt")
    out = model.generate(**inputs, past_key_values=cache, max_new_tokens=32)
    emit(processor.batch_decode(out, skip_special_tokens=True)[0])
```
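
The Quickstart calls `audio_stream()` and `emit()` without defining them. The sketch below shows one possible shape for these helpers, assuming a mono waveform held in a NumPy array. The 16 kHz sample rate and the 0.5 s chunk length are assumptions, so check the upstream model card for the values the processor expects. Unlike the Quickstart call, this hypothetical `audio_stream` takes the waveform as an argument.

```python
import numpy as np

SAMPLE_RATE = 16_000   # assumed sample rate in Hz
CHUNK_SECONDS = 0.5    # hypothetical chunk duration

def audio_stream(waveform, sample_rate=SAMPLE_RATE, chunk_seconds=CHUNK_SECONDS):
    """Yield consecutive fixed-length slices of a mono waveform; the last slice may be shorter."""
    chunk_size = int(sample_rate * chunk_seconds)
    for start in range(0, len(waveform), chunk_size):
        yield waveform[start:start + chunk_size]

def emit(text):
    """Print a partial transcript immediately instead of buffering it."""
    print(text, flush=True)

# 1.25 s of silence stands in for real audio.
waveform = np.zeros(int(1.25 * SAMPLE_RATE), dtype=np.float32)
chunk_lengths = [len(chunk) for chunk in audio_stream(waveform)]
```

In a live deployment the generator would read from a microphone or network socket rather than from an in-memory array.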
## Model specs
| Field | Value |
|---|---|
| Parameters | 4B |
| Modality | Streaming audio-in, text-out |
| Use case | Real-time ASR |
| Cache quantization | RotorQuant (rotated int4) |
| License | Apache 2.0 |
## RotorQuant vs TurboQuant
| | RotorQuant | TurboQuant |
|---|---|---|
| Strategy | Rotational online re-basis | Per-head static calibration |
| Memory reduction | ~4x on KV-cache | ~3.5x on KV-cache |
| Best for | Noisy/multi-speaker streams | Predictable domains, lowest p50 latency |
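
The memory-reduction figures in the table can be related to bit widths with some simple arithmetic. The sketch below assumes a 16-bit baseline cache and uses a hypothetical model shape (layer count, KV heads, head dimension, sequence length) purely for illustration; none of these values come from the Voxtral configuration. It also shows how storing one 16-bit scale per group of values, with an assumed group size of 128, lowers the ideal 4x ratio slightly.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bits_per_value):
    """Bytes needed to hold keys and values for every layer at the given bit width."""
    values = 2 * num_layers * num_kv_heads * head_dim * seq_len  # 2 = keys + values
    return values * bits_per_value / 8

def effective_bits(bits, group_size, scale_bits=16):
    """Average bits per stored value when each group of values shares one scale."""
    return bits + scale_bits / group_size

# Hypothetical shape, not taken from the Voxtral config.
shape = dict(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=4096)

fp16_bytes = kv_cache_bytes(**shape, bits_per_value=16)
int4_bytes = kv_cache_bytes(**shape, bits_per_value=4)
int4_scaled_bytes = kv_cache_bytes(**shape, bits_per_value=effective_bits(4, group_size=128))

ideal_ratio = fp16_bytes / int4_bytes               # 16 bits / 4 bits
ratio_with_scales = fp16_bytes / int4_scaled_bytes  # 16 bits / 4.125 bits
```

The ratio depends only on the bit widths, not on the model shape, so the same calculation applies to any cache size.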
## See also
- [`majentik/Voxtral-Mini-4B-Realtime-2602-TurboQuant`](https://huggingface.co/majentik/Voxtral-Mini-4B-Realtime-2602-TurboQuant)
- [`majentik/Voxtral-Mini-4B-Realtime-2602-RotorQuant-MLX-8bit`](https://huggingface.co/majentik/Voxtral-Mini-4B-Realtime-2602-RotorQuant-MLX-8bit)
- [`majentik/Voxtral-Mini-4B-Realtime-2602-RotorQuant-MLX-4bit`](https://huggingface.co/majentik/Voxtral-Mini-4B-Realtime-2602-RotorQuant-MLX-4bit)
- [`majentik/Voxtral-Mini-4B-Realtime-2602-RotorQuant-MLX-2bit`](https://huggingface.co/majentik/Voxtral-Mini-4B-Realtime-2602-RotorQuant-MLX-2bit)
- [`mistralai/Voxtral-Mini-4B-Realtime-2602`](https://huggingface.co/mistralai/Voxtral-Mini-4B-Realtime-2602) — upstream base model