# Voxtral-Mini-4B-Realtime-2602-RotorQuant
RotorQuant KV-cache bundle for mistralai/Voxtral-Mini-4B-Realtime-2602: rotational online re-basis of the attention cache, preferred for noisy, multi-speaker, or code-switching real-time streams.

This artifact ships only the quantized KV-cache; the model weights load from upstream.
## Overview
- Base model: mistralai/Voxtral-Mini-4B-Realtime-2602
- Capabilities: real-time ASR, streaming speech-to-text
- Quantization target: attention KV-cache only
- Method: RotorQuant (orthogonal rotation + low-bit quantization, refreshed per session)
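The RotorQuant internals are not documented here, but the core idea named above (rotate the cache into a new orthogonal basis, then quantize at low bit-width) can be sketched as follows. This is an illustrative NumPy round-trip, not the actual implementation: the random orthogonal basis stands in for whatever rotation RotorQuant refreshes per session, and the symmetric per-tensor int4 scheme is an assumption.

```python
import numpy as np

def random_orthogonal(d, seed=0):
    # QR of a Gaussian matrix yields an orthogonal basis -- a stand-in for
    # the per-session rotation RotorQuant refreshes online.
    q, _ = np.linalg.qr(np.random.default_rng(seed).normal(size=(d, d)))
    return q

def quantize_int4(x):
    # Symmetric per-tensor int4: 16 levels in [-8, 7] (assumed scheme).
    scale = np.abs(x).max() / 7.0
    q = np.clip(np.round(x / scale), -8, 7)
    return q, scale

def rotor_roundtrip(kv, rot):
    # Rotate into the new basis, quantize, then dequantize and rotate back.
    q, scale = quantize_int4(kv @ rot)
    return (q * scale) @ rot.T

# Toy KV block: 128 cached positions x 64-dim heads.
kv = np.random.default_rng(1).normal(size=(128, 64)).astype(np.float32)
rot = random_orthogonal(64)
err = np.abs(rotor_roundtrip(kv, rot) - kv).max()
```

Because the rotation is orthogonal, it preserves norms, so quantization error introduced in the rotated basis maps back to a comparably small error in the original basis.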
## Quickstart
```python
from transformers import VoxtralForConditionalGeneration, AutoProcessor
from majentik_quant import RotorQuantCache

model_id = "mistralai/Voxtral-Mini-4B-Realtime-2602"
processor = AutoProcessor.from_pretrained(model_id)
model = VoxtralForConditionalGeneration.from_pretrained(model_id, torch_dtype="auto")

# Only the quantized KV-cache loads from this repo; weights come from upstream.
cache = RotorQuantCache.from_pretrained("majentik/Voxtral-Mini-4B-Realtime-2602-RotorQuant")

for chunk in audio_stream():
    inputs = processor(audio=chunk, return_tensors="pt")
    out = model.generate(**inputs, past_key_values=cache, max_new_tokens=32)
    emit(processor.batch_decode(out, skip_special_tokens=True)[0])
```
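The quickstart leaves `audio_stream()` and `emit()` to the caller. A minimal sketch of both, assuming a mono waveform chunked into fixed-length windows (names and chunking strategy are hypothetical, not part of the bundle):

```python
import numpy as np

def audio_stream(waveform, sr=16000, chunk_s=2.0):
    """Yield fixed-length chunks from a mono waveform (hypothetical helper).

    In a live setting this would read from a microphone or socket instead.
    """
    step = int(sr * chunk_s)
    for start in range(0, len(waveform), step):
        yield waveform[start:start + step]

def emit(text):
    # Replace with your transport: stdout, websocket, message queue, etc.
    print(text, flush=True)
```

Shorter chunks lower latency but give the model less acoustic context per step; 1-2 s is a common starting point for streaming ASR.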
## Model specs
| Field | Value |
|---|---|
| Parameters | 4B |
| Modality | Streaming audio-in, text-out |
| Use case | Real-time ASR |
| Cache quantization | RotorQuant (rotated int4) |
| License | Apache 2.0 |
## RotorQuant vs TurboQuant
| | RotorQuant | TurboQuant |
|---|---|---|
| Strategy | Rotational online re-basis | Per-head static calibration |
| Memory reduction | ~4x on KV-cache | ~3.5x on KV-cache |
| Best for | Noisy/multi-speaker streams | Predictable domains, lowest p50 latency |
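The ~4x figure can be checked with back-of-envelope arithmetic, assuming an fp16 baseline, int4 storage, and a small per-group fp16 scale overhead. The layer/head/dimension counts below are placeholders, not published Voxtral internals:

```python
# Illustrative KV-cache sizing (shape parameters are assumptions).
layers, heads, head_dim, seq = 32, 8, 128, 4096
elems = 2 * layers * heads * head_dim * seq   # 2 = keys + values
fp16_bytes = elems * 2                        # 16 bits per element
int4_bytes = elems // 2                       # 4 bits per element
scale_bytes = elems // 64 * 2                 # one fp16 scale per 64-element group (assumed)
ratio = fp16_bytes / (int4_bytes + scale_bytes)
print(f"{ratio:.2f}x")                        # roughly 3.8x under these assumptions
```

The scale overhead is why the practical reduction lands just under the ideal 4x that pure int4 would give.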
## See also

Model tree for majentik/Voxtral-Mini-4B-Realtime-2602-RotorQuant:

- Base model: mistralai/Ministral-3-3B-Base-2512
- Finetuned: mistralai/Voxtral-Mini-4B-Realtime-2602