---
base_model: mistralai/Voxtral-Mini-4B-Realtime-2602
library_name: transformers
license: apache-2.0
pipeline_tag: automatic-speech-recognition
tags:
- voxtral
- audio
- speech
- speech-recognition
- realtime
- streaming
- asr
- kv-cache
- rotorquant
- quantization
---

# Voxtral-Mini-4B-Realtime-2602-RotorQuant

RotorQuant KV-cache bundle for [`mistralai/Voxtral-Mini-4B-Realtime-2602`](https://huggingface.co/mistralai/Voxtral-Mini-4B-Realtime-2602). It applies a rotational online re-basis of the attention cache and is preferred for noisy, multi-speaker, or code-switching real-time streams.

This artifact ships **only the quantized KV-cache**; weights load from upstream.

## Overview

- **Base model:** `mistralai/Voxtral-Mini-4B-Realtime-2602`
- **Capabilities:** real-time ASR, streaming speech-to-text
- **Quantization target:** attention KV-cache only
- **Method:** RotorQuant (orthogonal rotation + low-bit quantization, refreshed per session)

## Quickstart

```python
from transformers import VoxtralForConditionalGeneration, AutoProcessor
from majentik_quant import RotorQuantCache

model_id = "mistralai/Voxtral-Mini-4B-Realtime-2602"

# Weights and processor come from the upstream repository.
processor = AutoProcessor.from_pretrained(model_id)
model = VoxtralForConditionalGeneration.from_pretrained(model_id, torch_dtype="auto")

# Only the quantized KV-cache is loaded from this repository.
cache = RotorQuantCache.from_pretrained("majentik/Voxtral-Mini-4B-Realtime-2602-RotorQuant")

# audio_stream() and emit() are placeholders to be supplied by your application:
# the first yields audio chunks, the second consumes the transcribed text.
for chunk in audio_stream():
    inputs = processor(audio=chunk, return_tensors="pt")
    out = model.generate(**inputs, past_key_values=cache, max_new_tokens=32)
    emit(processor.batch_decode(out, skip_special_tokens=True)[0])
```

## Model specs

| Field | Value |
|---|---|
| Parameters | 4B |
| Modality | Streaming audio-in, text-out |
| Use case | Real-time ASR |
| Cache quantization | RotorQuant (rotated int4) |
| License | Apache 2.0 |

## RotorQuant vs TurboQuant

| | RotorQuant | TurboQuant |
|---|---|---|
| Strategy | Rotational online re-basis | Per-head static calibration |
| Memory reduction | ~4x on KV-cache | ~3.5x on KV-cache |
| Best for | Noisy/multi-speaker streams | Predictable domains, lowest p50 latency |

## See also

- [`majentik/Voxtral-Mini-4B-Realtime-2602-TurboQuant`](https://huggingface.co/majentik/Voxtral-Mini-4B-Realtime-2602-TurboQuant)
- [`majentik/Voxtral-Mini-4B-Realtime-2602-RotorQuant-MLX-8bit`](https://huggingface.co/majentik/Voxtral-Mini-4B-Realtime-2602-RotorQuant-MLX-8bit)
- [`majentik/Voxtral-Mini-4B-Realtime-2602-RotorQuant-MLX-4bit`](https://huggingface.co/majentik/Voxtral-Mini-4B-Realtime-2602-RotorQuant-MLX-4bit)
- [`majentik/Voxtral-Mini-4B-Realtime-2602-RotorQuant-MLX-2bit`](https://huggingface.co/majentik/Voxtral-Mini-4B-Realtime-2602-RotorQuant-MLX-2bit)
- [`mistralai/Voxtral-Mini-4B-Realtime-2602`](https://huggingface.co/mistralai/Voxtral-Mini-4B-Realtime-2602) (upstream base model)
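
## Appendix: rotation + int4 sketch

The Overview describes the method as an orthogonal rotation followed by low-bit quantization, and the comparison table quotes roughly 4x memory reduction. The NumPy sketch below illustrates that general idea only; it is not the RotorQuant implementation. The random rotation, the per-row symmetric scaling, the fp16 scale storage, and every function name here are assumptions made for illustration.

```python
import numpy as np


def random_rotation(dim: int, seed: int = 0) -> np.ndarray:
    """Return a random orthogonal matrix (a stand-in for the real rotation)."""
    rng = np.random.default_rng(seed)
    q, r = np.linalg.qr(rng.standard_normal((dim, dim)))
    # Flip column signs using diag(R); Q remains orthogonal.
    return q * np.sign(np.diag(r))


def quantize_int4(x: np.ndarray):
    """Symmetric per-row quantization to the int4 range [-8, 7]."""
    scale = np.abs(x).max(axis=-1, keepdims=True) / 7.0
    scale = np.where(scale == 0.0, 1.0, scale)  # guard against all-zero rows
    # Codes are held in int8 here; a real kernel would pack two per byte.
    codes = np.clip(np.round(x / scale), -8, 7).astype(np.int8)
    return codes, scale


def dequantize(codes: np.ndarray, scale: np.ndarray, rotation: np.ndarray) -> np.ndarray:
    """Undo the scaling, then rotate back to the original basis."""
    return (codes.astype(np.float32) * scale) @ rotation.T


head_dim = 64
rotation = random_rotation(head_dim)

# Synthetic "key" vectors with one outlier channel (assumed data, not real activations).
keys = np.random.default_rng(1).standard_normal((16, head_dim)).astype(np.float32)
keys[:, 3] *= 20.0

codes, scale = quantize_int4(keys @ rotation)
restored = dequantize(codes, scale, rotation)

rel_error = np.linalg.norm(restored - keys) / np.linalg.norm(keys)

# Storage estimate: fp16 baseline vs 4-bit codes plus one fp16 scale per row.
fp16_bits = keys.size * 16
int4_bits = codes.size * 4 + scale.size * 16
ratio = fp16_bits / int4_bits

print(f"relative reconstruction error: {rel_error:.3f}")
print(f"memory reduction vs fp16: {ratio:.2f}x")
```

Because the rotation is orthogonal it preserves vector norms, so the quantization error measured in the rotated basis is the same after rotating back. Under the storage assumptions above, the per-row scales are what keep the ratio slightly below the ideal 16/4 = 4x.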