---
base_model: mistralai/Voxtral-Mini-4B-Realtime-2602
library_name: transformers
license: apache-2.0
pipeline_tag: automatic-speech-recognition
tags:
- voxtral
- audio
- speech
- speech-recognition
- realtime
- streaming
- asr
- kv-cache
- rotorquant
- quantization
---

# Voxtral-Mini-4B-Realtime-2602-RotorQuant

RotorQuant KV-cache bundle for [`mistralai/Voxtral-Mini-4B-Realtime-2602`](https://huggingface.co/mistralai/Voxtral-Mini-4B-Realtime-2602). RotorQuant applies a rotational online re-basis to the attention cache and is the preferred variant for noisy, multi-speaker, or code-switching real-time streams.

This artifact ships **only the quantized KV-cache**; the model weights load from the upstream repository.

## Overview

- **Base model:** `mistralai/Voxtral-Mini-4B-Realtime-2602`
- **Capabilities:** real-time ASR, streaming speech-to-text
- **Quantization target:** attention KV-cache only
- **Method:** RotorQuant, which combines an orthogonal rotation with low-bit quantization and is refreshed per session
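
The method named above (orthogonal rotation plus low-bit quantization) can be illustrated with a small NumPy sketch. This is not the RotorQuant implementation: the random rotation obtained from a QR decomposition, the symmetric per-vector int4 scheme, and every function name here are assumptions chosen for illustration only.

```python
import numpy as np

def random_rotation(dim, seed=0):
    """Return a random orthogonal (dim x dim) matrix via QR decomposition."""
    rng = np.random.default_rng(seed)
    q, r = np.linalg.qr(rng.standard_normal((dim, dim)))
    # Flip column signs using the diagonal of r; q stays orthogonal.
    return q * np.sign(np.diag(r))

def quantize_int4(x):
    """Symmetric per-vector quantization to integer codes in [-7, 7]."""
    scale = np.abs(x).max(axis=-1, keepdims=True) / 7.0
    scale = np.where(scale == 0, 1.0, scale)  # avoid division by zero
    codes = np.clip(np.round(x / scale), -7, 7).astype(np.int8)
    return codes, scale

def dequantize(codes, scale, rotation):
    """Undo the scaling, then rotate back to the original basis."""
    return (codes * scale) @ rotation.T

# Toy stand-in for cached key vectors: 16 tokens, head dimension 64.
keys = np.random.default_rng(1).standard_normal((16, 64))
rot = random_rotation(64)
codes, scale = quantize_int4(keys @ rot)   # rotate, then quantize
restored = dequantize(codes, scale, rot)   # approximate reconstruction
max_error = float(np.abs(restored - keys).max())
```

In this sketch, drawing a new `seed` for each session would be one way to mimic a per-session refresh; how RotorQuant actually refreshes its rotation is not specified here.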

## Quickstart

```python
from transformers import VoxtralForConditionalGeneration, AutoProcessor
from majentik_quant import RotorQuantCache

# Model weights and processor come from the upstream repository.
model_id = "mistralai/Voxtral-Mini-4B-Realtime-2602"
processor = AutoProcessor.from_pretrained(model_id)
model = VoxtralForConditionalGeneration.from_pretrained(model_id, torch_dtype="auto")

# Only the quantized KV-cache is loaded from this repository.
cache = RotorQuantCache.from_pretrained("majentik/Voxtral-Mini-4B-Realtime-2602-RotorQuant")

# audio_stream() and emit() are placeholders for your own audio source and output sink.
for chunk in audio_stream():
    inputs = processor(audio=chunk, return_tensors="pt")
    out = model.generate(**inputs, past_key_values=cache, max_new_tokens=32)
    emit(processor.batch_decode(out, skip_special_tokens=True)[0])
```

## Model specs

| Field | Value |
|---|---|
| Parameters | 4B |
| Modality | Streaming audio-in, text-out |
| Use case | Real-time ASR |
| Cache quantization | RotorQuant (rotated int4) |
| License | Apache 2.0 |

## RotorQuant vs TurboQuant

| | RotorQuant | TurboQuant |
|---|---|---|
| Strategy | Rotational online re-basis | Per-head static calibration |
| Memory reduction | ~4x on KV-cache | ~3.5x on KV-cache |
| Best for | Noisy/multi-speaker streams | Predictable domains, lowest p50 latency |
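
To show where a figure like ~4x can come from, here is a back-of-the-envelope calculation. It assumes a 16-bit baseline cache, 4-bit codes, and one 2-byte scale stored per cached vector. The layer count, head count, and head dimension are hypothetical placeholders, not the actual Voxtral configuration, and `kv_bytes_per_token` is an illustrative helper rather than part of any library.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, bits, scale_bytes=0):
    """Bytes of key and value cache stored for a single token."""
    vectors = 2 * layers * kv_heads      # one key and one value vector per head per layer
    payload = head_dim * bits / 8        # packed element storage per vector
    return vectors * (payload + scale_bytes)

# Hypothetical shape, not the actual Voxtral configuration.
fp16 = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128, bits=16)
int4 = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128, bits=4, scale_bytes=2)
ratio = fp16 / int4                      # a little under 4 because of the scale overhead
```

Under these assumptions the ratio falls slightly below 4, since the per-vector scales add a small overhead on top of the 4-bit payload.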

## See also

- [`majentik/Voxtral-Mini-4B-Realtime-2602-TurboQuant`](https://huggingface.co/majentik/Voxtral-Mini-4B-Realtime-2602-TurboQuant)
- [`majentik/Voxtral-Mini-4B-Realtime-2602-RotorQuant-MLX-8bit`](https://huggingface.co/majentik/Voxtral-Mini-4B-Realtime-2602-RotorQuant-MLX-8bit)
- [`majentik/Voxtral-Mini-4B-Realtime-2602-RotorQuant-MLX-4bit`](https://huggingface.co/majentik/Voxtral-Mini-4B-Realtime-2602-RotorQuant-MLX-4bit)
- [`majentik/Voxtral-Mini-4B-Realtime-2602-RotorQuant-MLX-2bit`](https://huggingface.co/majentik/Voxtral-Mini-4B-Realtime-2602-RotorQuant-MLX-2bit)
- [`mistralai/Voxtral-Mini-4B-Realtime-2602`](https://huggingface.co/mistralai/Voxtral-Mini-4B-Realtime-2602) (upstream base model)