---
base_model: mistralai/Voxtral-Mini-4B-Realtime-2602
library_name: transformers
license: apache-2.0
pipeline_tag: automatic-speech-recognition
tags:
  - voxtral
  - audio
  - speech
  - speech-recognition
  - realtime
  - streaming
  - asr
  - kv-cache
  - rotorquant
  - quantization
---

# Voxtral-Mini-4B-Realtime-2602-RotorQuant

RotorQuant KV-cache bundle for [`mistralai/Voxtral-Mini-4B-Realtime-2602`](https://huggingface.co/mistralai/Voxtral-Mini-4B-Realtime-2602). RotorQuant applies a rotational online re-basis to the attention cache and is the preferred variant for noisy, multi-speaker, or code-switching real-time streams.

This artifact ships **only the quantized KV-cache**; the model weights load from the upstream repository.

## Overview

- **Base model:** `mistralai/Voxtral-Mini-4B-Realtime-2602`
- **Capabilities:** real-time ASR, streaming speech-to-text
- **Quantization target:** attention KV-cache only
- **Method:** RotorQuant (orthogonal rotation plus low-bit quantization, refreshed per session)
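
To make the "orthogonal rotation plus low-bit quantization" idea concrete, here is a minimal pure-Python sketch. It is an illustration only, not the RotorQuant implementation: it assumes a normalized Walsh-Hadamard transform as a stand-in rotation and a simple symmetric int4 scheme with one scale per vector. All function names are hypothetical.

```python
import math


def hadamard_rotate(vec):
    """Apply a normalized Walsh-Hadamard transform (an orthogonal rotation).

    The length must be a power of two. The transform is its own inverse,
    so calling it twice returns the original vector.
    """
    n = len(vec)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(vec)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    norm = 1.0 / math.sqrt(n)
    return [x * norm for x in out]


def quantize_int4(vec):
    """Symmetric int4 quantization: integer codes in [-7, 7] plus one scale."""
    peak = max(abs(x) for x in vec)
    scale = peak / 7.0 if peak > 0 else 1.0
    codes = [max(-7, min(7, round(x / scale))) for x in vec]
    return codes, scale


def dequantize_int4(codes, scale):
    """Map integer codes back to floats."""
    return [c * scale for c in codes]


def roundtrip(vec):
    """Rotate, quantize, dequantize, then rotate back."""
    codes, scale = quantize_int4(hadamard_rotate(vec))
    return hadamard_rotate(dequantize_int4(codes, scale))


def l2_error(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Illustrative key vector with one large outlier (made-up values).
key = [8.0, 0.1, -0.2, 0.3, 0.0, -0.1, 0.2, 0.1]
direct = dequantize_int4(*quantize_int4(key))
print("error without rotation:", l2_error(key, direct))
print("error with rotation:   ", l2_error(key, roundtrip(key)))
```

The rotation spreads the outlier's energy across all coordinates before quantization, which is the usual motivation for rotating first. How RotorQuant chooses and refreshes its rotation per session is not specified here.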

## Quickstart

```python
from transformers import VoxtralForConditionalGeneration, AutoProcessor
from majentik_quant import RotorQuantCache

# Weights and processor come from the upstream repository.
model_id = "mistralai/Voxtral-Mini-4B-Realtime-2602"
processor = AutoProcessor.from_pretrained(model_id)
model = VoxtralForConditionalGeneration.from_pretrained(model_id, torch_dtype="auto")

# Only the quantized KV-cache is loaded from this repository.
cache = RotorQuantCache.from_pretrained("majentik/Voxtral-Mini-4B-Realtime-2602-RotorQuant")

# `audio_stream()` and `emit()` are placeholders: supply your own audio chunk
# source and transcript sink.
for chunk in audio_stream():
    inputs = processor(audio=chunk, return_tensors="pt")
    # The same cache object is passed on every chunk.
    out = model.generate(**inputs, past_key_values=cache, max_new_tokens=32)
    emit(processor.batch_decode(out, skip_special_tokens=True)[0])
```

## Model specs

| Field | Value |
|---|---|
| Parameters | 4B |
| Modality | Streaming audio-in, text-out |
| Use case | Real-time ASR |
| Cache quantization | RotorQuant (rotated int4) |
| License | Apache 2.0 |

## RotorQuant vs TurboQuant

| | RotorQuant | TurboQuant |
|---|---|---|
| Strategy | Rotational online re-basis | Per-head static calibration |
| Memory reduction | ~4x on KV-cache | ~3.5x on KV-cache |
| Best for | Noisy/multi-speaker streams | Predictable domains, lowest p50 latency |
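
As a rough check on the ~4x figure, the sketch below computes KV-cache size for a 16-bit baseline against 4-bit storage. The 16-bit baseline and the layer, head, and token counts are assumptions chosen for illustration, not Voxtral's actual configuration, and the function name is hypothetical.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, tokens, bits_per_value):
    """Bytes for keys plus values, ignoring scales and other metadata."""
    values = 2 * layers * kv_heads * head_dim * tokens  # 2 = keys and values
    return values * bits_per_value // 8


# Hypothetical shape, chosen only for illustration.
shape = dict(layers=32, kv_heads=8, head_dim=128, tokens=4096)
fp16 = kv_cache_bytes(bits_per_value=16, **shape)
int4 = kv_cache_bytes(bits_per_value=4, **shape)

print(f"16-bit cache: {fp16 / 2**20:.0f} MiB")  # 16-bit cache: 512 MiB
print(f"4-bit cache:  {int4 / 2**20:.0f} MiB")  # 4-bit cache:  128 MiB
print(f"reduction:    {fp16 / int4:.1f}x")      # reduction:    4.0x
```

Going from 16 to 4 bits per value gives exactly 4x before overhead; per-block scales and rotation state reduce the achieved ratio somewhat, so real figures depend on the implementation.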

## See also

- [`majentik/Voxtral-Mini-4B-Realtime-2602-TurboQuant`](https://huggingface.co/majentik/Voxtral-Mini-4B-Realtime-2602-TurboQuant)
- [`majentik/Voxtral-Mini-4B-Realtime-2602-RotorQuant-MLX-8bit`](https://huggingface.co/majentik/Voxtral-Mini-4B-Realtime-2602-RotorQuant-MLX-8bit)
- [`majentik/Voxtral-Mini-4B-Realtime-2602-RotorQuant-MLX-4bit`](https://huggingface.co/majentik/Voxtral-Mini-4B-Realtime-2602-RotorQuant-MLX-4bit)
- [`majentik/Voxtral-Mini-4B-Realtime-2602-RotorQuant-MLX-2bit`](https://huggingface.co/majentik/Voxtral-Mini-4B-Realtime-2602-RotorQuant-MLX-2bit)
- [`mistralai/Voxtral-Mini-4B-Realtime-2602`](https://huggingface.co/mistralai/Voxtral-Mini-4B-Realtime-2602) (upstream base model)