majentik committed on · Commit aa0480e · verified · Parent(s): 371683c

Add model card

Files changed (1): README.md (+74 −0)
---
base_model: mistralai/Voxtral-Mini-4B-Realtime-2602
library_name: transformers
license: apache-2.0
pipeline_tag: automatic-speech-recognition
tags:
- voxtral
- audio
- speech
- speech-recognition
- realtime
- streaming
- asr
- kv-cache
- rotorquant
- quantization
---
18
+
19
+ # Voxtral-Mini-4B-Realtime-2602-RotorQuant
20
+
21
+ RotorQuant KV-cache bundle for [`mistralai/Voxtral-Mini-4B-Realtime-2602`](https://huggingface.co/mistralai/Voxtral-Mini-4B-Realtime-2602). Rotational online re-basis of the attention cache — preferred for noisy, multi-speaker, or code-switching real-time streams.
22
+
23
+ This artifact ships **only the quantized KV-cache** — weights load from upstream.
24
+
## Overview

- **Base model:** `mistralai/Voxtral-Mini-4B-Realtime-2602`
- **Capabilities:** real-time ASR, streaming speech-to-text
- **Quantization target:** attention KV-cache only
- **Method:** RotorQuant (orthogonal rotation followed by low-bit quantization, refreshed per session)

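The rotate-then-quantize idea behind the method can be sketched in a few lines. This is an illustrative toy, not the actual RotorQuant implementation: it uses a single random orthogonal rotation and one per-tensor int4 scale, whereas the real method re-derives the basis online and refreshes it per session.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy KV vectors: (tokens, head_dim)
kv = rng.standard_normal((128, 64)).astype(np.float32)

# A random orthogonal rotation, obtained via QR decomposition
q, _ = np.linalg.qr(rng.standard_normal((64, 64)))

def quant_int4(x):
    # Symmetric per-tensor int4: integer levels in [-8, 7]
    scale = np.abs(x).max() / 7.0
    return np.clip(np.round(x / scale), -8, 7), scale

def dequant(xq, scale):
    return xq * scale

# Rotate into the new basis, quantize, then dequantize and rotate back
rotated = kv @ q
xq, s = quant_int4(rotated)
kv_hat = dequant(xq, s) @ q.T

err = np.linalg.norm(kv - kv_hat) / np.linalg.norm(kv)
print(f"relative reconstruction error: {err:.4f}")
```

Because the rotation is orthogonal, rotating back with `q.T` recovers the original basis exactly; only the int4 rounding contributes reconstruction error.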
## Quickstart

```python
from transformers import VoxtralForConditionalGeneration, AutoProcessor
from majentik_quant import RotorQuantCache

model_id = "mistralai/Voxtral-Mini-4B-Realtime-2602"

# Weights and processor load from the upstream base model
processor = AutoProcessor.from_pretrained(model_id)
model = VoxtralForConditionalGeneration.from_pretrained(model_id, torch_dtype="auto")

# This repository supplies only the quantized KV-cache bundle
cache = RotorQuantCache.from_pretrained("majentik/Voxtral-Mini-4B-Realtime-2602-RotorQuant")

# audio_stream() and emit() are placeholders for your audio source and output sink
for chunk in audio_stream():
    inputs = processor(audio=chunk, return_tensors="pt")
    out = model.generate(**inputs, past_key_values=cache, max_new_tokens=32)
    emit(processor.batch_decode(out, skip_special_tokens=True)[0])
```

## Model specs

| Field | Value |
|---|---|
| Parameters | 4B |
| Modality | Streaming audio-in, text-out |
| Use case | Real-time ASR |
| Cache quantization | RotorQuant (rotated int4) |
| License | Apache 2.0 |

60
+ ## RotorQuant vs TurboQuant
61
+
62
+ | | RotorQuant | TurboQuant |
63
+ |---|---|---|
64
+ | Strategy | Rotational online re-basis | Per-head static calibration |
65
+ | Memory reduction | ~4x on KV-cache | ~3.5x on KV-cache |
66
+ | Best for | Noisy/multi-speaker streams | Predictable domains, lowest p50 latency |
67
+
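The ~4x figure follows directly from the bit widths: int4 storage is one quarter of fp16. A quick back-of-the-envelope check, using layer/head dimensions that are illustrative assumptions rather than Voxtral's actual configuration:

```python
# Illustrative KV-cache memory arithmetic; these dimensions are assumptions
# for the sake of the example, not the real model configuration.
layers, kv_heads, head_dim, tokens = 32, 8, 128, 4096

elements = 2 * layers * kv_heads * head_dim * tokens  # K and V tensors
fp16_mib = elements * 2 / 2**20    # 2 bytes per element
int4_mib = elements * 0.5 / 2**20  # 4 bits per element

print(f"fp16: {fp16_mib:.0f} MiB, int4: {int4_mib:.0f} MiB, "
      f"ratio: {fp16_mib / int4_mib:.1f}x")
```
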
## See also

- [`majentik/Voxtral-Mini-4B-Realtime-2602-TurboQuant`](https://huggingface.co/majentik/Voxtral-Mini-4B-Realtime-2602-TurboQuant)
- [`majentik/Voxtral-Mini-4B-Realtime-2602-RotorQuant-MLX-8bit`](https://huggingface.co/majentik/Voxtral-Mini-4B-Realtime-2602-RotorQuant-MLX-8bit)
- [`majentik/Voxtral-Mini-4B-Realtime-2602-RotorQuant-MLX-4bit`](https://huggingface.co/majentik/Voxtral-Mini-4B-Realtime-2602-RotorQuant-MLX-4bit)
- [`majentik/Voxtral-Mini-4B-Realtime-2602-RotorQuant-MLX-2bit`](https://huggingface.co/majentik/Voxtral-Mini-4B-Realtime-2602-RotorQuant-MLX-2bit)
- [`mistralai/Voxtral-Mini-4B-Realtime-2602`](https://huggingface.co/mistralai/Voxtral-Mini-4B-Realtime-2602) (upstream base model)