majentik committed (verified)
Commit 088389d · Parent(s): 1b5c4d7

Add MLX quantized model

Files changed (5):
  1. .gitattributes +1 -0
  2. README.md +77 -0
  3. config.json +52 -0
  4. model.safetensors +3 -0
  5. tekken.json +3 -0
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ tekken.json filter=lfs diff=lfs merge=lfs -text
README.md ADDED
---
base_model: mistralai/Voxtral-Mini-4B-Realtime-2602
library_name: mlx
license: apache-2.0
pipeline_tag: automatic-speech-recognition
tags:
- voxtral
- audio
- speech
- speech-recognition
- realtime
- streaming
- asr
- mlx
- rotorquant
- quantization
- 2-bit
---

# Voxtral-Mini-4B-Realtime-2602-RotorQuant-MLX-2bit

2-bit MLX weight-quantized build of [`mistralai/Voxtral-Mini-4B-Realtime-2602`](https://huggingface.co/mistralai/Voxtral-Mini-4B-Realtime-2602) with a RotorQuant KV-cache. An ultra-compact real-time ASR model for memory-constrained Apple Silicon, aimed at the best available 2-bit stability on streaming audio.

## Overview

- **Base:** `mistralai/Voxtral-Mini-4B-Realtime-2602` — 4B-parameter real-time ASR model
- **Weight precision:** 2-bit (group-wise)
- **KV-cache profile:** RotorQuant
- **Approx. on-disk size:** ~1.4 GB
- **Runtime:** MLX on Apple Silicon

## Quickstart

```bash
pip install mlx-lm
```

```python
from mlx_lm import load, generate

model, tokenizer = load("majentik/Voxtral-Mini-4B-Realtime-2602-RotorQuant-MLX-2bit")

# audio_stream() and emit() are placeholders: supply your own chunked
# audio source and transcript sink.
for chunk in audio_stream():
    prompt = tokenizer.apply_chat_template(
        [{"role": "user", "content": [{"type": "audio", "path": chunk}]}],
        add_generation_prompt=True,
    )
    emit(generate(model, tokenizer, prompt=prompt, max_tokens=32))
```

## Model specs

| Field | Value |
|---|---|
| Parameters | 4B |
| Weight bits | 2 |
| Group size | 64 |
| Cache profile | RotorQuant |
| Size on disk | ~1.4 GB |
| Target hardware | Apple Silicon (M1/M2/M3/M4) |
| License | Apache 2.0 |

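The weight format above is standard group-wise affine quantization: each run of `group_size` consecutive weights (64 here, per `config.json`) shares its own scale and offset, so the four 2-bit codes only have to cover that group's local range. A minimal numpy sketch of the idea, for illustration only (this is not the MLX kernel):

```python
import numpy as np

def quantize_groupwise(w, bits=2, group_size=64):
    """Affine group-wise quantization: each group of `group_size`
    consecutive weights gets its own scale and offset, so the 2-bit
    codes (0..3) only need to span that group's local range."""
    levels = 2 ** bits - 1                       # 3 distinct steps for 2-bit
    groups = w.reshape(-1, group_size)
    lo = groups.min(axis=1, keepdims=True)
    hi = groups.max(axis=1, keepdims=True)
    scale = np.where(hi > lo, (hi - lo) / levels, 1.0)
    q = np.clip(np.round((groups - lo) / scale), 0, levels).astype(np.uint8)
    return q, scale, lo

def dequantize_groupwise(q, scale, lo, shape):
    """Reconstruct approximate weights from codes, scales, and offsets."""
    return (q * scale + lo).reshape(shape)

rng = np.random.default_rng(0)
w = rng.standard_normal(4096).astype(np.float32)
q, scale, lo = quantize_groupwise(w)
w_hat = dequantize_groupwise(q, scale, lo, w.shape)
```

Per-group scales are why 2-bit remains usable at all: the rounding error of any element is bounded by half of its own group's scale rather than by the global dynamic range.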
## RotorQuant vs TurboQuant

| | RotorQuant | TurboQuant |
|---|---|---|
| Strategy | Rotational online re-basis | Per-head static calibration |
| Memory reduction | ~4x on KV-cache | ~3.5x on KV-cache |
| Best for | Noisy/multi-speaker streams | Predictable domains, lowest p50 latency |

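RotorQuant's "rotational online re-basis" is not specified here, but the family it names is well known: multiply the cached vectors by an orthogonal matrix before low-bit quantization so outlier energy is spread across dimensions, then rotate back after dequantization. A toy numpy sketch of that general effect (the rotation `R` and the per-tensor 2-bit quantizer below are illustrative assumptions, not the shipped kernels):

```python
import numpy as np

def quant2bit(x):
    """Per-tensor affine 2-bit quantize/dequantize (4 levels, min-max)."""
    lo, hi = x.min(), x.max()
    scale = (hi - lo) / 3.0
    q = np.clip(np.round((x - lo) / scale), 0, 3)
    return q * scale + lo

rng = np.random.default_rng(0)
x = rng.standard_normal(256)
x[0] = 20.0                          # a single outlier stretches the 2-bit range

# Random orthogonal rotation (QR decomposition of a Gaussian matrix)
R, _ = np.linalg.qr(rng.standard_normal((256, 256)))

err_plain = np.linalg.norm(quant2bit(x) - x)
# Rotate, quantize in the rotated basis, rotate back
err_rot = np.linalg.norm(R.T @ quant2bit(R @ x) - x)
```

After rotation the outlier's energy is smeared over all 256 coordinates, so the quantizer's range shrinks and the reconstruction error drops, which is the intuition behind preferring a rotational profile on noisy streams.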
## See also

- [`majentik/Voxtral-Mini-4B-Realtime-2602-RotorQuant-MLX-4bit`](https://huggingface.co/majentik/Voxtral-Mini-4B-Realtime-2602-RotorQuant-MLX-4bit)
- [`majentik/Voxtral-Mini-4B-Realtime-2602-RotorQuant-MLX-8bit`](https://huggingface.co/majentik/Voxtral-Mini-4B-Realtime-2602-RotorQuant-MLX-8bit)
- [`majentik/Voxtral-Mini-4B-Realtime-2602-TurboQuant-MLX-2bit`](https://huggingface.co/majentik/Voxtral-Mini-4B-Realtime-2602-TurboQuant-MLX-2bit)
- [`majentik/Voxtral-Mini-4B-Realtime-2602-RotorQuant`](https://huggingface.co/majentik/Voxtral-Mini-4B-Realtime-2602-RotorQuant) — KV-cache-only bundle
- [`mistralai/Voxtral-Mini-4B-Realtime-2602`](https://huggingface.co/mistralai/Voxtral-Mini-4B-Realtime-2602) — upstream base model
config.json ADDED
{
  "model_type": "voxtral_realtime",
  "decoder": {
    "dim": 3072,
    "n_layers": 26,
    "head_dim": 128,
    "hidden_dim": 9216,
    "n_heads": 32,
    "n_kv_heads": 8,
    "vocab_size": 131072,
    "norm_eps": 1e-05,
    "rope_theta": 1000000.0,
    "sliding_window": 8192,
    "tied_embeddings": true,
    "ada_rms_norm_t_cond": true,
    "ada_rms_norm_t_cond_dim": 32
  },
  "encoder_args": {
    "audio_encoding_args": {
      "sampling_rate": 16000,
      "frame_rate": 12.5,
      "num_mel_bins": 128,
      "hop_length": 160,
      "window_size": 400,
      "chunk_length_s": null,
      "global_log_mel_max": 1.5,
      "transcription_format": "streaming"
    },
    "dim": 1280,
    "n_layers": 32,
    "head_dim": 64,
    "hidden_dim": 5120,
    "n_heads": 32,
    "vocab_size": 131072,
    "n_kv_heads": 32,
    "use_biases": true,
    "use_cache": false,
    "rope_theta": 1000000.0,
    "causal": true,
    "norm_eps": 1e-05,
    "pos_embed": "rope",
    "max_source_positions": null,
    "ffn_type": "swiglu",
    "norm_type": "rms_norm",
    "sliding_window": 750,
    "downsample_factor": 4
  },
  "quantization_config": {
    "bits": 2,
    "group_size": 64
  }
}
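As a sanity check, the config's dimensions roughly reproduce the advertised 4B parameter count and the safetensors size. The sketch below counts only the large projection matrices (norms, biases, and the adaRMS conditioning are negligible), and the per-group overhead (an fp16 scale and offset per 64-weight group, about 0.5 extra bits per weight) is an assumption, not a documented MLX figure:

```python
def decoder_params(dim=3072, n_layers=26, head_dim=128, n_heads=32,
                   n_kv_heads=8, hidden_dim=9216, vocab_size=131072):
    """Count the decoder's big matmul weights from config.json."""
    q = dim * n_heads * head_dim
    kv = 2 * dim * n_kv_heads * head_dim   # GQA: 8 KV heads
    o = n_heads * head_dim * dim
    mlp = 3 * dim * hidden_dim             # SwiGLU: gate, up, down
    embed = vocab_size * dim               # tied with the output head
    return n_layers * (q + kv + o + mlp) + embed

def encoder_params(dim=1280, n_layers=32, n_heads=32, head_dim=64,
                   hidden_dim=5120):
    """Count the audio encoder's big matmul weights from config.json."""
    attn = 4 * dim * n_heads * head_dim    # q, k, v, o (n_kv_heads == n_heads)
    mlp = 3 * dim * hidden_dim             # SwiGLU
    return n_layers * (attn + mlp)

total = decoder_params() + encoder_params()
# 2-bit weights plus an assumed ~0.5 bits/weight of group-wise scale/offset
approx_bytes = total * (2 + 0.5) / 8
print(f"{total / 1e9:.2f}B params, ~{approx_bytes / 1e9:.2f} GB at 2-bit")
```

Under these assumptions the estimate lands close to the 1,395,968,235-byte `model.safetensors` recorded in this commit.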
model.safetensors ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:aa1c4777e2d71e4db1b3542ad16494f443779f0a79a00219d0fd6d6bdb691c0b
size 1395968235
tekken.json ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:8434af1d39eba99f0ef46cf1450bf1a63fa941a26933a1ef5dbbf4adf0d00e44
size 14910348