majentik committed on
Commit 4b447bf · verified · 1 Parent(s): d3d7ddb

Add model card (weights pending mlx_lm mistral3 architecture support)

  1. README.md +91 -0
README.md ADDED
@@ -0,0 +1,91 @@
---
base_model: mistralai/Mistral-Small-4-119B-2603
library_name: mlx
license: apache-2.0
tags:
- rotorquant
- kv-cache-quantization
- mistral
- moe
- sparse-moe
- multimodal
- quantized
- mlx
- 2-bit
- apple-silicon
- 256k-context
- thinking
pipeline_tag: text-generation
---

# Mistral-Small-4-119B-RotorQuant-MLX-2bit

**Dual compression: 2-bit MLX weight quantization + RotorQuant KV cache quantization** for Mistral Small 4 on Apple Silicon.

This repository provides a 2-bit weight-quantized MLX conversion of [mistralai/Mistral-Small-4-119B-2603](https://huggingface.co/mistralai/Mistral-Small-4-119B-2603) with RotorQuant KV cache quantization support. The aggressive compression is aimed at running the model on consumer Apple Silicon hardware.

## Overview

This model applies two complementary compression techniques:

1. **2-bit weight quantization (MLX)** -- reduces model weights from ~238 GB to ~30 GB
2. **RotorQuant KV cache quantization** -- reduces KV cache from ~32 GB to ~6.5 GB at 256K context

Together these bring the estimated footprint to about 36.5 GB, which lets a 119B-parameter MoE model run on Apple Silicon Macs with 64 GB or more of unified memory.
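
The weight figures above follow from simple arithmetic on the parameter count. The sketch below is a back-of-the-envelope check, not part of the conversion tooling; the function name `weight_memory_gb` is hypothetical, and it assumes decimal gigabytes and ignores any per-group scale metadata a quantized format stores, so real files may be somewhat larger.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


TOTAL_PARAMS = 119e9  # total parameter count stated in this model card

fp16_gb = weight_memory_gb(TOTAL_PARAMS, 16)    # FP16 baseline
two_bit_gb = weight_memory_gb(TOTAL_PARAMS, 2)  # 2-bit weights, metadata ignored

print(f"FP16:  {fp16_gb:.2f} GB")
print(f"2-bit: {two_bit_gb:.2f} GB")
```

The 2-bit result (29.75 GB) matches the ~30 GB quoted above once rounded.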

## Model Specs

| Property | Value |
|---|---|
| Base Model | Mistral Small 4 (March 2026) |
| Total Parameters | 119B |
| Active Parameters | 6.5B per token (Sparse MoE) |
| Architecture | Sparse MoE -- 128 experts, 4 active per token |
| Context Length | 256K tokens |
| Modality | Text + Images (multimodal) |
| Capabilities | Thinking / reasoning, tool use, multilingual |
| License | Apache 2.0 |
| Weight Quantization | 2-bit (MLX) |
| KV Cache Quantization | RotorQuant 3-bit |
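
To make the "128 experts, 4 active per token" row concrete, here is a minimal sketch of generic top-k expert routing. It is an illustration only: the source does not describe the actual gating function used by Mistral Small 4, so the softmax-over-selected-experts scheme, the function name `top_k_route`, and the random router logits are all assumptions.

```python
import math
import random


def top_k_route(logits, k):
    """Pick the k highest-scoring experts and softmax-normalize their scores."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)            # subtract max for stability
    exps = [math.exp(logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


random.seed(0)
NUM_EXPERTS, ACTIVE_EXPERTS = 128, 4              # values from the table above

# Hypothetical router scores for a single token.
router_logits = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
routes = top_k_route(router_logits, ACTIVE_EXPERTS)

for expert_id, weight in routes:
    print(f"expert {expert_id:3d} weight {weight:.3f}")
```

Only the selected experts run for that token, which is why per-token compute tracks the active parameter count rather than the total.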

## Memory Estimates

| Configuration | Weights | KV Cache (256K) | Total |
|---|---|---|---|
| FP16 baseline | ~238 GB | ~32 GB | ~270 GB |
| **This model (2-bit MLX + RotorQuant)** | **~30 GB** | **~6.5 GB** | **~36.5 GB** |

> **Note:** This is a Sparse MoE model -- only 6.5B parameters are active per token, so per-token compute is far lower than the 119B total parameter count suggests, although all experts must still fit in memory. The 2-bit quantization trades some quality for significantly reduced memory. Expect modest degradation on complex reasoning tasks compared to 4-bit.
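
The table's figures can be cross-checked with a few lines of arithmetic. This is a sketch using only the numbers quoted above; the 64 GB machine size comes from the Overview, and reading the result as "3-bit values plus a small per-group overhead" is an interpretation, not something the source states.

```python
FP16_BITS = 16

# Figures quoted in the Memory Estimates table (decimal GB).
kv_fp16_gb = 32.0
kv_rotorquant_gb = 6.5
weights_2bit_gb = 30.0
unified_memory_gb = 64.0  # smallest machine size named in the Overview

compression_ratio = kv_fp16_gb / kv_rotorquant_gb   # roughly 4.9x
effective_bits = FP16_BITS / compression_ratio      # bits per cached element
total_gb = weights_2bit_gb + kv_rotorquant_gb
headroom_gb = unified_memory_gb - total_gb          # left for OS and activations

print(f"KV compression: {compression_ratio:.2f}x")
print(f"Effective bits per KV element: {effective_bits:.2f}")
print(f"Total: {total_gb:.1f} GB, headroom on 64 GB: {headroom_gb:.1f} GB")
```

An effective width of about 3.25 bits is consistent with the "RotorQuant 3-bit" entry in the specs table.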

## Quickstart

```python
from mlx_lm import load, generate

# Note: per the commit message, weights are pending mlx_lm support for the
# mistral3 architecture, so load() may fail until they are uploaded.
model, tokenizer = load("majentik/Mistral-Small-4-119B-RotorQuant-MLX-2bit")

# Wrap the prompt in the model's chat template before generating.
prompt = "Explain sparse mixture-of-experts architectures."
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

response = generate(model, tokenizer, prompt=text, max_tokens=512)
print(response)
```

## What is RotorQuant?

[RotorQuant](https://github.com/scrya-com/rotorquant) is a rotation-based KV cache quantization method that applies learned rotations before quantizing the key-value cache. Reported results on the base model:

- **5.3x faster prefill** than the unquantized baseline
- **28% higher decode** throughput
- **Perplexity: 6.91** vs 7.07 for unquantized (lower is better)

Because it targets the KV cache rather than weights, it stacks with weight quantization for compounding memory savings.
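
The general idea of rotating before quantizing can be sketched in a few lines. This is not the RotorQuant algorithm: it assumes a fixed 45-degree rotation applied to coordinate pairs (RotorQuant is described above as using learned rotations) and plain symmetric absmax rounding to 3 bits. All function names here are hypothetical.

```python
import math


def rotate_pairs(vec, angle):
    """Apply the same 2-D rotation to each consecutive coordinate pair."""
    c, s = math.cos(angle), math.sin(angle)
    out = []
    for i in range(0, len(vec), 2):
        x, y = vec[i], vec[i + 1]
        out.extend([c * x - s * y, s * x + c * y])
    return out


def quantize(vec, bits):
    """Symmetric absmax quantization to signed integer codes."""
    qmax = 2 ** (bits - 1) - 1                    # 3 for 3-bit codes
    scale = max(abs(v) for v in vec) / qmax
    return [round(v / scale) for v in vec], scale


def dequantize(codes, scale):
    """Map integer codes back to approximate real values."""
    return [q * scale for q in codes]


key = [8.0, 0.1, -0.2, 0.3]   # toy cache vector with one outlier
angle = math.pi / 4           # fixed here; an assumption, not a learned rotation

rotated = rotate_pairs(key, angle)                # spreads the outlier over a pair
codes, scale = quantize(rotated, bits=3)
restored = rotate_pairs(dequantize(codes, scale), -angle)  # undo the rotation
error = math.sqrt(sum((a - b) ** 2 for a, b in zip(key, restored)))

print("codes:", codes)
print(f"reconstruction error (L2): {error:.3f}")
```

Because the rotation is orthogonal, it preserves vector norms and can be undone exactly, so the only loss comes from the rounding step in the rotated domain.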

## See Also

- [mistralai/Mistral-Small-4-119B-2603](https://huggingface.co/mistralai/Mistral-Small-4-119B-2603) -- Base model
- [majentik/Mistral-Small-4-119B-RotorQuant](https://huggingface.co/majentik/Mistral-Small-4-119B-RotorQuant) -- KV cache only (no weight quantization)
- [majentik/Mistral-Small-4-119B-RotorQuant-MLX-4bit](https://huggingface.co/majentik/Mistral-Small-4-119B-RotorQuant-MLX-4bit) -- 4-bit MLX variant
- [majentik/Mistral-Small-4-119B-RotorQuant-MLX-1bit](https://huggingface.co/majentik/Mistral-Small-4-119B-RotorQuant-MLX-1bit) -- 1-bit MLX variant
- [RotorQuant GitHub](https://github.com/scrya-com/rotorquant)