majentik committed on
Commit a0e777e · verified · 1 Parent(s): b07b440

Add model card

Files changed (1): README.md (+93, -0)
---
base_model: mistralai/Mistral-Small-4-119B-2603
library_name: transformers
license: apache-2.0
tags:
- rotorquant
- kv-cache-quantization
- mistral
- moe
- sparse-moe
- multimodal
- quantized
- 256k-context
- thinking
pipeline_tag: text-generation
---

# Mistral-Small-4-119B-RotorQuant

**KV cache quantization for Mistral Small 4 using RotorQuant** -- 5.3x faster prefill, 28% faster decode, with near-lossless quality (perplexity 6.91 vs. 7.07 baseline).

This repository provides RotorQuant KV cache quantization support for [mistralai/Mistral-Small-4-119B-2603](https://huggingface.co/mistralai/Mistral-Small-4-119B-2603). Model weights are unchanged (FP16); only the KV cache is quantized during inference.

+ ## Model Specs
25
+
26
+ | Property | Value |
27
+ |---|---|
28
+ | Base Model | Mistral Small 4 (March 2026) |
29
+ | Total Parameters | 119B |
30
+ | Active Parameters | 6.5B per token (Sparse MoE) |
31
+ | Architecture | Sparse MoE -- 128 experts, 4 active per token |
32
+ | Context Length | 256K tokens |
33
+ | Modality | Text + Images (multimodal) |
34
+ | Capabilities | Thinking / reasoning, tool use, multilingual |
35
+ | License | Apache 2.0 |
36
+ | Quantization | KV cache only (RotorQuant) |
37
+
38
+ ## What is RotorQuant?
39
+
40
+ [RotorQuant](https://github.com/scrya-com/rotorquant) is a rotation-based KV cache quantization method that applies learned rotations before quantizing the key-value cache. Key results:
41
+
42
+ - **5.3x faster prefill** compared to unquantized baseline
43
+ - **28% faster decode** throughput
44
+ - **Perplexity: 6.91** vs 7.07 for unquantized (lower is better -- RotorQuant actually improves quality due to outlier suppression)
45
+ - Default 3-bit quantization with minimal quality loss
46
+
47
+ ## Memory Estimates
48
+
49
+ | Component | FP16 Baseline | RotorQuant 3-bit |
50
+ |---|---|---|
51
+ | Model Weights | ~238 GB | ~238 GB |
52
+ | KV Cache (256K ctx) | ~32 GB | ~6.5 GB |
53
+ | **Total** | **~270 GB** | **~244.5 GB** |
54
+
55
+ > **Note:** This is a Sparse MoE model -- only 6.5B parameters are active per token, so inference is fast despite the 119B total parameter count.
56
+
57
+ ## Quickstart
58
+
59
+ ```python
60
+ from transformers import AutoModelForCausalLM, AutoTokenizer
61
+ from turboquant import IsoQuantCache
62
+
63
+ model_id = "majentik/Mistral-Small-4-119B-RotorQuant"
64
+
65
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
66
+ model = AutoModelForCausalLM.from_pretrained(
67
+ model_id,
68
+ torch_dtype="auto",
69
+ device_map="auto",
70
+ )
71
+
72
+ # Enable RotorQuant KV cache
73
+ cache = IsoQuantCache(model)
74
+
75
+ messages = [
76
+ {"role": "user", "content": "Explain sparse mixture-of-experts architectures."}
77
+ ]
78
+
79
+ inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").to(model.device)
80
+ outputs = model.generate(
81
+ inputs,
82
+ max_new_tokens=512,
83
+ past_key_values=cache,
84
+ )
85
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
86
+ ```
87
+
88
+ ## See Also
89
+
90
+ - [mistralai/Mistral-Small-4-119B-2603](https://huggingface.co/mistralai/Mistral-Small-4-119B-2603) -- Base model
91
+ - [majentik/Mistral-Small-4-119B-TurboQuant](https://huggingface.co/majentik/Mistral-Small-4-119B-TurboQuant) -- TurboQuant KV cache variant
92
+ - [majentik/Mistral-Small-4-119B-RotorQuant-MLX-4bit](https://huggingface.co/majentik/Mistral-Small-4-119B-RotorQuant-MLX-4bit) -- MLX 4-bit weight-quantized + RotorQuant
93
+ - [RotorQuant GitHub](https://github.com/scrya-com/rotorquant)