majentik commited on
Commit
fe57fd2
Β·
verified Β·
1 Parent(s): feab54a

Update KV-cache card with accurate template and fork requirements

Browse files
Files changed (1) hide show
  1. README.md +94 -57
README.md CHANGED
@@ -1,101 +1,138 @@
1
  ---
 
2
  base_model: mistralai/Leanstral-2603
3
- library_name: transformers
4
  tags:
5
  - rotorquant
6
  - kv-cache-quantization
7
- - leanstral
 
8
  - lean4
9
  - formal-proofs
10
- - theorem-proving
11
  - quantized
12
- - mistral
13
- - moe
14
- license: apache-2.0
15
  ---
16
 
17
  # Leanstral-RotorQuant
18
 
19
- **KV-cache quantized [Leanstral-2603](https://huggingface.co/mistralai/Leanstral-2603) using [RotorQuant](https://github.com/scrya-com/rotorquant) for high-throughput Lean 4 formal proof generation.**
 
 
20
 
21
- Leanstral is the first open-source AI agent purpose-built for Lean 4 formal proofs -- generating both executable code and machine-checkable mathematical proofs. This variant applies RotorQuant KV-cache quantization, delivering **5.3x faster prefill** and **28% faster decode** compared to TurboQuant while preserving full BF16 model weights.
22
 
23
- ## Overview
24
 
25
- This repository provides the **RotorQuant KV-cache-only** configuration of Leanstral-2603. The model weights remain at full precision; only the KV cache is quantized during inference using RotorQuant's rotation-aware quantization scheme.
 
 
 
26
 
27
- | Spec | Value |
28
- |------|-------|
29
- | Base model | [mistralai/Leanstral-2603](https://huggingface.co/mistralai/Leanstral-2603) |
30
- | Architecture | Mistral MoE (~119B parameters, 7 consolidated shards) |
31
- | Compression | RotorQuant KV-cache quantization |
32
- | Weight precision | BF16 (unmodified) |
33
- | KV-cache precision | Mixed-precision quantized |
34
- | Prefill speedup | 5.3x vs TurboQuant |
35
- | Decode speedup | 28% vs TurboQuant |
36
- | License | Apache 2.0 |
37
- | Use case | Lean 4 formal verification, theorem proving, mathematical proofs |
38
 
39
  ## Quickstart
40
 
 
 
 
 
 
 
 
 
 
 
41
  ```python
 
42
  from transformers import AutoModelForCausalLM, AutoTokenizer
43
- from turboquant import IsoQuantCache
44
 
45
- model_id = "majentik/Leanstral-RotorQuant"
46
-
47
- tokenizer = AutoTokenizer.from_pretrained(model_id)
48
  model = AutoModelForCausalLM.from_pretrained(
49
- model_id,
 
50
  device_map="auto",
51
- torch_dtype="auto",
52
  )
53
 
54
- # Enable RotorQuant KV-cache quantization
55
- cache = IsoQuantCache(model)
56
-
57
- prompt = "Prove that for all natural numbers n, n + 0 = n in Lean 4:"
58
- inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
59
 
 
60
  outputs = model.generate(
61
  **inputs,
62
- max_new_tokens=512,
63
  past_key_values=cache,
 
64
  )
65
- print(tokenizer.decode(outputs[0], skip_special_tokens=True))
66
  ```
67
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
68
  ## What is RotorQuant?
69
 
70
- [RotorQuant](https://github.com/scrya-com/rotorquant) is an advanced KV-cache quantization method that leverages rotation-aware quantization to achieve superior throughput compared to standard KV-cache compression. By exploiting the rotary positional embedding structure, RotorQuant achieves:
 
 
 
 
 
 
 
71
 
72
- - **5.3x faster prefill** -- critical for long Lean 4 proof contexts
73
- - **28% faster decode** -- faster token-by-token proof generation
74
- - Equivalent memory savings to TurboQuant with better computational efficiency
75
 
76
- This makes RotorQuant the preferred choice for interactive theorem proving sessions where latency matters.
77
 
78
- ## Memory Estimates
 
 
 
 
 
 
 
 
79
 
80
- | Component | Estimate |
81
- |-----------|----------|
82
- | Model weights (BF16) | ~238 GB |
83
- | KV-cache savings | 2-4x reduction vs FP16 KV cache |
84
- | Recommended VRAM | 4x A100 80GB or equivalent |
85
 
86
- ## Lean 4 Use Case
87
 
88
- Leanstral excels at:
89
- - **Formal verification** -- generating machine-checkable proofs of mathematical theorems
90
- - **Theorem proving** -- interactive and automated proof search in Lean 4
91
- - **Code generation** -- writing verified Lean 4 programs with correctness guarantees
92
- - **Proof repair** -- fixing incomplete or broken proof scripts
93
 
94
  ## See Also
95
 
96
- - [mistralai/Leanstral-2603](https://huggingface.co/mistralai/Leanstral-2603) -- Base model
97
- - [majentik/Leanstral-TurboQuant](https://huggingface.co/majentik/Leanstral-TurboQuant) -- TurboQuant KV-cache variant
98
- - [majentik/Leanstral-RotorQuant-MLX-4bit](https://huggingface.co/majentik/Leanstral-RotorQuant-MLX-4bit) -- MLX 4-bit + RotorQuant
99
- - [majentik/Leanstral-RotorQuant-MLX-2bit](https://huggingface.co/majentik/Leanstral-RotorQuant-MLX-2bit) -- MLX 2-bit + RotorQuant
100
- - [majentik/Leanstral-RotorQuant-MLX-1bit](https://huggingface.co/majentik/Leanstral-RotorQuant-MLX-1bit) -- MLX 1-bit + RotorQuant
101
- - [RotorQuant repository](https://github.com/scrya-com/rotorquant)
 
1
  ---
2
+ license: apache-2.0
3
  base_model: mistralai/Leanstral-2603
 
4
  tags:
5
  - rotorquant
6
  - kv-cache-quantization
7
+ - mistral
8
+ - moe
9
  - lean4
10
  - formal-proofs
 
11
  - quantized
12
+ library_name: transformers
13
+ pipeline_tag: text-generation
 
14
  ---
15
 
16
  # Leanstral-RotorQuant
17
 
18
+ **RotorQuant KV cache compression** for [mistralai/Leanstral-2603](https://huggingface.co/mistralai/Leanstral-2603).
19
+
20
+ This is a **documentation repository** that explains how to combine Leanstral's weights with RotorQuant inference-time KV cache compression. No weights are stored here β€” use the base model directly and apply RotorQuant via the Python package or llama.cpp fork.
21
 
22
+ ## What is this?
23
 
24
+ KV cache compression reduces the memory used by the attention cache during inference. Unlike weight quantization (which is baked into the GGUF/MLX file), KV cache compression is applied at runtime β€” so the same base weights can be used with or without compression.
25
 
26
+ | Technique | Where it's applied | Savings |
27
+ |-----------|-------------------|---------|
28
+ | Weight quantization (GGUF/MLX/AWQ) | Baked into model file | Reduces disk + weight memory |
29
+ | **RotorQuant KV cache** | At inference time | Reduces attention memory (critical for long context) |
30
 
31
+ Both can be combined for maximum efficiency.
 
 
 
 
 
 
 
 
 
 
32
 
33
  ## Quickstart
34
 
35
+ ### Option A β€” Python / transformers
36
+
37
+ Install the `rotorquant` package:
38
+
39
+ ```bash
40
+ pip install rotorquant
41
+ ```
42
+
43
+ Then use it with the base model:
44
+
45
  ```python
46
+ import torch
47
  from transformers import AutoModelForCausalLM, AutoTokenizer
48
+ from rotorquant import IsoQuantCache
49
 
50
+ tokenizer = AutoTokenizer.from_pretrained("mistralai/Leanstral-2603", trust_remote_code=True)
 
 
51
  model = AutoModelForCausalLM.from_pretrained(
52
+ "mistralai/Leanstral-2603",
53
+ torch_dtype=torch.bfloat16,
54
  device_map="auto",
55
+ trust_remote_code=True,
56
  )
57
 
58
+ # Apply RotorQuant to the KV cache
59
+ cache = IsoQuantCache(bits=4) # or bits=2 for more aggressive compression
 
 
 
60
 
61
+ inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)
62
  outputs = model.generate(
63
  **inputs,
64
+ max_new_tokens=128,
65
  past_key_values=cache,
66
+ use_cache=True,
67
  )
68
+ print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
69
  ```
70
 
71
+
72
+ ### Option B β€” llama.cpp / LM Studio / Ollama (with fork)
73
+
74
+ RotorQuant KV cache types (`iso3`) are **not** in upstream llama.cpp. They require:
75
+ - [llama-cpp-turboquant fork](https://github.com/johndpope/llama-cpp-turboquant/tree/feature/planarquant-kv-cache)
76
+
77
+ Once built:
78
+
79
+ ```bash
80
+ llama-cli -m Leanstral.gguf \
81
+ --cache-type-k iso3 --cache-type-v iso3 \
82
+ -ngl 99 -fa \
83
+ -p "Hello"
84
+ ```
85
+
86
+ For standard runtimes (LM Studio, Ollama, upstream llama.cpp), use conventional KV cache types (`q8_0`, `q4_0`). You lose the RotorQuant-specific benefits but keep GGUF weight quantization.
87
+
88
+ ## Model Specifications
89
+
90
+ | Property | Value |
91
+ |----------|-------|
92
+ | Base Model | [mistralai/Leanstral-2603](https://huggingface.co/mistralai/Leanstral-2603) |
93
+ | Architecture | Sparse MoE (128 experts, 4 active) |
94
+ | Parameters | 119B total (MoE) |
95
+ | Context Length | 256K |
96
+ | BF16 Size | ~238 GB |
97
+ | Modalities | Text |
98
+ | License | apache-2.0 |
99
+
100
  ## What is RotorQuant?
101
 
102
+ [RotorQuant](https://github.com/scrya-com/rotorquant) is a KV cache compression method based on Clifford algebra (Cl(3,0)) rotors β€” a faster, more parameter-efficient alternative to Google's TurboQuant. Uses lightweight block-diagonal rotations (independent 2D/4D rotations per pair/quartet) achieving O(d) complexity instead of O(d log d), fully parallelisable with no inter-element dependencies.
103
+
104
+ **Benchmarks** (from the RotorQuant repository, Llama 3.1 8B on RTX 5090 β€” results vary by model and hardware):
105
+
106
+ - Prefill: 3,822 tok/s (vs TurboQuant 722 tok/s)
107
+ - Decode: 119 tok/s (vs TurboQuant 93 tok/s)
108
+ - Perplexity: 6.91 (vs TurboQuant 7.07)
109
+ - Parameters: 4 per rotor (vs TurboQuant 16,384)
110
 
111
+ > Benchmarks are from the RotorQuant repository using Llama 3.1 8B. Performance on Leanstral will differ. Please open a discussion if you have independent results.
 
 
112
 
113
+ ## Current Ecosystem Support
114
 
115
+ | Runtime | RotorQuant Support | Notes |
116
+ |---------|----------------------|-------|
117
+ | Python transformers + `rotorquant` | βœ… Full | Drop-in cache class |
118
+ | llama.cpp upstream | ❌ Not merged | Use fork below |
119
+ | llama-cpp-turboquant fork | βœ… `planar3`, `iso3` | [GitHub](https://github.com/johndpope/llama-cpp-turboquant/tree/feature/planarquant-kv-cache) |
120
+ | LM Studio | ❌ [Requested](https://github.com/lmstudio-ai/lmstudio-bug-tracker/issues/1719) | Use `q8_0` as alternative |
121
+ | Ollama | ❌ Not supported | Use `OLLAMA_KV_CACHE_TYPE=q8_0` |
122
+ | vLLM | ❌ Not supported | β€” |
123
+ | koboldcpp | ❌ Not supported | β€” |
124
 
125
+ ## Pre-quantized weight variants
 
 
 
 
126
 
127
+ If you want combined weight + KV cache compression, majentik hosts pre-quantized versions:
128
 
129
+ - [MLX (Apple Silicon)](https://huggingface.co/majentik?search=Leanstral+MLX)
130
+ - [GGUF (llama.cpp / Ollama / LM Studio)](https://huggingface.co/majentik?search=Leanstral+GGUF)
 
 
 
131
 
132
  ## See Also
133
 
134
+ - [RotorQuant GitHub](https://github.com/scrya-com/rotorquant)
135
+ - [TurboQuant paper (arXiv 2504.19874)](https://arxiv.org/abs/2504.19874)
136
+ - [llama-cpp-turboquant fork](https://github.com/johndpope/llama-cpp-turboquant/tree/feature/planarquant-kv-cache)
137
+ - [Base model: mistralai/Leanstral-2603](https://huggingface.co/mistralai/Leanstral-2603)
138
+ - [Leanstral announcement](https://mistral.ai/news/leanstral)