# Leanstral-TurboQuant
KV-cache quantized Leanstral-2603 using TurboQuant for efficient Lean 4 formal proof generation.
Leanstral is the first open-source AI agent purpose-built for Lean 4 formal proofs -- generating both executable code and machine-checkable mathematical proofs. This variant applies TurboQuant KV-cache quantization to drastically reduce memory consumption during inference while preserving the full BF16 model weights.
## Overview
This repository provides the TurboQuant KV-cache-only configuration of Leanstral-2603. The model weights remain at full precision; only the KV cache is quantized during inference, reducing peak memory usage for long-context Lean 4 proof generation sessions.
| Spec | Value |
|---|---|
| Base model | mistralai/Leanstral-2603 |
| Architecture | Mistral MoE (~119B parameters, 7 consolidated shards) |
| Compression | TurboQuant KV-cache quantization |
| Weight precision | BF16 (unmodified) |
| KV-cache precision | Mixed-precision quantized |
| License | Apache 2.0 |
| Use case | Lean 4 formal verification, theorem proving, mathematical proofs |
## Quickstart
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from turboquant import TurboQuantCache

model_id = "majentik/Leanstral-TurboQuant"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",
)

# Enable TurboQuant KV-cache quantization
cache = TurboQuantCache(model)

prompt = "Prove that for all natural numbers n, n + 0 = n in Lean 4:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    past_key_values=cache,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## What is TurboQuant?
TurboQuant (arXiv:2504.19874) is a KV-cache quantization method that compresses the key-value cache used during autoregressive generation. Because KV-cache memory grows linearly with context length, quantizing the cache to lower precision cuts peak memory by a constant factor without modifying the model weights themselves. This is particularly beneficial for Lean 4 proof generation, where proofs can span thousands of tokens.
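TurboQuant's actual scheme is more sophisticated (see the paper for details); as a rough illustration of the general idea, here is a generic per-channel int8 absmax quantizer for a KV tensor. The shapes and the quantizer itself are illustrative assumptions, not TurboQuant's algorithm:

```python
import numpy as np

def quantize_kv(x: np.ndarray):
    """Quantize a KV tensor to int8 with per-channel absmax scales (illustrative)."""
    scale = np.abs(x).max(axis=-1, keepdims=True) / 127.0
    scale = np.maximum(scale, 1e-8)  # guard against all-zero channels
    q = np.round(x / scale).astype(np.int8)
    return q, scale

def dequantize_kv(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Reconstruct an approximate float tensor from int8 codes and scales."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
# Hypothetical cache slice: (kv_heads, cached_tokens, head_dim)
kv = rng.standard_normal((4, 128, 64)).astype(np.float32)

q, scale = quantize_kv(kv)
recon = dequantize_kv(q, scale)
err = np.abs(kv - recon).max()
print(f"int8 codes use {q.nbytes / kv.nbytes:.2f}x the FP32 bytes; max abs error {err:.4f}")
```

The int8 codes take a quarter of the FP32 storage (the per-channel scales add a small overhead), and the reconstruction error stays small because each channel is scaled to its own range.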
## Memory Estimates
| Component | Estimate |
|---|---|
| Model weights (BF16) | ~238 GB |
| KV-cache savings | 2-4x reduction vs FP16 KV cache |
| Recommended VRAM | 4x A100 80GB or equivalent |
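The savings scale with context length. A back-of-envelope sizing sketch; the layer count, KV-head count, and head dimension below are assumed values for illustration, not Leanstral-2603's published configuration:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem):
    # Keys and values are both cached, hence the factor of 2.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

seq_len = 32_768  # a long proof session

# ASSUMED hyperparameters, for illustration only
fp16 = kv_cache_bytes(n_layers=60, n_kv_heads=8, head_dim=128,
                      seq_len=seq_len, bytes_per_elem=2)
int4 = kv_cache_bytes(n_layers=60, n_kv_heads=8, head_dim=128,
                      seq_len=seq_len, bytes_per_elem=0.5)

print(f"FP16 KV cache: {fp16 / 2**30:.1f} GiB; "
      f"4-bit: {int4 / 2**30:.1f} GiB ({fp16 / int4:.0f}x smaller)")
```

Under these assumptions, a 32K-token FP16 cache occupies several GiB per sequence, so a 2-4x reduction frees meaningful headroom for longer contexts or larger batches.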
## Lean 4 Use Case
Leanstral excels at:
- Formal verification -- generating machine-checkable proofs of mathematical theorems
- Theorem proving -- interactive and automated proof search in Lean 4
- Code generation -- writing verified Lean 4 programs with correctness guarantees
- Proof repair -- fixing incomplete or broken proof scripts
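As a concrete target, the Quickstart prompt above asks the model to prove `n + 0 = n`; a machine-checkable Lean 4 proof of that statement looks like the following (the theorem name is illustrative; Lean's standard library already provides this fact as `Nat.add_zero`):

```lean
-- n + 0 = n holds definitionally, because Nat.add recurses on its second argument
theorem my_add_zero (n : Nat) : n + 0 = n := rfl
```

The Lean compiler verifies such output automatically, which is what makes the generated proofs machine-checkable rather than merely plausible.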
## See Also
- mistralai/Leanstral-2603 -- Base model
- majentik/Leanstral-RotorQuant -- RotorQuant KV-cache variant (faster prefill/decode)
- majentik/Leanstral-TurboQuant-MLX-4bit -- MLX 4-bit + TurboQuant
- majentik/Leanstral-TurboQuant-MLX-2bit -- MLX 2-bit + TurboQuant
- majentik/Leanstral-TurboQuant-MLX-1bit -- MLX 1-bit + TurboQuant
- TurboQuant paper