KV-cache quantization without any fork (recommended, 2026): upstream llama.cpp and Ollama now cover this natively. Use -ctk q8_0 -ctv q8_0 (about half the KV memory, negligible quality loss: perplexity +0.002–0.05) or -ctk q4_0 -ctv q4_0 (about a quarter of the KV memory, ≈7.6% perplexity increase). In Ollama, set OLLAMA_KV_CACHE_TYPE=q8_0 together with OLLAMA_FLASH_ATTENTION=1. Keep the K and V types symmetric to stay on the fast fused Flash-Attention path. Since April 2026, mainline llama.cpp also applies a Hadamard rotation to KV activations (PR #21038), which greatly improves low-bit KV quality (opt-out: LLAMA_ATTN_ROT_DISABLE=1).

The RotorQuant/TurboQuant fork flow below is experimental/legacy: the TurboQuant llama.cpp PR was closed without merging (June 2026) and the fork is unmaintained relative to mainline. It is NOT required to use this model.
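The "half" and "quarter" figures in the note above follow from how the q8_0 and q4_0 block formats store data: 32 values per block plus one fp16 scale. The sketch below works through that arithmetic. The model shape is a hypothetical placeholder chosen only for illustration (it is not Leanstral's actual configuration), and `kv_cache_bytes` is an illustrative helper, not a llama.cpp API.

```python
# Storage cost per cached value for the cache types named in the note above.
# q8_0 and q4_0 pack 32 values per block together with one fp16 scale.
BLOCK = 32
BITS_PER_VALUE = {
    "f16": 16.0,
    "q8_0": (BLOCK * 8 + 16) / BLOCK,  # 8.5 bits per value
    "q4_0": (BLOCK * 4 + 16) / BLOCK,  # 4.5 bits per value
}


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, cache_type):
    """Bytes needed to cache keys and values for n_ctx tokens."""
    values = 2 * n_layers * n_kv_heads * head_dim * n_ctx  # 2 = K and V
    return values * BITS_PER_VALUE[cache_type] / 8


# Hypothetical shape, for illustration only (assumption, not Leanstral's config).
shape = dict(n_layers=48, n_kv_heads=8, head_dim=128, n_ctx=32_768)

baseline = kv_cache_bytes(cache_type="f16", **shape)
for cache_type in BITS_PER_VALUE:
    size = kv_cache_bytes(cache_type=cache_type, **shape)
    print(f"{cache_type:>5}: {size / 2**30:6.2f} GiB ({size / baseline:.0%} of f16)")
```

Because of the per-block scale, q8_0 lands at roughly 53% of the f16 cache and q4_0 at roughly 28%, slightly above an exact half and quarter. The absolute sizes scale linearly with `n_ctx`.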

Leanstral-TurboQuant-MLX-2bit

2-bit MLX weight-quantized Leanstral-2603 with TurboQuant KV-cache quantization for Lean 4 formal proof generation on Apple Silicon.

Leanstral is the first open-source AI agent purpose-built for Lean 4 formal proofs -- generating both executable code and machine-checkable mathematical proofs. This variant combines dual compression: 2-bit MLX weight quantization for aggressive model size reduction plus TurboQuant KV-cache quantization for efficient long-context inference.

Overview

This repository provides a dual-compressed configuration: MLX 2-bit weight quantization shrinks the static weight footprint to roughly 30 GB, while TurboQuant compresses the KV cache at runtime. Together they make it practical to run Leanstral on Apple Silicon machines with 64 GB or more of unified memory (see Memory Estimates).

Spec	Value
Base model	mistralai/Leanstral-2603
Architecture	Mistral MoE (~119B parameters, 7 consolidated shards)
Weight quantization	2-bit (MLX)
KV-cache quantization	TurboQuant
Weight memory	~30 GB
Runtime	MLX (Apple Silicon)
License	Apache 2.0
Use case	Lean 4 formal verification, theorem proving, mathematical proofs

Quickstart

from mlx_lm import load, generate

model, tokenizer = load("majentik/Leanstral-TurboQuant-MLX-2bit")

prompt = "Prove that for all natural numbers n, n + 0 = n in Lean 4:"
response = generate(
    model,
    tokenizer,
    prompt=prompt,
    max_tokens=512,
)
print(response)
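
For orientation, a correct answer to the prompt above could look like the following Lean 4 snippet. This is a hand-written reference, not captured model output, and the theorem name is hypothetical. Generated proofs vary from run to run and should always be checked by Lean itself.

```lean
-- `n + 0` reduces to `n` by the definition of `Nat.add`, so `rfl` closes the goal.
theorem add_zero_example (n : Nat) : n + 0 = n := rfl

-- The same statement, proved by citing the core library lemma.
example (n : Nat) : n + 0 = n := Nat.add_zero n
```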

What is TurboQuant?

TurboQuant (arXiv: 2504.19874) is a KV-cache quantization method that compresses the key-value cache used during autoregressive generation. Because the KV cache grows linearly with context length, storing it at lower precision yields savings that grow with the context as well. The two compression steps address different costs: MLX 2-bit weight quantization brings the weights of the ~119B-parameter model down to ~30 GB, and TurboQuant limits the additional memory the cache needs on top of that.
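
The TurboQuant paper describes rotating vectors before quantizing each coordinate, and the note at the top of this card mentions a Hadamard rotation in mainline llama.cpp. The toy sketch below illustrates only the general rotate-then-quantize idea. It swaps in a Walsh-Hadamard transform and a plain uniform 4-bit quantizer, so it is neither the paper's algorithm nor the code path used by mlx-lm or llama.cpp. All function names are hypothetical.

```python
import math
import random


def hadamard(x):
    """Orthonormal fast Walsh-Hadamard transform; len(x) must be a power of two."""
    y = list(x)
    h = 1
    while h < len(y):
        for i in range(0, len(y), 2 * h):
            for j in range(i, i + h):
                y[j], y[j + h] = y[j] + y[j + h], y[j] - y[j + h]
        h *= 2
    norm = math.sqrt(len(y))
    return [v / norm for v in y]


def quantize_roundtrip(x, bits=4):
    """Symmetric uniform quantization with one absmax scale, then dequantization."""
    levels = 2 ** (bits - 1) - 1  # 7 for 4-bit
    scale = max(abs(v) for v in x) / levels
    return [round(v / scale) * scale for v in x]


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((u - v) ** 2 for u, v in zip(a, b)) / len(a)


random.seed(0)
x = [random.gauss(0.0, 1.0) for _ in range(128)]
x[5] = 20.0  # a single outlier channel dominates the absmax scale

err_plain = mse(x, quantize_roundtrip(x))
# Rotate, quantize, rotate back (the orthonormal transform is its own inverse).
err_rotated = mse(x, hadamard(quantize_roundtrip(hadamard(x))))

print(f"plain 4-bit MSE:   {err_plain:.4f}")
print(f"rotated 4-bit MSE: {err_rotated:.4f}")
```

Without rotation, the outlier forces a coarse step size and most ordinary values collapse to zero. The rotation spreads the outlier's energy across all coordinates, which shrinks the absmax scale and therefore the reconstruction error.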

Note: 2-bit weight quantization is lossy. Expect some degradation in proof quality compared to the 4-bit variant. For critical formal verification work, prefer the 4-bit or full-precision variants.

Memory Estimates

Component	Estimate
Model weights (2-bit)	~30 GB
KV-cache	Reduced via TurboQuant
Recommended hardware	MacBook Pro M2/M3/M4 Max (64 GB+) or Mac Studio

Lean 4 Use Case

Leanstral excels at:

Formal verification -- generating machine-checkable proofs of mathematical theorems
Theorem proving -- interactive and automated proof search in Lean 4
Code generation -- writing verified Lean 4 programs with correctness guarantees
Proof repair -- fixing incomplete or broken proof scripts

Quant trade-off (MLX lane)

Bits	Approx size	Use case	Recommendation
2-bit	~31 GB	Aggressive quantization	Very low-RAM Macs
3-bit	~43 GB	Lossy but small	Low-RAM Macs
4-bit	~50 GB	Balanced default	Recommended for most Macs
5-bit	~60 GB	Higher fidelity	Quality-sensitive
6-bit	~71 GB	Approaching FP16 quality	High-fidelity
8-bit	~90 GB	Near-lossless reference	Fidelity-critical work

(The current variant is 2-bit.)

Variants in this family

(Showing 8 sibling variants under majentik/leanstral-*. The current variant is TurboQuant-MLX-2bit.)

Variant	Runtime	Approx size	Use case
RotorQuant	runtime modifier	n/a	KV-cache root (weight-agnostic)
RotorQuant-MLX-2bit	mlx-lm	card-only	Apple Silicon, smallest
RotorQuant-MLX-4bit	mlx-lm	card-only	Apple Silicon balanced
RotorQuant-MLX-8bit	mlx-lm	card-only	Apple Silicon reference
TurboQuant	runtime modifier	n/a	KV-cache root (weight-agnostic)
TurboQuant-MLX-2bit	mlx-lm	card-only	Apple Silicon, smallest
TurboQuant-MLX-4bit	mlx-lm	card-only	Apple Silicon balanced
TurboQuant-MLX-8bit	mlx-lm	card-only	Apple Silicon reference

Paper

TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate (arXiv: 2504.19874, published Apr 28, 2025)
