README.md · majentik/Leanstral-RotorQuant-MLX-8bit at main

Leanstral-RotorQuant-MLX-8bit / README.md

majentik

Add MLX 8-bit quantized model with KV cache compression

cfb84ae verified 3 days ago

preview code

raw

history blame contribute delete

4.28 kB

	---
	base_model: mistralai/Leanstral-2603
	library_name: mlx
	tags:
	- rotorquant
	- kv-cache-quantization
	- mlx
	- 8bit
	- weight-quantization
	- leanstral
	- lean4
	- formal-proofs
	- theorem-proving
	- quantized
	- apple-silicon
	- mistral
	- moe
	license: apache-2.0
	pipeline_tag: text-generation
	---

	# Leanstral-RotorQuant-MLX-8bit

	8-bit MLX weight-quantized [Leanstral-2603](https://huggingface.co/mistralai/Leanstral-2603) with [RotorQuant](https://github.com/scrya-com/rotorquant) KV-cache quantization for Lean 4 formal proof generation on Apple Silicon.

	Leanstral is the first open-source AI agent purpose-built for Lean 4 formal proofs -- generating both executable code and machine-checkable mathematical proofs. This variant combines dual compression: 8-bit MLX weight quantization for reduced model size plus RotorQuant KV-cache quantization for efficient long-context inference with faster prefill and decode.

	Approximate model size: ~120 GB

	## Overview

	This repository provides a dual-compressed configuration: MLX 8-bit weight quantization reduces the static memory footprint, while RotorQuant compresses the KV cache at runtime with superior throughput. Together, they enable running Leanstral on high-memory Apple Silicon machines.

	\| Spec \| Value \|
	\|------\|-------\|
	\| Base model \| [mistralai/Leanstral-2603](https://huggingface.co/mistralai/Leanstral-2603) \|
	\| Architecture \| Mistral MoE (~119B parameters, 7 consolidated shards) \|
	\| Weight quantization \| 8-bit (MLX) \|
	\| KV-cache quantization \| RotorQuant \|
	\| Weight memory \| ~120 GB \|
	\| Runtime \| MLX (Apple Silicon) \|
	\| License \| Apache 2.0 \|
	\| Use case \| Lean 4 formal verification, theorem proving, mathematical proofs \|

	## Quickstart

	```python
	from mlx_lm import load, generate

	model, tokenizer = load("majentik/Leanstral-RotorQuant-MLX-8bit")

	prompt = "Prove that for all natural numbers n, n + 0 = n in Lean 4:"
	response = generate(
	model,
	tokenizer,
	prompt=prompt,
	max_tokens=512,
	)
	print(response)
	```

	## What is RotorQuant?

	[RotorQuant](https://github.com/scrya-com/rotorquant) is a rotation-based KV cache quantization method that applies learned Clifford algebra rotations before quantizing the key-value cache. Key results:

	- 5.3x faster prefill compared to TurboQuant baseline
	- 28% faster decode throughput
	- Perplexity: 6.91 vs 7.07 for TurboQuant (lower is better)

	Combined with MLX 8-bit weight quantization, this dual compression approach makes it feasible to run Leanstral's ~119B parameter model on Apple Silicon hardware with excellent throughput.

	## KV-Cache Quantization Comparison

	\| Method \| Prefill Speed \| Decode Speed \| Memory Savings \| Reference \|
	\|---\|---\|---\|---\|---\|
	\| TurboQuant \| Baseline \| Baseline \| High \| [arXiv: 2504.19874](https://arxiv.org/abs/2504.19874) \|
	\| RotorQuant \| 5.3x faster \| 28% faster \| High \| [GitHub](https://github.com/scrya-com/rotorquant) \|

	## Memory Estimates

	\| Component \| Estimate \|
	\|-----------\|----------\|
	\| Model weights (8-bit) \| ~120 GB \|
	\| KV-cache \| Reduced via RotorQuant \|
	\| Recommended hardware \| Mac Studio M2/M3/M4 Ultra (192 GB+) or Mac Pro \|

	## Lean 4 Use Case

	Leanstral excels at:
	- Formal verification -- generating machine-checkable proofs of mathematical theorems
	- Theorem proving -- interactive and automated proof search in Lean 4
	- Code generation -- writing verified Lean 4 programs with correctness guarantees
	- Proof repair -- fixing incomplete or broken proof scripts

	## See Also

	- [mistralai/Leanstral-2603](https://huggingface.co/mistralai/Leanstral-2603) -- Base model
	- [majentik/Leanstral-RotorQuant](https://huggingface.co/majentik/Leanstral-RotorQuant) -- Full-precision weights + RotorQuant KV cache
	- [majentik/Leanstral-RotorQuant-MLX-4bit](https://huggingface.co/majentik/Leanstral-RotorQuant-MLX-4bit) -- MLX 4-bit + RotorQuant
	- [majentik/Leanstral-RotorQuant-MLX-2bit](https://huggingface.co/majentik/Leanstral-RotorQuant-MLX-2bit) -- MLX 2-bit + RotorQuant
	- [majentik/Leanstral-TurboQuant-MLX-8bit](https://huggingface.co/majentik/Leanstral-TurboQuant-MLX-8bit) -- MLX 8-bit + TurboQuant
	- [RotorQuant GitHub](https://github.com/scrya-com/rotorquant)
	- [MLX Framework](https://github.com/ml-explore/mlx)