---
base_model: mistralai/Leanstral-2603
library_name: transformers
tags:
  - turboquant
  - kv-cache-quantization
  - leanstral
  - lean4
  - formal-proofs
  - theorem-proving
  - quantized
  - mistral
  - moe
license: apache-2.0
---

# Leanstral-TurboQuant

KV-cache quantized Leanstral-2603 using TurboQuant, for memory-efficient Lean 4 formal proof generation.

Leanstral is the first open-source AI agent purpose-built for Lean 4 formal proofs, generating both executable code and machine-checkable mathematical proofs. This variant applies TurboQuant quantization to the KV cache only, substantially reducing memory consumption during inference while preserving the full BF16 model weights.

## Overview

This repository provides the TurboQuant KV-cache-only configuration of Leanstral-2603. The model weights remain at full precision; only the KV cache is quantized during inference, reducing peak memory usage for long-context Lean 4 proof generation sessions.

| Spec | Value |
|---|---|
| Base model | mistralai/Leanstral-2603 |
| Architecture | Mistral MoE (~119B parameters, 7 consolidated shards) |
| Compression | TurboQuant KV-cache quantization |
| Weight precision | BF16 (unmodified) |
| KV-cache precision | Mixed-precision quantized |
| License | Apache 2.0 |
| Use case | Lean 4 formal verification, theorem proving, mathematical proofs |

## Quickstart

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from turboquant import TurboQuantCache

model_id = "majentik/Leanstral-TurboQuant"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",
)

# Enable TurboQuant KV-cache quantization
cache = TurboQuantCache(model)

prompt = "Prove that for all natural numbers n, n + 0 = n in Lean 4:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    past_key_values=cache,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## What is TurboQuant?

TurboQuant (arXiv: 2504.19874) is a KV-cache quantization method that compresses the key-value cache used during autoregressive generation. By quantizing the KV cache to lower precision, TurboQuant reduces memory consumption proportionally to context length without modifying the model weights themselves. This is particularly beneficial for Lean 4 proof generation, where proofs can span thousands of tokens.
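To make the idea concrete, here is a minimal sketch of KV-cache-style quantization: store tensors as int8 plus one float scale, and dequantize on read. This is a generic symmetric-quantization illustration, not TurboQuant's actual (mixed-precision) algorithm, and the function names are our own.

```python
import numpy as np

def quantize_int8(x):
    # Symmetric per-tensor quantization: int8 values plus one float scale.
    scale = max(float(np.abs(x).max()), 1e-12) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    # Reconstruct an approximation of the original tensor on read.
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
kv = rng.standard_normal((2, 8, 64)).astype(np.float32)  # toy KV slice
q, s = quantize_int8(kv)
recon = dequantize_int8(q, s)
print(f"{kv.nbytes} B fp32 -> {q.nbytes} B int8, "
      f"max error {np.abs(kv - recon).max():.4f}")
```

Because only the cache is stored in low precision, the 4x storage reduction (fp32 to int8 here) comes at the cost of a small, bounded reconstruction error per element, which grows with sequence length only in memory saved, not in error.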

## Memory Estimates

| Component | Estimate |
|---|---|
| Model weights (BF16) | ~238 GB |
| KV-cache savings | 2-4x reduction vs. FP16 KV cache |
| Recommended VRAM | 4x A100 80GB or equivalent |
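The KV-cache savings can be estimated with simple arithmetic. The layer/head figures below are illustrative placeholders, not the published Leanstral-2603 architecture; substitute the real config values to get actual numbers.

```python
# Back-of-envelope KV-cache sizing (hypothetical architecture values).
n_layers, n_kv_heads, head_dim = 64, 8, 128
seq_len = 32_768  # long proof-generation context

def kv_bytes(bytes_per_elem):
    # Factor of 2 accounts for storing both keys and values.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

gib = 1024 ** 3
print(f"fp16 KV cache:        {kv_bytes(2) / gib:.1f} GiB")
print(f"~4-bit quantized:     {kv_bytes(0.5) / gib:.1f} GiB")
```

Under these assumed dimensions, a 32k-token fp16 cache costs 8 GiB, and a 4x reduction brings it to 2 GiB; since cache size scales linearly with context length, the savings grow with longer proofs.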

## Lean 4 Use Case

Leanstral excels at:

- Formal verification -- generating machine-checkable proofs of mathematical theorems
- Theorem proving -- interactive and automated proof search in Lean 4
- Code generation -- writing verified Lean 4 programs with correctness guarantees
- Proof repair -- fixing incomplete or broken proof scripts
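For a sense of the target output, the theorem in the Quickstart prompt has a one-line Lean 4 proof (written by hand here for illustration, not generated by the model):

```lean
-- n + 0 = n holds definitionally for Nat, so rfl closes the goal.
theorem n_add_zero (n : Nat) : n + 0 = n := rfl
```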

## See Also