---
base_model: mistralai/Leanstral-2603
library_name: transformers
tags:
  - turboquant
  - kv-cache-quantization
  - leanstral
  - lean4
  - formal-proofs
  - theorem-proving
  - quantized
  - mistral
  - moe
license: apache-2.0
---

# Leanstral-TurboQuant

KV-cache quantized Leanstral-2603 using TurboQuant, for memory-efficient Lean 4 formal proof generation.

Leanstral is the first open-source AI agent purpose-built for Lean 4 formal proofs, generating both executable code and machine-checkable mathematical proofs. This variant applies TurboQuant quantization to the KV cache only, substantially reducing memory consumption during inference while preserving the full BF16 model weights.

## Overview

This repository provides the TurboQuant KV-cache-only configuration of Leanstral-2603. The model weights remain at full precision; only the KV cache is quantized during inference, reducing peak memory usage for long-context Lean 4 proof generation sessions.

| Spec | Value |
|---|---|
| Base model | mistralai/Leanstral-2603 |
| Architecture | Mistral MoE (~119B parameters, 7 consolidated shards) |
| Compression | TurboQuant KV-cache quantization |
| Weight precision | BF16 (unmodified) |
| KV-cache precision | Mixed-precision quantized |
| License | Apache 2.0 |
| Use case | Lean 4 formal verification, theorem proving, mathematical proofs |

## Quickstart

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from turboquant import TurboQuantCache

model_id = "majentik/Leanstral-TurboQuant"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",
)

# Enable TurboQuant KV-cache quantization
cache = TurboQuantCache(model)

prompt = "Prove that for all natural numbers n, n + 0 = n in Lean 4:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    past_key_values=cache,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## What is TurboQuant?

TurboQuant (arXiv: 2504.19874) is a KV-cache quantization method that compresses the key-value cache used during autoregressive generation. By quantizing the KV cache to lower precision, TurboQuant reduces memory consumption proportionally to context length without modifying the model weights themselves. This is particularly beneficial for Lean 4 proof generation, where proofs can span thousands of tokens.
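To make the idea concrete, here is a minimal sketch of KV-cache-style quantization: store tensors as int8 plus one float scale, and dequantize on read. This is a generic symmetric-quantization illustration, not TurboQuant's actual (mixed-precision) algorithm, and the function names are our own.

```python
import numpy as np

def quantize_int8(x):
    # Symmetric per-tensor quantization: int8 values plus one float scale.
    scale = max(float(np.abs(x).max()), 1e-12) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    # Reconstruct an approximation of the original tensor on read.
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
kv = rng.standard_normal((2, 8, 64)).astype(np.float32)  # toy KV slice
q, s = quantize_int8(kv)
recon = dequantize_int8(q, s)
print(f"{kv.nbytes} B fp32 -> {q.nbytes} B int8, "
      f"max error {np.abs(kv - recon).max():.4f}")
```

Because only the cache is stored in low precision, the 4x storage reduction (fp32 to int8 here) comes at the cost of a small, bounded reconstruction error per element, which grows with sequence length only in memory saved, not in error.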

## Memory Estimates

| Component | Estimate |
|---|---|
| Model weights (BF16) | ~238 GB |
| KV-cache savings | 2-4x reduction vs. FP16 KV cache |
| Recommended VRAM | 4x A100 80GB or equivalent |
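The KV-cache savings can be estimated with simple arithmetic. The layer/head figures below are illustrative placeholders, not the published Leanstral-2603 architecture; substitute the real config values to get actual numbers.

```python
# Back-of-envelope KV-cache sizing (hypothetical architecture values).
n_layers, n_kv_heads, head_dim = 64, 8, 128
seq_len = 32_768  # long proof-generation context

def kv_bytes(bytes_per_elem):
    # Factor of 2 accounts for storing both keys and values.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

gib = 1024 ** 3
print(f"fp16 KV cache:        {kv_bytes(2) / gib:.1f} GiB")
print(f"~4-bit quantized:     {kv_bytes(0.5) / gib:.1f} GiB")
```

Under these assumed dimensions, a 32k-token fp16 cache costs 8 GiB, and a 4x reduction brings it to 2 GiB; since cache size scales linearly with context length, the savings grow with longer proofs.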

## Lean 4 Use Case

Leanstral excels at:

- Formal verification -- generating machine-checkable proofs of mathematical theorems
- Theorem proving -- interactive and automated proof search in Lean 4
- Code generation -- writing verified Lean 4 programs with correctness guarantees
- Proof repair -- fixing incomplete or broken proof scripts
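For a sense of the target output, the theorem in the Quickstart prompt has a one-line Lean 4 proof (written by hand here for illustration, not generated by the model):

```lean
-- n + 0 = n holds definitionally for Nat, so rfl closes the goal.
theorem n_add_zero (n : Nat) : n + 0 = n := rfl
```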

## See Also