Instructions to use uw-math-ai/MathLeap-Octen-8B with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- sentence-transformers
How to use uw-math-ai/MathLeap-Octen-8B with sentence-transformers:
from sentence_transformers import SentenceTransformer model = SentenceTransformer("uw-math-ai/MathLeap-Octen-8B") sentences = [ "That is a happy person", "That is a happy dog", "That is a very happy person", "Today is a sunny day" ] embeddings = model.encode(sentences) similarities = model.similarity(embeddings, embeddings) print(similarities.shape) # [4, 4] - Notebooks
- Google Colab
- Kaggle
MathLeap-Octen-8B
An 8B retrieval-tuned embedding model for mathematical text. Fine-tuned from Octen/Octen-Embedding-8B. It's a retrieval-specialized variant of Qwen3-Embedding-8B — on mathlib4 concepts via multi-view contrastive learning. The goal is to embed mathematically equivalent statements close together regardless of the surface language they are written in.
A companion model fine-tuned from the original Qwen3-Embedding-8B under the same recipe, MathLeap-Qwen-8B is also released.
Usage
from huggingface_hub import snapshot_download
from sentence_transformers import SentenceTransformer
# Download model files from anonymous mirror
model_path = snapshot_download(
repo_id="anonymous-submission/MathLeap-Octen-8B",
endpoint="https://anonymous-hf.up.railway.app/a/9n9cngyu38hk/",
)
# Load locally
model = SentenceTransformer(model_path)
query = "For any natural number n, n + 0 = n."
docs = [
"theorem add_zero (n : â„•) : n + 0 = n := rfl",
"theorem mul_zero (n : â„•) : n * 0 = 0 := rfl",
]
q_emb = model.encode([query])
d_emb = model.encode(docs)
print(q_emb @ d_emb.T)
The model accepts queries and documents interchangeably across three forms:
informal natural-language descriptions, Lean 4 type signatures (e.g., α → α → α),
and full Lean 4 declarations (e.g., theorem foo (x : â„•) : ...).
Model details
| Base model | Octen/Octen-Embedding-8B (a retrieval-specialized fine-tune of Qwen3-Embedding-8B) |
| Parameters | ~8B (decoder-only transformer) |
| Embedding dimension | 4096 |
| Max sequence length | 128 (training); 32K from base |
| Pooling | Last-token |
| Similarity | Cosine, L2-normalized |
Training
Data
The training set is derived from FrenzyMath/mathlib_informal_v4.19.0 (Apache 2.0), augmented with LLM-generated natural-language rephrasings. Each mathlib concept has up to four parallel "views":
nl_informal: informal natural-language descriptionnl_informal_2: LLM-generated NL rephrasing (~85% of concepts)lean_type: Lean 4 type signaturelean_signature: full Lean 4 declaration
Rephrasings were generated by prompting an LLM to first decompose each statement into hypotheses and conclusions, then reassemble them with a different surface form (different variable names, sentence structure, voice, or quantifier ordering) while preserving mathematical content.
The full training set contains 118,334 concepts; held-out evaluation uses 15,287 concepts. Train/dev split is by mathlib module to prevent leakage of structurally related theorems.
Objective
Multi-view contrastive learning with CachedMultipleNegativesRankingLoss
(Gao et al., 2021). For each concept, the data loader samples a random view
pair as (anchor, positive); other concepts in the same batch serve as
in-batch negatives. The cached variant chunks forward passes so that 8B
models fit at batch_size 16 on a single 80GB GPU while remaining
mathematically equivalent to standard MNRL.
This setup implicitly covers all six retrieval directions across NL and Lean modalities (NL→Lean, Lean→NL, NL↔NL, Lean↔Lean) without explicit direction supervision.
Hyperparameters
| Loss | CachedMultipleNegativesRankingLoss |
| Cosine similarity scale | 20.0 |
| Optimizer | AdamW |
| Warmup steps | 700 |
| Scheduler | warmupconstant |
| Batch size | 16 |
| Max sequence length | 128 |
| Epochs | 2 (≈ 14,790 steps) |
| Seed | 42 |
| Train/dev split | Module-level group split, dev_frac = 0.1 |
| Train concepts | 118,334 |
| Held-out concepts | 15,287 |
Evaluation
In-domain held-out FrenzyMath retrieval (six directions)
R@1 / MRR on the held-out test set (15,287 concepts):
| Direction | R@1 base → FT | MRR base → FT |
|---|---|---|
| Informal → Lean type | 64.06 → 70.18 | 74.47 → 78.26 |
| Lean type → Informal | 58.17 → 66.89 | 69.90 → 75.57 |
| Informal → Lean signature | 56.08 → 67.24 | 66.28 → 76.28 |
| Lean signature → Informal | 57.44 → 72.77 | 68.48 → 81.37 |
| Lean type → Lean signature | 60.40 → 65.26 | 69.81 → 73.61 |
| Lean signature → Lean type | 70.04 → 73.34 | 78.45 → 80.56 |
All six directions improve, with the largest gains on Lean → Informal directions. Improvements stack on top of the strong retrieval-specialized starting point that Octen-Embedding-8B already provides.
Lean Blueprints retrieval
| Model | R@1 | R@5 | R@10 |
|---|---|---|---|
| Qwen3-Embedding-8B | 59.52 | 90.13 | 94.11 |
| Octen-Embedding-8B (base) | 64.32 | 92.60 | 96.37 |
| MathLeap-Qwen-8B | 64.73 | 93.10 | 96.44 |
| MathLeap-Octen-8B (this) | 64.69 | 92.80 | 96.92 |
MIRB (out-of-distribution; nDCG@10)
| Model | MO Duplicate | Mathlib Retrieval | Lean Premise |
|---|---|---|---|
| Qwen3-Embedding-8B | 0.859 | 0.559 | 0.136 |
| Octen-Embedding-8B (base) | 0.855 | 0.653 | 0.143 |
| MathLeap-Qwen-8B | 0.821 | 0.653 | 0.103 |
| MathLeap-Octen-8B (this) | 0.816 | 0.667 | 0.103 |
MathLeap-Octen-8B improves on Mathlib Retrieval (the task closest to the training distribution) but degrades on MathOverflow Duplicate retrieval (which involves conversational math beyond rigorous theorem statements) and Lean Premise retrieval (whose queries are proof states, not represented in our training data).
Intended use
- Semantic search over mathlib4 declarations (NL → Lean or Lean → NL)
- Cross-presentation similarity scoring for mathematical statements written in different mathematical sub-languages
- Embedding feature extraction for downstream math NLP tasks
- Retrieval-augmented generation in mathematical contexts
Out-of-scope use
- Open-ended math QA or proof generation (this is a retrieval model)
- Non-mathematical English text — heavily specialized, no general-purpose embedding ability guaranteed
- Safety-critical mathematical verification without independent checking
- Proof-state retrieval (training data does not include tactic-mode Lean)
- Languages other than English
- Cross-lingual mathematical retrieval (not evaluated)
Limitations
Same-distribution training: Trained on FrenzyMath-derived data with a particular informalization style. Performance on math text in substantially different registers (conversational math, textbook prose, research-paper exposition) may degrade. We observe this on MathOverflow Duplicate retrieval, where MathLeap-Octen-8B underperforms its base.
No proof-state coverage: Training data contains theorem statements (
lean_type,lean_signature) but not Lean proof states. On Lean Premise retrieval (MIRB), where queries are tactic-mode terms, MathLeap-Octen-8B underperforms its base.Stacked specialization: Since the base (Octen-Embedding-8B) is already a retrieval-specialized fine-tune of Qwen3-Embedding-8B, gains from MathLeap fine-tuning are smaller in absolute magnitude than for the Qwen-base variant. The trade-off pattern across tasks is similar.
Single-seed estimates: All reported numbers are from a single training run with seed 42; training variance is not characterized.
License
Apache 2.0, matching the upstream Octen-Embedding-8B base model and FrenzyMath training data.
Acknowledgments
Built on Octen-Embedding-8B (Octen Team). Training data derived from FrenzyMath's mathlib_informal release. Mathematics from the mathlib4 community.
- Downloads last month
- 26