MathLeap-Octen-8B

An 8B retrieval-tuned embedding model for mathematical text. Fine-tuned from Octen/Octen-Embedding-8B. It's a retrieval-specialized variant of Qwen3-Embedding-8B — on mathlib4 concepts via multi-view contrastive learning. The goal is to embed mathematically equivalent statements close together regardless of the surface language they are written in.

A companion model fine-tuned from the original Qwen3-Embedding-8B under the same recipe, MathLeap-Qwen-8B is also released.

Usage

from huggingface_hub import snapshot_download
from sentence_transformers import SentenceTransformer

# Download model files from anonymous mirror
model_path = snapshot_download(
    repo_id="anonymous-submission/MathLeap-Octen-8B",
    endpoint="https://anonymous-hf.up.railway.app/a/9n9cngyu38hk/", 
)

# Load locally
model = SentenceTransformer(model_path)

query = "For any natural number n, n + 0 = n."
docs = [
    "theorem add_zero (n : â„•) : n + 0 = n := rfl",
    "theorem mul_zero (n : â„•) : n * 0 = 0 := rfl",
]
q_emb = model.encode([query])
d_emb = model.encode(docs)
print(q_emb @ d_emb.T)

The model accepts queries and documents interchangeably across three forms: informal natural-language descriptions, Lean 4 type signatures (e.g., α → α → α), and full Lean 4 declarations (e.g., theorem foo (x : ℕ) : ...).

Model details

Base model Octen/Octen-Embedding-8B (a retrieval-specialized fine-tune of Qwen3-Embedding-8B)
Parameters ~8B (decoder-only transformer)
Embedding dimension 4096
Max sequence length 128 (training); 32K from base
Pooling Last-token
Similarity Cosine, L2-normalized

Training

Data

The training set is derived from FrenzyMath/mathlib_informal_v4.19.0 (Apache 2.0), augmented with LLM-generated natural-language rephrasings. Each mathlib concept has up to four parallel "views":

  • nl_informal: informal natural-language description
  • nl_informal_2: LLM-generated NL rephrasing (~85% of concepts)
  • lean_type: Lean 4 type signature
  • lean_signature: full Lean 4 declaration

Rephrasings were generated by prompting an LLM to first decompose each statement into hypotheses and conclusions, then reassemble them with a different surface form (different variable names, sentence structure, voice, or quantifier ordering) while preserving mathematical content.

The full training set contains 118,334 concepts; held-out evaluation uses 15,287 concepts. Train/dev split is by mathlib module to prevent leakage of structurally related theorems.

Objective

Multi-view contrastive learning with CachedMultipleNegativesRankingLoss (Gao et al., 2021). For each concept, the data loader samples a random view pair as (anchor, positive); other concepts in the same batch serve as in-batch negatives. The cached variant chunks forward passes so that 8B models fit at batch_size 16 on a single 80GB GPU while remaining mathematically equivalent to standard MNRL.

This setup implicitly covers all six retrieval directions across NL and Lean modalities (NL→Lean, Lean→NL, NL↔NL, Lean↔Lean) without explicit direction supervision.

Hyperparameters

Loss CachedMultipleNegativesRankingLoss
Cosine similarity scale 20.0
Optimizer AdamW
Warmup steps 700
Scheduler warmupconstant
Batch size 16
Max sequence length 128
Epochs 2 (≈ 14,790 steps)
Seed 42
Train/dev split Module-level group split, dev_frac = 0.1
Train concepts 118,334
Held-out concepts 15,287

Evaluation

In-domain held-out FrenzyMath retrieval (six directions)

R@1 / MRR on the held-out test set (15,287 concepts):

Direction R@1 base → FT MRR base → FT
Informal → Lean type 64.06 → 70.18 74.47 → 78.26
Lean type → Informal 58.17 → 66.89 69.90 → 75.57
Informal → Lean signature 56.08 → 67.24 66.28 → 76.28
Lean signature → Informal 57.44 → 72.77 68.48 → 81.37
Lean type → Lean signature 60.40 → 65.26 69.81 → 73.61
Lean signature → Lean type 70.04 → 73.34 78.45 → 80.56

All six directions improve, with the largest gains on Lean → Informal directions. Improvements stack on top of the strong retrieval-specialized starting point that Octen-Embedding-8B already provides.

Lean Blueprints retrieval

Model R@1 R@5 R@10
Qwen3-Embedding-8B 59.52 90.13 94.11
Octen-Embedding-8B (base) 64.32 92.60 96.37
MathLeap-Qwen-8B 64.73 93.10 96.44
MathLeap-Octen-8B (this) 64.69 92.80 96.92

MIRB (out-of-distribution; nDCG@10)

Model MO Duplicate Mathlib Retrieval Lean Premise
Qwen3-Embedding-8B 0.859 0.559 0.136
Octen-Embedding-8B (base) 0.855 0.653 0.143
MathLeap-Qwen-8B 0.821 0.653 0.103
MathLeap-Octen-8B (this) 0.816 0.667 0.103

MathLeap-Octen-8B improves on Mathlib Retrieval (the task closest to the training distribution) but degrades on MathOverflow Duplicate retrieval (which involves conversational math beyond rigorous theorem statements) and Lean Premise retrieval (whose queries are proof states, not represented in our training data).

Intended use

  • Semantic search over mathlib4 declarations (NL → Lean or Lean → NL)
  • Cross-presentation similarity scoring for mathematical statements written in different mathematical sub-languages
  • Embedding feature extraction for downstream math NLP tasks
  • Retrieval-augmented generation in mathematical contexts

Out-of-scope use

  • Open-ended math QA or proof generation (this is a retrieval model)
  • Non-mathematical English text — heavily specialized, no general-purpose embedding ability guaranteed
  • Safety-critical mathematical verification without independent checking
  • Proof-state retrieval (training data does not include tactic-mode Lean)
  • Languages other than English
  • Cross-lingual mathematical retrieval (not evaluated)

Limitations

  1. Same-distribution training: Trained on FrenzyMath-derived data with a particular informalization style. Performance on math text in substantially different registers (conversational math, textbook prose, research-paper exposition) may degrade. We observe this on MathOverflow Duplicate retrieval, where MathLeap-Octen-8B underperforms its base.

  2. No proof-state coverage: Training data contains theorem statements (lean_type, lean_signature) but not Lean proof states. On Lean Premise retrieval (MIRB), where queries are tactic-mode terms, MathLeap-Octen-8B underperforms its base.

  3. Stacked specialization: Since the base (Octen-Embedding-8B) is already a retrieval-specialized fine-tune of Qwen3-Embedding-8B, gains from MathLeap fine-tuning are smaller in absolute magnitude than for the Qwen-base variant. The trade-off pattern across tasks is similar.

  4. Single-seed estimates: All reported numbers are from a single training run with seed 42; training variance is not characterized.

License

Apache 2.0, matching the upstream Octen-Embedding-8B base model and FrenzyMath training data.

Acknowledgments

Built on Octen-Embedding-8B (Octen Team). Training data derived from FrenzyMath's mathlib_informal release. Mathematics from the mathlib4 community.

Downloads last month
26
Safetensors
Model size
8B params
Tensor type
F32
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for uw-math-ai/MathLeap-Octen-8B

Finetuned
(4)
this model