MathLeap-Octen-8B

An 8B retrieval-tuned embedding model for mathematical text. Fine-tuned from Octen/Octen-Embedding-8B. It's a retrieval-specialized variant of Qwen3-Embedding-8B — on mathlib4 concepts via multi-view contrastive learning. The goal is to embed mathematically equivalent statements close together regardless of the surface language they are written in.

Usage

from huggingface_hub import snapshot_download
from sentence_transformers import SentenceTransformer

# Download model files from anonymous mirror
model_path = snapshot_download(
    repo_id="anonymous-submission/MathLeap-Octen-8B",
    endpoint="https://anonymous-hf.up.railway.app/a/9n9cngyu38hk/", 
)

# Load locally
model = SentenceTransformer(model_path)

query = "For any natural number n, n + 0 = n."
docs = [
    "theorem add_zero (n : ℕ) : n + 0 = n := rfl",
    "theorem mul_zero (n : ℕ) : n * 0 = 0 := rfl",
]
q_emb = model.encode([query])
d_emb = model.encode(docs)
print(q_emb @ d_emb.T)

The model accepts queries and documents interchangeably across three forms: informal natural-language descriptions, Lean 4 type signatures (e.g., α → α → α), and full Lean 4 declarations (e.g., theorem foo (x : ℕ) : ...).

Model details


Base model	Octen/Octen-Embedding-8B (a retrieval-specialized fine-tune of Qwen3-Embedding-8B)
Parameters	~8B (decoder-only transformer)
Embedding dimension	4096
Max sequence length	128 (training); 32K from base
Pooling	Last-token
Similarity	Cosine, L2-normalized

Training

Data

The training set is derived from FrenzyMath/mathlib_informal_v4.19.0 (Apache 2.0), augmented with LLM-generated natural-language rephrasings. Each mathlib concept has up to four parallel "views":

nl_informal: informal natural-language description
nl_informal_2: LLM-generated NL rephrasing (~85% of concepts)
lean_type: Lean 4 type signature
lean_signature: full Lean 4 declaration

Rephrasings were generated by prompting an LLM to first decompose each statement into hypotheses and conclusions, then reassemble them with a different surface form (different variable names, sentence structure, voice, or quantifier ordering) while preserving mathematical content.

The full training set contains 118,334 concepts; held-out evaluation uses 15,287 concepts. Train/dev split is by mathlib module to prevent leakage of structurally related theorems.

Objective

Multi-view contrastive learning with CachedMultipleNegativesRankingLoss (Gao et al., 2021). For each concept, the data loader samples a random view pair as (anchor, positive); other concepts in the same batch serve as in-batch negatives. The cached variant chunks forward passes so that 8B models fit at batch_size 16 on a single 80GB GPU while remaining mathematically equivalent to standard MNRL.

This setup implicitly covers all six retrieval directions across NL and Lean modalities (NL→Lean, Lean→NL, NL↔NL, Lean↔Lean) without explicit direction supervision.

Hyperparameters


Loss	CachedMultipleNegativesRankingLoss
Cosine similarity scale	20.0
Optimizer	AdamW
Warmup steps	700
Scheduler	warmupconstant
Batch size	16
Max sequence length	128
Epochs	2 (≈ 14,790 steps)
Seed	42
Train/dev split	Module-level group split, dev_frac = 0.1
Train concepts	118,334
Held-out concepts	15,287

Evaluation

In-domain held-out FrenzyMath retrieval (six directions)

R@1 / MRR on the held-out test set (15,287 concepts):

Direction	R@1 base → FT	MRR base → FT
Informal → Lean type	64.06 → 70.18	74.47 → 78.26
Lean type → Informal	58.17 → 66.89	69.90 → 75.57
Informal → Lean signature	56.08 → 67.24	66.28 → 76.28
Lean signature → Informal	57.44 → 72.77	68.48 → 81.37
Lean type → Lean signature	60.40 → 65.26	69.81 → 73.61
Lean signature → Lean type	70.04 → 73.34	78.45 → 80.56

All six directions improve, with the largest gains on Lean → Informal directions. Improvements stack on top of the strong retrieval-specialized starting point that Octen-Embedding-8B already provides.

Lean Blueprints retrieval

Model	R@1	R@5	R@10
Qwen3-Embedding-8B	59.52	90.13	94.11
Octen-Embedding-8B (base)	64.32	92.60	96.37
MathLeap-Qwen-8B	64.73	93.10	96.44
MathLeap-Octen-8B (this)	64.69	92.80	96.92

MIRB (out-of-distribution; nDCG@10)

Model	MO Duplicate	Mathlib Retrieval	Lean Premise
Qwen3-Embedding-8B	0.859	0.559	0.136
Octen-Embedding-8B (base)	0.855	0.653	0.143
MathLeap-Qwen-8B	0.821	0.653	0.103
MathLeap-Octen-8B (this)	0.816	0.667	0.103

MathLeap-Octen-8B improves on Mathlib Retrieval (the task closest to the training distribution) but degrades on MathOverflow Duplicate retrieval (which involves conversational math beyond rigorous theorem statements) and Lean Premise retrieval (whose queries are proof states, not represented in our training data).

Intended use

Semantic search over mathlib4 declarations (NL → Lean or Lean → NL)
Cross-presentation similarity scoring for mathematical statements written in different mathematical sub-languages
Embedding feature extraction for downstream math NLP tasks
Retrieval-augmented generation in mathematical contexts

Out-of-scope use

Open-ended math QA or proof generation (this is a retrieval model)
Non-mathematical English text — heavily specialized, no general-purpose embedding ability guaranteed
Safety-critical mathematical verification without independent checking
Proof-state retrieval (training data does not include tactic-mode Lean)
Languages other than English
Cross-lingual mathematical retrieval (not evaluated)

Limitations

Same-distribution training: Trained on FrenzyMath-derived data with a particular informalization style. Performance on math text in substantially different registers (conversational math, textbook prose, research-paper exposition) may degrade. We observe this on MathOverflow Duplicate retrieval, where MathLeap-Octen-8B underperforms its base.
No proof-state coverage: Training data contains theorem statements (lean_type, lean_signature) but not Lean proof states. On Lean Premise retrieval (MIRB), where queries are tactic-mode terms, MathLeap-Octen-8B underperforms its base.
Stacked specialization: Since the base (Octen-Embedding-8B) is already a retrieval-specialized fine-tune of Qwen3-Embedding-8B, gains from MathLeap fine-tuning are smaller in absolute magnitude than for the Qwen-base variant. The trade-off pattern across tasks is similar.
Single-seed estimates: All reported numbers are from a single training run with seed 42; training variance is not characterized.