MathLeap-Qwen-8B

A retrieval-tuned embedding model for mathematical text. Fine-tuned from Qwen/Qwen3-Embedding-8B on mathlib4 concepts via multi-view contrastive learning.

On AMP — a benchmark of mathematical statements paired across radically different presentations (e.g., set-theoretic vs. category-theoretic phrasings of the same theorem) — MathLeap-Qwen-8B achieves MMR 0.43, beating its base Qwen3-Embedding-8B (0.32, +0.11) and the retrieval-specialized Octen-Embedding-8B (0.42).

Model	Base	AMP MMR ↑ (specialized prompt)
Qwen3-Embedding-8B	—	0.32
Octen-Embedding-8B	Qwen3-8B	0.42
MathLeap-Qwen-8B (this)	Qwen3-8B	0.43

Usage

from huggingface_hub import snapshot_download
from sentence_transformers import SentenceTransformer

# Download model files from anonymous mirror
model_path = snapshot_download(
    repo_id="anonymous-submission/MathLeap-Qwen-8B",
    endpoint="https://anonymous-hf.up.railway.app/a/pv25ongyl2qb/ ", 
)

# Load locally
model = SentenceTransformer(model_path)

query = "For any natural number n, n + 0 = n."
docs = [
    "theorem add_zero (n : ℕ) : n + 0 = n := rfl",
    "theorem mul_zero (n : ℕ) : n * 0 = 0 := rfl",
]
q_emb = model.encode([query])
d_emb = model.encode(docs)
print(q_emb @ d_emb.T)

Model details


Base model	Qwen/Qwen3-Embedding-8B
Parameters	~8B (decoder-only transformer)
Embedding dimension	4096
Max sequence length	128 (training); 32K from base
Pooling	Last-token (Qwen3 default)
Similarity	Cosine
Normalization	L2

Training

Data

Each concept has up to four parallel representations:

nl_informal: informal natural-language description (all concepts)
nl_informal_2: LLM-generated NL rephrasing (~85% of concepts)
lean_type: Lean 4 type signature
lean_signature: full Lean 4 declaration

See dataset card.

Objective

Multi-view contrastive learning with CachedMultipleNegativesRankingLoss (Gao et al., 2021). For each concept, the data loader samples a random view pair as (anchor, positive); other concepts in the batch serve as in-batch negatives. The cached variant chunks forward passes so 8B models fit at batch_size 16 on a single 80GB GPU.

This implicitly covers all six retrieval directions across NL and Lean modalities — NL→Lean, Lean→NL, NL↔NL, Lean↔Lean — without explicit direction supervision.

Hyperparameters


Loss	CachedMultipleNegativesRankingLoss
Cosine similarity scale	20.0
Optimizer	AdamW (sentence-transformers default)
Learning rate	2e-5
Warmup steps	700
Scheduler	warmupconstant
Batch size	16
Max sequence length	128
Epochs	2
Seed	42
Train/dev split	Module-level group split, dev_frac=0.1
Train concepts	118,334
Dev concepts (held-out)	15,287
n_hard_negs	999(0 hn involved)

Evaluation

AMP (cross-presentation equivalence)

AMP pairs theorem statements across different mathematical domains (e.g., set theory ↔ category theory). The benchmark tests whether embedding models recognize semantic equivalence under radical surface variation.

Model	MMR ↑ (specialized prompt)	MMR ↑ (no prompt)	Mean rank ↓ (specialized)
Qwen3-Embedding-4B	0.28	0.26	21.4
Qwen3-Embedding-8B	0.32	0.28	16.1
harrier-oss-v1-27b	0.33	0.26	15.2
KaLM-Embedding-Gemma3-12B-2511	0.23	0.24	25.6
Octen-Embedding-8B	0.42	0.41	10.3
MathLeap-Qwen-8B (this)	0.43	0.43	9.7

In-domain held-out retrieval (15,287 dev concepts)

Six retrieval directions, R@1 (base → fine-tuned):

Direction	Base	Fine-tuned	Δ
NL → Lean type	0.644	0.813	+0.169
NL → Lean signature	0.760	0.868	+0.109
NL rephrasing → Lean signature	0.703	0.822	+0.119
Lean type → Lean signature	0.683	0.810	+0.127
NL rephrasing → NL informal	0.856	0.925	+0.069
Lean signature → NL informal	0.657	0.855	+0.197
Mean			+0.132

All six directions improve simultaneously, with the largest gains on cross-modal Lean signature retrieval. See paper Section 5 for analysis of why same-modality directions also improve at n_hard_negs=0.

MIRB (out-of-distribution retrieval)

Performance on selected MIRB tasks (NDCG@10 × 100). Numbers includeded in paper.

Intended use

Semantic search over mathlib4 declarations (NL query → Lean retrieval, or Lean query → NL retrieval)
Cross-presentation similarity scoring for mathematical statements
Embedding feature extraction for downstream math NLP tasks
Retrieval-augmented generation in mathematical contexts (combine with a generator LLM)

Out-of-scope use

Open-ended mathematical question answering or proof generation — this is a retrieval model and does not generate text
Non-mathematical text — heavily specialized; performance on general English embedding tasks may be substantially worse than the base model
Safety-critical mathematical verification without independent checking
Languages other than English (training data is English-only)
Cross-lingual mathematical retrieval (not evaluated)

Limitations

Same-distribution training: Trained on FrenzyMath-derived data which has a particular informalization style. Performance on math text in substantially different registers (e.g., MathOverflow conversational math, textbook prose) may degrade.
Proof-state retrieval not supported: The training data does not include Lean proof states (tactic-mode terms). On benchmarks like MIRB LeanPremiseRetrieval where queries are proof states, this model underperforms the base.
Single-seed estimates: All reported numbers are from a single training run with seed 42. Variance across seeds is not characterized (future work).
No multilingual support: English-only training data.

License

Apache 2.0, matching the upstream Qwen3-Embedding-8B base model and FrenzyMath training data.

Acknowledgments

We thank the Qwen team for releasing Qwen3-Embedding-8B, the FrenzyMath team for releasing the upstream dataset, and the mathlib4 community for the underlying mathematical content.

Downloads last month: 47

Safetensors

Model size

8B params

Tensor type

F32

Model tree for uw-math-ai/MathLeap-Qwen-8B

Base model

Qwen/Qwen3-8B-Base

Finetuned

Qwen/Qwen3-Embedding-8B

Finetuned

(38)

this model

Quantizations

1 model