MathLeap-Qwen-8B

A retrieval-tuned embedding model for mathematical text. Fine-tuned from Qwen/Qwen3-Embedding-8B on mathlib4 concepts via multi-view contrastive learning.

On AMP โ€” a benchmark of mathematical statements paired across radically different presentations (e.g., set-theoretic vs. category-theoretic phrasings of the same theorem) โ€” MathLeap-Qwen-8B achieves MMR 0.43, beating its base Qwen3-Embedding-8B (0.32, +0.11) and the retrieval-specialized Octen-Embedding-8B (0.42).

Model Base AMP MMR โ†‘ (specialized prompt)
Qwen3-Embedding-8B โ€” 0.32
Octen-Embedding-8B Qwen3-8B 0.42
MathLeap-Qwen-8B (this) Qwen3-8B 0.43

Usage

from huggingface_hub import snapshot_download
from sentence_transformers import SentenceTransformer

# Download model files from anonymous mirror
model_path = snapshot_download(
    repo_id="anonymous-submission/MathLeap-Qwen-8B",
    endpoint="https://anonymous-hf.up.railway.app/a/pv25ongyl2qb/ ", 
)

# Load locally
model = SentenceTransformer(model_path)

query = "For any natural number n, n + 0 = n."
docs = [
    "theorem add_zero (n : โ„•) : n + 0 = n := rfl",
    "theorem mul_zero (n : โ„•) : n * 0 = 0 := rfl",
]
q_emb = model.encode([query])
d_emb = model.encode(docs)
print(q_emb @ d_emb.T)

Model details

Base model Qwen/Qwen3-Embedding-8B
Parameters ~8B (decoder-only transformer)
Embedding dimension 4096
Max sequence length 128 (training); 32K from base
Pooling Last-token (Qwen3 default)
Similarity Cosine
Normalization L2

Training

Data

Each concept has up to four parallel representations:

  • nl_informal: informal natural-language description (all concepts)
  • nl_informal_2: LLM-generated NL rephrasing (~85% of concepts)
  • lean_type: Lean 4 type signature
  • lean_signature: full Lean 4 declaration

See dataset card.

Objective

Multi-view contrastive learning with CachedMultipleNegativesRankingLoss (Gao et al., 2021). For each concept, the data loader samples a random view pair as (anchor, positive); other concepts in the batch serve as in-batch negatives. The cached variant chunks forward passes so 8B models fit at batch_size 16 on a single 80GB GPU.

This implicitly covers all six retrieval directions across NL and Lean modalities โ€” NLโ†’Lean, Leanโ†’NL, NLโ†”NL, Leanโ†”Lean โ€” without explicit direction supervision.

Hyperparameters

Loss CachedMultipleNegativesRankingLoss
Cosine similarity scale 20.0
Optimizer AdamW (sentence-transformers default)
Learning rate 2e-5
Warmup steps 700
Scheduler warmupconstant
Batch size 16
Max sequence length 128
Epochs 2
Seed 42
Train/dev split Module-level group split, dev_frac=0.1
Train concepts 118,334
Dev concepts (held-out) 15,287
n_hard_negs 999(0 hn involved)

Evaluation

AMP (cross-presentation equivalence)

AMP pairs theorem statements across different mathematical domains (e.g., set theory โ†” category theory). The benchmark tests whether embedding models recognize semantic equivalence under radical surface variation.

Model MMR โ†‘ (specialized prompt) MMR โ†‘ (no prompt) Mean rank โ†“ (specialized)
Qwen3-Embedding-4B 0.28 0.26 21.4
Qwen3-Embedding-8B 0.32 0.28 16.1
harrier-oss-v1-27b 0.33 0.26 15.2
KaLM-Embedding-Gemma3-12B-2511 0.23 0.24 25.6
Octen-Embedding-8B 0.42 0.41 10.3
MathLeap-Qwen-8B (this) 0.43 0.43 9.7

In-domain held-out retrieval (15,287 dev concepts)

Six retrieval directions, R@1 (base โ†’ fine-tuned):

Direction Base Fine-tuned ฮ”
NL โ†’ Lean type 0.644 0.813 +0.169
NL โ†’ Lean signature 0.760 0.868 +0.109
NL rephrasing โ†’ Lean signature 0.703 0.822 +0.119
Lean type โ†’ Lean signature 0.683 0.810 +0.127
NL rephrasing โ†’ NL informal 0.856 0.925 +0.069
Lean signature โ†’ NL informal 0.657 0.855 +0.197
Mean +0.132

All six directions improve simultaneously, with the largest gains on cross-modal Lean signature retrieval. See paper Section 5 for analysis of why same-modality directions also improve at n_hard_negs=0.

MIRB (out-of-distribution retrieval)

Performance on selected MIRB tasks (NDCG@10 ร— 100). Numbers includeded in paper.

Intended use

  • Semantic search over mathlib4 declarations (NL query โ†’ Lean retrieval, or Lean query โ†’ NL retrieval)
  • Cross-presentation similarity scoring for mathematical statements
  • Embedding feature extraction for downstream math NLP tasks
  • Retrieval-augmented generation in mathematical contexts (combine with a generator LLM)

Out-of-scope use

  • Open-ended mathematical question answering or proof generation โ€” this is a retrieval model and does not generate text
  • Non-mathematical text โ€” heavily specialized; performance on general English embedding tasks may be substantially worse than the base model
  • Safety-critical mathematical verification without independent checking
  • Languages other than English (training data is English-only)
  • Cross-lingual mathematical retrieval (not evaluated)

Limitations

  1. Same-distribution training: Trained on FrenzyMath-derived data which has a particular informalization style. Performance on math text in substantially different registers (e.g., MathOverflow conversational math, textbook prose) may degrade.

  2. Proof-state retrieval not supported: The training data does not include Lean proof states (tactic-mode terms). On benchmarks like MIRB LeanPremiseRetrieval where queries are proof states, this model underperforms the base.

  3. Single-seed estimates: All reported numbers are from a single training run with seed 42. Variance across seeds is not characterized (future work).

  4. No multilingual support: English-only training data.

License

Apache 2.0, matching the upstream Qwen3-Embedding-8B base model and FrenzyMath training data.

Acknowledgments

We thank the Qwen team for releasing Qwen3-Embedding-8B, the FrenzyMath team for releasing the upstream dataset, and the mathlib4 community for the underlying mathematical content.

Downloads last month
22
Safetensors
Model size
8B params
Tensor type
F32
ยท
Inference Providers NEW
This model isn't deployed by any Inference Provider. ๐Ÿ™‹ Ask for provider support

Model tree for uw-math-ai/MathLeap-Qwen-8B

Finetuned
(29)
this model