Instructions to use uw-math-ai/MathLeap-Qwen-8B with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- sentence-transformers
How to use uw-math-ai/MathLeap-Qwen-8B with sentence-transformers:
from sentence_transformers import SentenceTransformer model = SentenceTransformer("uw-math-ai/MathLeap-Qwen-8B") sentences = [ "That is a happy person", "That is a happy dog", "That is a very happy person", "Today is a sunny day" ] embeddings = model.encode(sentences) similarities = model.similarity(embeddings, embeddings) print(similarities.shape) # [4, 4] - Notebooks
- Google Colab
- Kaggle
MathLeap-Qwen-8B
A retrieval-tuned embedding model for mathematical text. Fine-tuned from Qwen/Qwen3-Embedding-8B on mathlib4 concepts via multi-view contrastive learning.
On AMP โ a benchmark of mathematical statements paired across radically different presentations (e.g., set-theoretic vs. category-theoretic phrasings of the same theorem) โ MathLeap-Qwen-8B achieves MMR 0.43, beating its base Qwen3-Embedding-8B (0.32, +0.11) and the retrieval-specialized Octen-Embedding-8B (0.42).
| Model | Base | AMP MMR โ (specialized prompt) |
|---|---|---|
| Qwen3-Embedding-8B | โ | 0.32 |
| Octen-Embedding-8B | Qwen3-8B | 0.42 |
| MathLeap-Qwen-8B (this) | Qwen3-8B | 0.43 |
Usage
from huggingface_hub import snapshot_download
from sentence_transformers import SentenceTransformer
# Download model files from anonymous mirror
model_path = snapshot_download(
repo_id="anonymous-submission/MathLeap-Qwen-8B",
endpoint="https://anonymous-hf.up.railway.app/a/pv25ongyl2qb/ ",
)
# Load locally
model = SentenceTransformer(model_path)
query = "For any natural number n, n + 0 = n."
docs = [
"theorem add_zero (n : โ) : n + 0 = n := rfl",
"theorem mul_zero (n : โ) : n * 0 = 0 := rfl",
]
q_emb = model.encode([query])
d_emb = model.encode(docs)
print(q_emb @ d_emb.T)
Model details
| Base model | Qwen/Qwen3-Embedding-8B |
| Parameters | ~8B (decoder-only transformer) |
| Embedding dimension | 4096 |
| Max sequence length | 128 (training); 32K from base |
| Pooling | Last-token (Qwen3 default) |
| Similarity | Cosine |
| Normalization | L2 |
Training
Data
Each concept has up to four parallel representations:
nl_informal: informal natural-language description (all concepts)nl_informal_2: LLM-generated NL rephrasing (~85% of concepts)lean_type: Lean 4 type signaturelean_signature: full Lean 4 declaration
See dataset card.
Objective
Multi-view contrastive learning with
CachedMultipleNegativesRankingLoss
(Gao et al., 2021). For each concept, the data loader samples a random
view pair as (anchor, positive); other concepts in the batch serve as
in-batch negatives. The cached variant chunks forward passes so 8B models
fit at batch_size 16 on a single 80GB GPU.
This implicitly covers all six retrieval directions across NL and Lean modalities โ NLโLean, LeanโNL, NLโNL, LeanโLean โ without explicit direction supervision.
Hyperparameters
| Loss | CachedMultipleNegativesRankingLoss |
| Cosine similarity scale | 20.0 |
| Optimizer | AdamW (sentence-transformers default) |
| Learning rate | 2e-5 |
| Warmup steps | 700 |
| Scheduler | warmupconstant |
| Batch size | 16 |
| Max sequence length | 128 |
| Epochs | 2 |
| Seed | 42 |
| Train/dev split | Module-level group split, dev_frac=0.1 |
| Train concepts | 118,334 |
| Dev concepts (held-out) | 15,287 |
| n_hard_negs | 999(0 hn involved) |
Evaluation
AMP (cross-presentation equivalence)
AMP pairs theorem statements across different mathematical domains (e.g., set theory โ category theory). The benchmark tests whether embedding models recognize semantic equivalence under radical surface variation.
| Model | MMR โ (specialized prompt) | MMR โ (no prompt) | Mean rank โ (specialized) |
|---|---|---|---|
| Qwen3-Embedding-4B | 0.28 | 0.26 | 21.4 |
| Qwen3-Embedding-8B | 0.32 | 0.28 | 16.1 |
| harrier-oss-v1-27b | 0.33 | 0.26 | 15.2 |
| KaLM-Embedding-Gemma3-12B-2511 | 0.23 | 0.24 | 25.6 |
| Octen-Embedding-8B | 0.42 | 0.41 | 10.3 |
| MathLeap-Qwen-8B (this) | 0.43 | 0.43 | 9.7 |
In-domain held-out retrieval (15,287 dev concepts)
Six retrieval directions, R@1 (base โ fine-tuned):
| Direction | Base | Fine-tuned | ฮ |
|---|---|---|---|
| NL โ Lean type | 0.644 | 0.813 | +0.169 |
| NL โ Lean signature | 0.760 | 0.868 | +0.109 |
| NL rephrasing โ Lean signature | 0.703 | 0.822 | +0.119 |
| Lean type โ Lean signature | 0.683 | 0.810 | +0.127 |
| NL rephrasing โ NL informal | 0.856 | 0.925 | +0.069 |
| Lean signature โ NL informal | 0.657 | 0.855 | +0.197 |
| Mean | +0.132 |
All six directions improve simultaneously, with the largest gains on
cross-modal Lean signature retrieval. See paper Section 5 for analysis
of why same-modality directions also improve at n_hard_negs=0.
MIRB (out-of-distribution retrieval)
Performance on selected MIRB tasks (NDCG@10 ร 100). Numbers includeded in paper.
Intended use
- Semantic search over mathlib4 declarations (NL query โ Lean retrieval, or Lean query โ NL retrieval)
- Cross-presentation similarity scoring for mathematical statements
- Embedding feature extraction for downstream math NLP tasks
- Retrieval-augmented generation in mathematical contexts (combine with a generator LLM)
Out-of-scope use
- Open-ended mathematical question answering or proof generation โ this is a retrieval model and does not generate text
- Non-mathematical text โ heavily specialized; performance on general English embedding tasks may be substantially worse than the base model
- Safety-critical mathematical verification without independent checking
- Languages other than English (training data is English-only)
- Cross-lingual mathematical retrieval (not evaluated)
Limitations
Same-distribution training: Trained on FrenzyMath-derived data which has a particular informalization style. Performance on math text in substantially different registers (e.g., MathOverflow conversational math, textbook prose) may degrade.
Proof-state retrieval not supported: The training data does not include Lean proof states (tactic-mode terms). On benchmarks like MIRB LeanPremiseRetrieval where queries are proof states, this model underperforms the base.
Single-seed estimates: All reported numbers are from a single training run with seed 42. Variance across seeds is not characterized (future work).
No multilingual support: English-only training data.
License
Apache 2.0, matching the upstream Qwen3-Embedding-8B base model and FrenzyMath training data.
Acknowledgments
We thank the Qwen team for releasing Qwen3-Embedding-8B, the FrenzyMath team for releasing the upstream dataset, and the mathlib4 community for the underlying mathematical content.
- Downloads last month
- 22