---
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- mathematics
- scientific-papers
- retrieval
- matryoshka
base_model: allenai/specter2_base
library_name: sentence-transformers
pipeline_tag: sentence-similarity
language: en
license: apache-2.0
---

# math-embed

A 768-dimensional embedding model fine-tuned for mathematical document retrieval, with a focus on **combinatorics** and related areas (representation theory, symmetric functions, algebraic combinatorics). Built on [SPECTER2](https://huggingface.co/allenai/specter2_base) and trained using knowledge-graph-guided contrastive learning.

## Performance

Benchmarked on mathematical paper retrieval (108 queries over 4,794 paper chunks):

| Model | MRR | NDCG@10 |
|-------|-----|---------|
| **math-embed (this model)** | **0.816** | **0.736** |
| OpenAI text-embedding-3-small | 0.461 | 0.324 |
| SPECTER2 (proximity adapter) | 0.360 | 0.225 |
| SciNCL | 0.306 | 0.205 |
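
For reference, MRR and NDCG@10 over binary relevance judgments can be computed as below. This is a minimal sketch of the standard metric definitions, not the benchmark's actual evaluation harness:

```python
import numpy as np

def mrr(runs):
    """Mean reciprocal rank over per-query binary relevance lists (in ranked order)."""
    rr = []
    for rel in runs:
        hits = np.flatnonzero(rel)
        rr.append(1.0 / (hits[0] + 1) if hits.size else 0.0)
    return float(np.mean(rr))

def ndcg_at_k(runs, k=10):
    """NDCG@k for binary relevance; the ideal ranking is computed per query."""
    out = []
    for rel in runs:
        rel = np.asarray(rel, dtype=float)
        disc = 1.0 / np.log2(np.arange(2, rel.size + 2))  # log-discount per rank
        dcg = float(np.sum(rel[:k] * disc[:k]))
        ideal = np.sort(rel)[::-1]                        # best possible ordering
        idcg = float(np.sum(ideal[:k] * disc[:k]))
        out.append(dcg / idcg if idcg > 0 else 0.0)
    return float(np.mean(out))
```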

## Usage

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("RobBobin/math-embed")

# Embed queries and documents
queries = ["Kostka polynomials", "representation theory of symmetric groups"]
docs = ["We study the combinatorial properties of Kostka numbers..."]

query_embs = model.encode(queries)
doc_embs = model.encode(docs)
```
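
Documents can then be ranked by cosine similarity between query and document embeddings. A minimal sketch, using placeholder vectors in place of `model.encode` output (recent sentence-transformers releases also provide a `model.similarity` helper):

```python
import numpy as np

def rank_documents(query_emb, doc_embs):
    """Return document indices sorted by cosine similarity (best first)."""
    q = query_emb / np.linalg.norm(query_emb)
    d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    scores = d @ q
    order = np.argsort(-scores)
    return order, scores[order]

# Placeholder embeddings standing in for model.encode(...) output
rng = np.random.default_rng(0)
query_emb = rng.normal(size=768)
doc_embs = rng.normal(size=(5, 768))

order, scores = rank_documents(query_emb, doc_embs)
```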

### Matryoshka dimensions

Trained with Matryoshka Representation Learning, so you can truncate embeddings to smaller dimensions (512, 256, or 128) with graceful degradation:

```python
# Use 256-dim embeddings for faster retrieval
embs = model.encode(texts)
embs_256 = embs[:, :256]
```
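
Note that truncated vectors are no longer unit-length, so re-normalize them before computing cosine similarities. A sketch with placeholder embeddings standing in for `model.encode` output:

```python
import numpy as np

def truncate_and_normalize(embs, dim):
    """Truncate Matryoshka embeddings to `dim` and re-normalize to unit length."""
    out = np.asarray(embs, dtype=np.float64)[:, :dim]
    return out / np.linalg.norm(out, axis=1, keepdims=True)

# Placeholder for model.encode(texts) output
rng = np.random.default_rng(1)
embs = rng.normal(size=(4, 768))

embs_256 = truncate_and_normalize(embs, 256)
```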

## Training

### Method
- **Loss**: MultipleNegativesRankingLoss + MatryoshkaLoss
- **Training data**: 22,609 (anchor, positive) pairs generated from a knowledge graph of mathematical concepts
  - **Direct pairs**: concept name/description → chunks from that concept's source papers
  - **Edge pairs**: cross-concept pairs from knowledge graph edges (e.g., "generalizes", "extends")
- **Base model**: `allenai/specter2_base` (SciBERT pre-trained on 6M citation triplets)
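
The two pair types can be illustrated on a toy graph. The concept names, chunk texts, and dictionary schema below are made up for illustration and do not reflect the actual knowledge-graph format:

```python
# Toy knowledge graph: concepts map to paper chunks; edges relate concepts.
concepts = {
    "Kostka polynomials": ["Chunk about Kostka numbers ...", "Chunk on charge statistic ..."],
    "Hall-Littlewood polynomials": ["Chunk on Hall-Littlewood expansions ..."],
}
edges = [("Hall-Littlewood polynomials", "generalizes", "Kostka polynomials")]

pairs = []

# Direct pairs: concept name -> chunks from that concept's source papers
for name, chunks in concepts.items():
    for chunk in chunks:
        pairs.append((name, chunk))

# Edge pairs: cross-concept pairs from knowledge-graph edges
for src, _relation, dst in edges:
    for chunk in concepts[dst]:
        pairs.append((src, chunk))
```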

### Configuration
- Epochs: 3
- Batch size: 8 (effective 32 with gradient accumulation)
- Learning rate: 2e-5
- Max sequence length: 256 tokens
- Matryoshka dims: [768, 512, 256, 128]
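
To make the loss combination concrete: MultipleNegativesRankingLoss treats the other in-batch positives as negatives and applies cross-entropy over scaled cosine similarities, while MatryoshkaLoss applies the same loss to truncated embedding prefixes and combines the results. A simplified numpy illustration, not the sentence-transformers implementation; the scale value and the averaging over dimensions are assumptions:

```python
import numpy as np

def mnrl(anchors, positives, scale=20.0):
    """In-batch-negatives cross-entropy: positives[i] is the target for anchors[i]."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = scale * (a @ p.T)                   # (batch, batch) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)  # stabilize the softmax
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))   # diagonal entries = matching pairs

def matryoshka_mnrl(anchors, positives, dims=(768, 512, 256, 128)):
    """Average the ranking loss over truncated embedding prefixes."""
    return float(np.mean([mnrl(anchors[:, :d], positives[:, :d]) for d in dims]))

rng = np.random.default_rng(0)
anchors = rng.normal(size=(8, 768))
positives = anchors + 0.1 * rng.normal(size=(8, 768))  # near-duplicates as positives

loss = matryoshka_mnrl(anchors, positives)
```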

### Model lineage
```
BERT (Google, 110M params)
└─ SciBERT (Allen AI, retrained on scientific papers)
   └─ SPECTER2 base (Allen AI, + 6M citation triplets)
      └─ math-embed (this model, + KG-derived concept-chunk pairs)
```

## Approach

The knowledge graph was constructed by an LLM (GPT-4o-mini) from 75 mathematical research papers, identifying 559 concepts and 486 relationships. This graph provides structured ground truth: each concept maps to specific papers, and those papers' chunks serve as positive training examples.

This is a form of **knowledge distillation**: a large language model's understanding of mathematical relationships is distilled into a small, fast embedding model suitable for retrieval.

## Limitations

- Trained specifically on combinatorics papers (symmetric functions, representation theory, partition identities, algebraic combinatorics)
- May not generalize well to other areas of mathematics or other scientific domains without additional fine-tuning
- 256-token maximum sequence length, so longer documents must be chunked before embedding

## Citation

See the accompanying paper: [*Knowledge-Graph-Guided Fine-Tuning of Embedding Models for Mathematical Document Retrieval*](https://huggingface.co/RobBobin/math-embed/blob/main/paper/math_embeddings.pdf)