---
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- mathematics
- scientific-papers
- retrieval
- matryoshka
base_model: allenai/specter2_base
library_name: sentence-transformers
pipeline_tag: sentence-similarity
language: en
license: apache-2.0
---
# math-embed
A 768-dimensional embedding model fine-tuned for mathematical document retrieval, with a focus on **combinatorics** and related areas (representation theory, symmetric functions, algebraic combinatorics). Built on [SPECTER2](https://huggingface.co/allenai/specter2_base) and trained using knowledge-graph-guided contrastive learning.
## Performance
Benchmarked on mathematical paper retrieval (108 queries, 4,794 paper chunks):
| Model | MRR | NDCG@10 |
|-------|-----|---------|
| **math-embed (this model)** | **0.816** | **0.736** |
| OpenAI text-embedding-3-small | 0.461 | 0.324 |
| SPECTER2 (proximity adapter) | 0.360 | 0.225 |
| SciNCL | 0.306 | 0.205 |
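For reference, MRR averages the reciprocal rank of the first relevant result per query. A minimal sketch of the metric (the toy ids are illustrative, not from the benchmark harness):

```python
import numpy as np

def mean_reciprocal_rank(ranked_ids, relevant_ids):
    """ranked_ids: per-query list of retrieved doc ids, best first.
    relevant_ids: per-query set of relevant doc ids."""
    rr = []
    for ranking, relevant in zip(ranked_ids, relevant_ids):
        # rank of the first relevant hit, or None if nothing relevant was retrieved
        rank = next((i + 1 for i, d in enumerate(ranking) if d in relevant), None)
        rr.append(1.0 / rank if rank else 0.0)
    return float(np.mean(rr))

# query 1 hits at rank 1, query 2 at rank 2 -> (1 + 0.5) / 2 = 0.75
print(mean_reciprocal_rank([["a", "b"], ["x", "y"]], [{"a"}, {"y"}]))  # 0.75
```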
## Usage
```python
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("RobBobin/math-embed")
# Embed queries and documents
queries = ["Kostka polynomials", "representation theory of symmetric groups"]
docs = ["We study the combinatorial properties of Kostka numbers..."]
query_embs = model.encode(queries)
doc_embs = model.encode(docs)
```
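Retrieval then reduces to ranking documents by cosine similarity between query and document embeddings. A minimal numpy sketch (the random arrays stand in for `model.encode` output):

```python
import numpy as np

def cosine_rank(query_embs, doc_embs):
    # L2-normalize both sides, then rank documents by dot product per query
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    scores = q @ d.T                      # (n_queries, n_docs)
    return np.argsort(-scores, axis=1)    # best document first

# stand-ins for model.encode(queries) / model.encode(docs)
rng = np.random.default_rng(0)
ranking = cosine_rank(rng.normal(size=(2, 768)), rng.normal(size=(5, 768)))
print(ranking.shape)  # (2, 5)
```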
### Matryoshka dimensions
Trained with Matryoshka Representation Learning, so you can truncate embeddings to smaller dimensions (512, 256, or 128) with graceful degradation:
```python
import numpy as np

# Use 256-dim embeddings for faster retrieval
embs = model.encode(texts)
embs_256 = embs[:, :256]
# re-normalize after truncation so cosine similarities stay well-scaled
embs_256 /= np.linalg.norm(embs_256, axis=1, keepdims=True)
```
## Training
### Method
- **Loss**: MultipleNegativesRankingLoss + MatryoshkaLoss
- **Training data**: 22,609 (anchor, positive) pairs generated from a knowledge graph of mathematical concepts
- **Direct pairs**: concept name/description → chunks from that concept's source papers
- **Edge pairs**: cross-concept pairs from knowledge graph edges (e.g., "generalizes", "extends")
- **Base model**: `allenai/specter2_base` (SciBERT further trained on 6M citation triplets)
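The pair construction from the graph can be sketched as follows (the dict schema is illustrative; the actual KG format is not specified in this card):

```python
def build_pairs(concepts, edges, paper_chunks):
    """concepts: {concept_id: {"text": ..., "papers": [...]}}
    edges: [(src_id, relation, dst_id)]
    paper_chunks: {paper_id: [chunk, ...]}"""
    pairs = []
    # direct pairs: concept description -> chunks from its source papers
    for cid, c in concepts.items():
        for pid in c["papers"]:
            for chunk in paper_chunks.get(pid, []):
                pairs.append((c["text"], chunk))
    # edge pairs: related concepts anchor each other's descriptions
    for src, _rel, dst in edges:
        pairs.append((concepts[src]["text"], concepts[dst]["text"]))
    return pairs

concepts = {
    "kostka": {"text": "Kostka numbers", "papers": ["p1"]},
    "ssyt": {"text": "semistandard Young tableaux", "papers": []},
}
edges = [("kostka", "generalizes", "ssyt")]
chunks = {"p1": ["We study Kostka numbers...", "Our main theorem..."]}
print(len(build_pairs(concepts, edges, chunks)))  # 3
```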
### Configuration
- Epochs: 3
- Batch size: 8 (effective 32 with gradient accumulation)
- Learning rate: 2e-5
- Max sequence length: 256 tokens
- Matryoshka dims: [768, 512, 256, 128]
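In numpy terms, MultipleNegativesRankingLoss is an in-batch softmax cross-entropy over scaled cosine similarities, and MatryoshkaLoss sums that loss over each truncated prefix of the embeddings. A minimal sketch of the objective (not the actual training code; the scale value is an assumption matching the sentence-transformers default):

```python
import numpy as np

def mnr_loss(q, d, scale=20.0):
    # cosine-normalize, score every in-batch (query, doc) pair,
    # then cross-entropy with the matching doc (the diagonal) as the label
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    d = d / np.linalg.norm(d, axis=1, keepdims=True)
    scores = scale * (q @ d.T)
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return -float(np.mean(np.diag(log_probs)))

def matryoshka_mnr_loss(q, d, dims=(768, 512, 256, 128)):
    # sum the ranking loss over each truncated embedding prefix
    return sum(mnr_loss(q[:, :k], d[:, :k]) for k in dims)

rng = np.random.default_rng(0)
anchors = rng.normal(size=(4, 768))
# matched pairs score much lower loss than random pairs
loss_matched = matryoshka_mnr_loss(anchors, anchors)
loss_random = matryoshka_mnr_loss(anchors, rng.normal(size=(4, 768)))
print(loss_matched < loss_random)  # True
```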
### Model lineage
```
BERT (Google, 110M params)
└─ SciBERT (Allen AI, retrained on scientific papers)
   └─ SPECTER2 base (Allen AI, + 6M citation triplets)
      └─ math-embed (this model, + KG-derived concept-chunk pairs)
```
## Approach
The knowledge graph was constructed by an LLM (GPT-4o-mini) from 75 mathematical research papers, identifying 559 concepts and 486 relationships. This graph provides structured ground truth: each concept maps to specific papers, and those papers' chunks serve as positive training examples.
This is a form of **knowledge distillation**: a large language model's understanding of mathematical relationships is distilled into a small, fast embedding model suitable for retrieval.
## Limitations
- Trained specifically on combinatorics papers (symmetric functions, representation theory, partition identities, algebraic combinatorics)
- May not generalize well to other areas of mathematics or other scientific domains without additional fine-tuning
- 256-token context window (standard for BERT-based models)
## Citation
See the accompanying paper: [*Knowledge-Graph-Guided Fine-Tuning of Embedding Models for Mathematical Document Retrieval*](https://huggingface.co/RobBobin/math-embed/blob/main/paper/math_embeddings.pdf)