---
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- mathematics
- scientific-papers
- retrieval
- matryoshka
base_model: allenai/specter2_base
library_name: sentence-transformers
pipeline_tag: sentence-similarity
language: en
license: apache-2.0
---

# math-embed

A 768-dimensional embedding model fine-tuned for mathematical document retrieval, with a focus on **combinatorics** and related areas (representation theory, symmetric functions, algebraic combinatorics). Built on [SPECTER2](https://huggingface.co/allenai/specter2_base) and trained using knowledge-graph-guided contrastive learning.

## Performance

Benchmarked on mathematical paper retrieval (108 queries, 4,794 paper chunks):

| Model | MRR | NDCG@10 |
|-------|-----|---------|
| **math-embed (this model)** | **0.816** | **0.736** |
| OpenAI text-embedding-3-small | 0.461 | 0.324 |
| SPECTER2 (proximity adapter) | 0.360 | 0.225 |
| SciNCL | 0.306 | 0.205 |

## Usage

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("RobBobin/math-embed")

# Embed queries and documents
queries = ["Kostka polynomials", "representation theory of symmetric groups"]
docs = ["We study the combinatorial properties of Kostka numbers..."]

query_embs = model.encode(queries)
doc_embs = model.encode(docs)
```

### Matryoshka dimensions

Trained with Matryoshka Representation Learning, so embeddings can be truncated to smaller dimensions (512, 256, 128) with graceful degradation in retrieval quality:

```python
# Use 256-dim embeddings for faster retrieval
embs = model.encode(texts)
embs_256 = embs[:, :256]
```

## Training

### Method

- **Loss**: MultipleNegativesRankingLoss + MatryoshkaLoss
- **Training data**: 22,609 (anchor, positive) pairs generated from a knowledge graph of mathematical concepts
  - **Direct pairs**: concept name/description → chunks from that concept's source papers
  - **Edge pairs**: cross-concept pairs from knowledge graph edges (e.g., "generalizes", "extends")
- **Base model**: `allenai/specter2_base` (SciBERT further pre-trained on 6M citation triplets)

### Configuration

- Epochs: 3
- Batch size: 8 (effective 32 with gradient accumulation)
- Learning rate: 2e-5
- Max sequence length: 256 tokens
- Matryoshka dims: [768, 512, 256, 128]

### Model lineage

```
BERT (Google, 110M params)
└─ SciBERT (Allen AI, pre-trained from scratch on scientific papers)
   └─ SPECTER2 base (Allen AI, + 6M citation triplets)
      └─ math-embed (this model, + KG-derived concept-chunk pairs)
```

## Approach

The knowledge graph was constructed by an LLM (GPT-4o-mini) from 75 mathematical research papers, identifying 559 concepts and 486 relationships. This graph provides structured ground truth: each concept maps to specific papers, and those papers' chunks serve as positive training examples.

This is a form of **knowledge distillation**: a large language model's understanding of mathematical relationships is distilled into a small, fast embedding model suitable for retrieval.

## Limitations

- Trained specifically on combinatorics papers (symmetric functions, representation theory, partition identities, algebraic combinatorics)
- May not generalize well to other areas of mathematics or other scientific domains without additional fine-tuning
- 256-token context window (standard for BERT-based models)

## Citation

See the accompanying paper: [*Knowledge-Graph-Guided Fine-Tuning of Embedding Models for Mathematical Document Retrieval*](https://huggingface.co/RobBobin/math-embed/blob/main/paper/math_embeddings.pdf)
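One practical note on the Matryoshka truncation shown above: slicing `embs[:, :256]` changes vector norms, so for cosine-similarity retrieval it helps to re-normalize the truncated vectors before scoring. A minimal NumPy sketch of the pattern (random vectors stand in for real `model.encode` output):

```python
import numpy as np

def truncate_and_normalize(embs: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` Matryoshka dimensions and L2-normalize each row."""
    truncated = embs[:, :dim]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    return truncated / norms

# Stand-in for model.encode(...) output: 3 vectors of 768 dims
rng = np.random.default_rng(0)
embs = rng.normal(size=(3, 768))

embs_256 = truncate_and_normalize(embs, 256)

# With unit-norm rows, cosine similarity is a plain matrix product
sims = embs_256 @ embs_256.T
```

After normalization, cosine similarity reduces to a dot product, which vector stores and approximate-nearest-neighbor indexes can exploit directly.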