---
tags:
  - sentence-transformers
  - sentence-similarity
  - feature-extraction
  - mathematics
  - scientific-papers
  - retrieval
  - matryoshka
base_model: allenai/specter2_base
library_name: sentence-transformers
pipeline_tag: sentence-similarity
language: en
license: apache-2.0
---

# math-embed

A 768-dimensional embedding model fine-tuned for mathematical document retrieval, with a focus on combinatorics and related areas (representation theory, symmetric functions, algebraic combinatorics). Built on SPECTER2 and trained using knowledge-graph-guided contrastive learning.

## Performance

Benchmarked on mathematical paper retrieval (108 queries, 4,794 paper chunks):

| Model | MRR | NDCG@10 |
|---|---|---|
| **math-embed** (this model) | **0.816** | **0.736** |
| OpenAI text-embedding-3-small | 0.461 | 0.324 |
| SPECTER2 (proximity adapter) | 0.360 | 0.225 |
| SciNCL | 0.306 | 0.205 |

## Usage

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("RobBobin/math-embed")

# Embed queries and documents
queries = ["Kostka polynomials", "representation theory of symmetric groups"]
docs = ["We study the combinatorial properties of Kostka numbers..."]

query_embs = model.encode(queries)
doc_embs = model.encode(docs)
```
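For retrieval, documents are ranked by cosine similarity between query and document embeddings. A minimal NumPy sketch of that ranking step, using random placeholder arrays in place of the `model.encode(...)` outputs (shapes match the model's 768-dimensional output):

```python
import numpy as np

rng = np.random.default_rng(0)
# Placeholders standing in for model.encode(queries) / model.encode(docs).
query_embs = rng.normal(size=(2, 768))
doc_embs = rng.normal(size=(5, 768))

def cosine_scores(a, b):
    # Normalize rows to unit length, then a dot product gives cosine similarity.
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T                      # shape (n_queries, n_docs)

scores = cosine_scores(query_embs, doc_embs)
ranking = np.argsort(-scores, axis=1)   # per-query doc indices, best first
```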

## Matryoshka dimensions

Trained with Matryoshka Representation Learning, so embeddings can be truncated to smaller dimensions (512, 256, or 128) with graceful degradation:

```python
# Use 256-dim embeddings for faster retrieval
embs = model.encode(texts)
embs_256 = embs[:, :256]
```
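One caveat worth knowing: a prefix of a unit-norm vector is shorter than unit length, so truncated embeddings should be re-normalized before cosine similarity. A sketch with placeholder arrays in place of `model.encode(texts)`:

```python
import numpy as np

rng = np.random.default_rng(0)
embs = rng.normal(size=(4, 768))                          # placeholder output
embs = embs / np.linalg.norm(embs, axis=1, keepdims=True)  # unit-norm full vectors

embs_256 = embs[:, :256]
# The 256-dim prefixes are no longer unit length; re-normalize so that
# dot products on the truncated embeddings are true cosine similarities.
embs_256 = embs_256 / np.linalg.norm(embs_256, axis=1, keepdims=True)
```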

## Training

### Method

- **Loss:** MultipleNegativesRankingLoss + MatryoshkaLoss
- **Training data:** 22,609 (anchor, positive) pairs generated from a knowledge graph of mathematical concepts
  - Direct pairs: concept name/description → chunks from that concept's source papers
  - Edge pairs: cross-concept pairs from knowledge graph edges (e.g., "generalizes", "extends")
- **Base model:** allenai/specter2_base (SciBERT pre-trained on 6M citation triplets)
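To make the loss combination concrete, here is a NumPy sketch of the underlying math: MultipleNegativesRankingLoss is a cross-entropy over in-batch negatives, and MatryoshkaLoss averages that loss over truncated embedding prefixes. This is an illustration of the objective, not the sentence-transformers implementation (which runs on PyTorch tensors with a learnable scale):

```python
import numpy as np

def mnr_loss(anchor, positive, scale=20.0):
    """In-batch-negatives cross-entropy: positive_j acts as a negative
    for every anchor_i with i != j; targets lie on the diagonal."""
    a = anchor / np.linalg.norm(anchor, axis=1, keepdims=True)
    p = positive / np.linalg.norm(positive, axis=1, keepdims=True)
    scores = scale * (a @ p.T)                    # (batch, batch) cosine scores
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

def matryoshka_mnr_loss(anchor, positive, dims=(768, 512, 256, 128)):
    """Matryoshka wrapper: apply the inner loss to each truncated prefix
    so that every dimensionality learns a useful representation."""
    return sum(mnr_loss(anchor[:, :d], positive[:, :d]) for d in dims) / len(dims)
```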

### Configuration

- Epochs: 3
- Batch size: 8 (effective 32 with gradient accumulation)
- Learning rate: 2e-5
- Max sequence length: 256 tokens
- Matryoshka dims: [768, 512, 256, 128]
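As a configuration sketch only, these settings map onto the sentence-transformers v3 trainer roughly as follows; `gradient_accumulation_steps=4` is inferred from the stated 8 → 32 effective batch, and the output path is illustrative, not the authors' actual setup:

```python
from sentence_transformers import SentenceTransformerTrainingArguments

args = SentenceTransformerTrainingArguments(
    output_dir="math-embed-checkpoints",  # illustrative path
    num_train_epochs=3,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,        # assumed: 8 x 4 = effective batch of 32
    learning_rate=2e-5,
)
# model.max_seq_length = 256  # cap inputs to the stated sequence length
```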

## Model lineage

```
BERT (Google, 110M params)
  └─ SciBERT (Allen AI, retrained on scientific papers)
      └─ SPECTER2 base (Allen AI, + 6M citation triplets)
          └─ math-embed (this model, + KG-derived concept-chunk pairs)
```

## Approach

The knowledge graph was constructed by an LLM (GPT-4o-mini) from 75 mathematical research papers, identifying 559 concepts and 486 relationships. This graph provides structured ground truth: each concept maps to specific papers, and those papers' chunks serve as positive training examples.
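The pair-generation idea can be sketched as follows. All names, data structures, and example content here are hypothetical illustrations of the scheme described above, not the authors' actual pipeline:

```python
# Hypothetical KG fragment: concepts map to source papers, papers to text chunks.
concepts = {
    "Kostka polynomial": {
        "description": "Polynomials refining Kostka numbers.",
        "papers": ["p1"],
    },
    "Hall-Littlewood polynomial": {
        "description": "A one-parameter family of symmetric functions.",
        "papers": ["p2"],
    },
}
paper_chunks = {
    "p1": ["...chunk discussing Kostka polynomials..."],
    "p2": ["...chunk discussing Hall-Littlewood polynomials..."],
}
edges = [("Kostka polynomial", "generalizes", "Hall-Littlewood polynomial")]

def direct_pairs(concepts, paper_chunks):
    """Concept name/description -> chunks from that concept's source papers."""
    for name, info in concepts.items():
        anchor = f"{name}: {info['description']}"
        for pid in info["papers"]:
            for chunk in paper_chunks[pid]:
                yield (anchor, chunk)

def edge_pairs(concepts, paper_chunks, edges):
    """Cross-concept pairs: chunks of two concepts linked by a KG edge."""
    for src, _relation, dst in edges:
        for pid_a in concepts[src]["papers"]:
            for pid_b in concepts[dst]["papers"]:
                for chunk_a in paper_chunks[pid_a]:
                    for chunk_b in paper_chunks[pid_b]:
                        yield (chunk_a, chunk_b)
```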

This is a form of knowledge distillation: a large language model's understanding of mathematical relationships is distilled into a small, fast embedding model suitable for retrieval.

## Limitations

- Trained specifically on combinatorics papers (symmetric functions, representation theory, partition identities, algebraic combinatorics)
- May not generalize well to other areas of mathematics or other scientific domains without additional fine-tuning
- 256-token context window (standard for BERT-based models)

## Citation

See the accompanying paper: *Knowledge-Graph-Guided Fine-Tuning of Embedding Models for Mathematical Document Retrieval*.