---
tags:
  - sentence-transformers
  - sentence-similarity
  - feature-extraction
  - mathematics
  - scientific-papers
  - retrieval
  - matryoshka
base_model: allenai/specter2_base
library_name: sentence-transformers
pipeline_tag: sentence-similarity
language: en
license: apache-2.0
---

# math-embed

A 768-dimensional embedding model fine-tuned for mathematical document retrieval, with a focus on combinatorics and related areas (representation theory, symmetric functions, algebraic combinatorics). Built on SPECTER2 and trained using knowledge-graph-guided contrastive learning.

## Performance

Benchmarked on mathematical paper retrieval (108 queries, 4,794 paper chunks):

| Model | MRR | NDCG@10 |
|---|---|---|
| **math-embed** (this model) | **0.816** | **0.736** |
| OpenAI text-embedding-3-small | 0.461 | 0.324 |
| SPECTER2 (proximity adapter) | 0.360 | 0.225 |
| SciNCL | 0.306 | 0.205 |

## Usage

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("RobBobin/math-embed")

# Embed queries and documents
queries = ["Kostka polynomials", "representation theory of symmetric groups"]
docs = ["We study the combinatorial properties of Kostka numbers..."]

query_embs = model.encode(queries)
doc_embs = model.encode(docs)
```
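For retrieval, documents are ranked by cosine similarity between query and document embeddings. A minimal NumPy sketch of that ranking step, using random placeholder arrays in place of the `model.encode(...)` outputs (shapes match the model's 768-dimensional output):

```python
import numpy as np

rng = np.random.default_rng(0)
# Placeholders standing in for model.encode(queries) / model.encode(docs).
query_embs = rng.normal(size=(2, 768))
doc_embs = rng.normal(size=(5, 768))

def cosine_scores(a, b):
    # Normalize rows to unit length, then a dot product gives cosine similarity.
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T                      # shape (n_queries, n_docs)

scores = cosine_scores(query_embs, doc_embs)
ranking = np.argsort(-scores, axis=1)   # per-query doc indices, best first
```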

## Matryoshka dimensions

Trained with Matryoshka Representation Learning, so embeddings can be truncated to smaller dimensions (512, 256, or 128) with graceful degradation:

```python
# Use 256-dim embeddings for faster retrieval
embs = model.encode(texts)
embs_256 = embs[:, :256]
```
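One caveat worth knowing: a prefix of a unit-norm vector is shorter than unit length, so truncated embeddings should be re-normalized before cosine similarity. A sketch with placeholder arrays in place of `model.encode(texts)`:

```python
import numpy as np

rng = np.random.default_rng(0)
embs = rng.normal(size=(4, 768))                          # placeholder output
embs = embs / np.linalg.norm(embs, axis=1, keepdims=True)  # unit-norm full vectors

embs_256 = embs[:, :256]
# The 256-dim prefixes are no longer unit length; re-normalize so that
# dot products on the truncated embeddings are true cosine similarities.
embs_256 = embs_256 / np.linalg.norm(embs_256, axis=1, keepdims=True)
```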

## Training

### Method

- **Loss:** MultipleNegativesRankingLoss + MatryoshkaLoss
- **Training data:** 22,609 (anchor, positive) pairs generated from a knowledge graph of mathematical concepts
  - Direct pairs: concept name/description → chunks from that concept's source papers
  - Edge pairs: cross-concept pairs from knowledge graph edges (e.g., "generalizes", "extends")
- **Base model:** allenai/specter2_base (SciBERT pre-trained on 6M citation triplets)
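To make the loss combination concrete, here is a NumPy sketch of the underlying math: MultipleNegativesRankingLoss is a cross-entropy over in-batch negatives, and MatryoshkaLoss averages that loss over truncated embedding prefixes. This is an illustration of the objective, not the sentence-transformers implementation (which runs on PyTorch tensors with a learnable scale):

```python
import numpy as np

def mnr_loss(anchor, positive, scale=20.0):
    """In-batch-negatives cross-entropy: positive_j acts as a negative
    for every anchor_i with i != j; targets lie on the diagonal."""
    a = anchor / np.linalg.norm(anchor, axis=1, keepdims=True)
    p = positive / np.linalg.norm(positive, axis=1, keepdims=True)
    scores = scale * (a @ p.T)                    # (batch, batch) cosine scores
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

def matryoshka_mnr_loss(anchor, positive, dims=(768, 512, 256, 128)):
    """Matryoshka wrapper: apply the inner loss to each truncated prefix
    so that every dimensionality learns a useful representation."""
    return sum(mnr_loss(anchor[:, :d], positive[:, :d]) for d in dims) / len(dims)
```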

### Configuration

- Epochs: 3
- Batch size: 8 (effective 32 with gradient accumulation)
- Learning rate: 2e-5
- Max sequence length: 256 tokens
- Matryoshka dims: [768, 512, 256, 128]
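As a configuration sketch only, these settings map onto the sentence-transformers v3 trainer roughly as follows; `gradient_accumulation_steps=4` is inferred from the stated 8 → 32 effective batch, and the output path is illustrative, not the authors' actual setup:

```python
from sentence_transformers import SentenceTransformerTrainingArguments

args = SentenceTransformerTrainingArguments(
    output_dir="math-embed-checkpoints",  # illustrative path
    num_train_epochs=3,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,        # assumed: 8 x 4 = effective batch of 32
    learning_rate=2e-5,
)
# model.max_seq_length = 256  # cap inputs to the stated sequence length
```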

## Model lineage

```
BERT (Google, 110M params)
  └─ SciBERT (Allen AI, retrained on scientific papers)
      └─ SPECTER2 base (Allen AI, + 6M citation triplets)
          └─ math-embed (this model, + KG-derived concept-chunk pairs)
```

## Approach

The knowledge graph was constructed by an LLM (GPT-4o-mini) from 75 mathematical research papers, identifying 559 concepts and 486 relationships. This graph provides structured ground truth: each concept maps to specific papers, and those papers' chunks serve as positive training examples.
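The pair-generation idea can be sketched as follows. All names, data structures, and example content here are hypothetical illustrations of the scheme described above, not the authors' actual pipeline:

```python
# Hypothetical KG fragment: concepts map to source papers, papers to text chunks.
concepts = {
    "Kostka polynomial": {
        "description": "Polynomials refining Kostka numbers.",
        "papers": ["p1"],
    },
    "Hall-Littlewood polynomial": {
        "description": "A one-parameter family of symmetric functions.",
        "papers": ["p2"],
    },
}
paper_chunks = {
    "p1": ["...chunk discussing Kostka polynomials..."],
    "p2": ["...chunk discussing Hall-Littlewood polynomials..."],
}
edges = [("Kostka polynomial", "generalizes", "Hall-Littlewood polynomial")]

def direct_pairs(concepts, paper_chunks):
    """Concept name/description -> chunks from that concept's source papers."""
    for name, info in concepts.items():
        anchor = f"{name}: {info['description']}"
        for pid in info["papers"]:
            for chunk in paper_chunks[pid]:
                yield (anchor, chunk)

def edge_pairs(concepts, paper_chunks, edges):
    """Cross-concept pairs: chunks of two concepts linked by a KG edge."""
    for src, _relation, dst in edges:
        for pid_a in concepts[src]["papers"]:
            for pid_b in concepts[dst]["papers"]:
                for chunk_a in paper_chunks[pid_a]:
                    for chunk_b in paper_chunks[pid_b]:
                        yield (chunk_a, chunk_b)
```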

This is a form of knowledge distillation: a large language model's understanding of mathematical relationships is distilled into a small, fast embedding model suitable for retrieval.

## Limitations

- Trained specifically on combinatorics papers (symmetric functions, representation theory, partition identities, algebraic combinatorics)
- May not generalize well to other areas of mathematics or other scientific domains without additional fine-tuning
- 256-token context window (standard for BERT-based models)

## Citation

See the accompanying paper: *Knowledge-Graph-Guided Fine-Tuning of Embedding Models for Mathematical Document Retrieval*.