---
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- mathematics
- scientific-papers
- retrieval
- matryoshka
base_model: allenai/specter2_base
library_name: sentence-transformers
pipeline_tag: sentence-similarity
language: en
license: apache-2.0
---

# math-embed

A 768-dimensional embedding model fine-tuned for mathematical document retrieval, with a focus on **combinatorics** and related areas (representation theory, symmetric functions, algebraic combinatorics). Built on [SPECTER2](https://huggingface.co/allenai/specter2_base) and trained using knowledge-graph-guided contrastive learning.

## Performance

Benchmarked on mathematical paper retrieval (108 queries over 4,794 paper chunks):

| Model | MRR | NDCG@10 |
|-------|-----|---------|
| **math-embed (this model)** | **0.816** | **0.736** |
| OpenAI text-embedding-3-small | 0.461 | 0.324 |
| SPECTER2 (proximity adapter) | 0.360 | 0.225 |
| SciNCL | 0.306 | 0.205 |
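
For reference, MRR and NDCG@10 over binary relevance judgments can be computed as below. This is a minimal sketch of the standard metric definitions, not the benchmark's actual evaluation harness:

```python
import numpy as np

def mrr(runs):
    """Mean reciprocal rank over per-query binary relevance lists (in ranked order)."""
    rr = []
    for rel in runs:
        hits = np.flatnonzero(rel)
        rr.append(1.0 / (hits[0] + 1) if hits.size else 0.0)
    return float(np.mean(rr))

def ndcg_at_k(runs, k=10):
    """NDCG@k for binary relevance; the ideal ranking is computed per query."""
    out = []
    for rel in runs:
        rel = np.asarray(rel, dtype=float)
        disc = 1.0 / np.log2(np.arange(2, rel.size + 2))  # log-discount per rank
        dcg = float(np.sum(rel[:k] * disc[:k]))
        ideal = np.sort(rel)[::-1]                        # best possible ordering
        idcg = float(np.sum(ideal[:k] * disc[:k]))
        out.append(dcg / idcg if idcg > 0 else 0.0)
    return float(np.mean(out))
```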

## Usage

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("RobBobin/math-embed")

# Embed queries and documents
queries = ["Kostka polynomials", "representation theory of symmetric groups"]
docs = ["We study the combinatorial properties of Kostka numbers..."]

query_embs = model.encode(queries)
doc_embs = model.encode(docs)
```
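
Documents can then be ranked by cosine similarity between query and document embeddings. A minimal sketch, using placeholder vectors in place of `model.encode` output (recent sentence-transformers releases also provide a `model.similarity` helper):

```python
import numpy as np

def rank_documents(query_emb, doc_embs):
    """Return document indices sorted by cosine similarity (best first)."""
    q = query_emb / np.linalg.norm(query_emb)
    d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    scores = d @ q
    order = np.argsort(-scores)
    return order, scores[order]

# Placeholder embeddings standing in for model.encode(...) output
rng = np.random.default_rng(0)
query_emb = rng.normal(size=768)
doc_embs = rng.normal(size=(5, 768))

order, scores = rank_documents(query_emb, doc_embs)
```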

### Matryoshka dimensions

Trained with Matryoshka Representation Learning, so you can truncate embeddings to smaller dimensions (512, 256, or 128) with graceful degradation:

```python
# Use 256-dim embeddings for faster retrieval
embs = model.encode(texts)
embs_256 = embs[:, :256]
```
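
Note that truncated vectors are no longer unit-length, so re-normalize them before computing cosine similarities. A sketch with placeholder embeddings standing in for `model.encode` output:

```python
import numpy as np

def truncate_and_normalize(embs, dim):
    """Truncate Matryoshka embeddings to `dim` and re-normalize to unit length."""
    out = np.asarray(embs, dtype=np.float64)[:, :dim]
    return out / np.linalg.norm(out, axis=1, keepdims=True)

# Placeholder for model.encode(texts) output
rng = np.random.default_rng(1)
embs = rng.normal(size=(4, 768))

embs_256 = truncate_and_normalize(embs, 256)
```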

## Training

### Method
- **Loss**: MultipleNegativesRankingLoss + MatryoshkaLoss
- **Training data**: 22,609 (anchor, positive) pairs generated from a knowledge graph of mathematical concepts
  - **Direct pairs**: concept name/description → chunks from that concept's source papers
  - **Edge pairs**: cross-concept pairs from knowledge graph edges (e.g., "generalizes", "extends")
- **Base model**: `allenai/specter2_base` (SciBERT pre-trained on 6M citation triplets)
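
The two pair types can be illustrated on a toy graph. The concept names, chunk texts, and dictionary schema below are made up for illustration and do not reflect the actual knowledge-graph format:

```python
# Toy knowledge graph: concepts map to paper chunks; edges relate concepts.
concepts = {
    "Kostka polynomials": ["Chunk about Kostka numbers ...", "Chunk on charge statistic ..."],
    "Hall-Littlewood polynomials": ["Chunk on Hall-Littlewood expansions ..."],
}
edges = [("Hall-Littlewood polynomials", "generalizes", "Kostka polynomials")]

pairs = []

# Direct pairs: concept name -> chunks from that concept's source papers
for name, chunks in concepts.items():
    for chunk in chunks:
        pairs.append((name, chunk))

# Edge pairs: cross-concept pairs from knowledge-graph edges
for src, _relation, dst in edges:
    for chunk in concepts[dst]:
        pairs.append((src, chunk))
```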

### Configuration
- Epochs: 3
- Batch size: 8 (effective 32 with gradient accumulation)
- Learning rate: 2e-5
- Max sequence length: 256 tokens
- Matryoshka dims: [768, 512, 256, 128]
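
To make the loss combination concrete: MultipleNegativesRankingLoss treats the other in-batch positives as negatives and applies cross-entropy over scaled cosine similarities, while MatryoshkaLoss applies the same loss to truncated embedding prefixes and combines the results. A simplified numpy illustration, not the sentence-transformers implementation; the scale value and the averaging over dimensions are assumptions:

```python
import numpy as np

def mnrl(anchors, positives, scale=20.0):
    """In-batch-negatives cross-entropy: positives[i] is the target for anchors[i]."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = scale * (a @ p.T)                   # (batch, batch) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)  # stabilize the softmax
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))   # diagonal entries = matching pairs

def matryoshka_mnrl(anchors, positives, dims=(768, 512, 256, 128)):
    """Average the ranking loss over truncated embedding prefixes."""
    return float(np.mean([mnrl(anchors[:, :d], positives[:, :d]) for d in dims]))

rng = np.random.default_rng(0)
anchors = rng.normal(size=(8, 768))
positives = anchors + 0.1 * rng.normal(size=(8, 768))  # near-duplicates as positives

loss = matryoshka_mnrl(anchors, positives)
```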

### Model lineage
```
BERT (Google, 110M params)
└─ SciBERT (Allen AI, retrained on scientific papers)
   └─ SPECTER2 base (Allen AI, + 6M citation triplets)
      └─ math-embed (this model, + KG-derived concept-chunk pairs)
```

## Approach

The knowledge graph was constructed by an LLM (GPT-4o-mini) from 75 mathematical research papers, identifying 559 concepts and 486 relationships. This graph provides structured ground truth: each concept maps to specific papers, and those papers' chunks serve as positive training examples.

This is a form of **knowledge distillation**: a large language model's understanding of mathematical relationships is distilled into a small, fast embedding model suitable for retrieval.

## Limitations

- Trained specifically on combinatorics papers (symmetric functions, representation theory, partition identities, algebraic combinatorics)
- May not generalize well to other areas of mathematics or other scientific domains without additional fine-tuning
- 256-token maximum sequence length, so longer documents must be chunked before embedding

## Citation

See the accompanying paper: [*Knowledge-Graph-Guided Fine-Tuning of Embedding Models for Mathematical Document Retrieval*](https://huggingface.co/RobBobin/math-embed/blob/main/paper/math_embeddings.pdf)