---
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- mathematics
- scientific-papers
- retrieval
- matryoshka
base_model: allenai/specter2_base
library_name: sentence-transformers
pipeline_tag: sentence-similarity
language: en
license: apache-2.0
---

# math-embed

A 768-dimensional embedding model fine-tuned for mathematical document retrieval, with a focus on **combinatorics** and related areas (representation theory, symmetric functions, algebraic combinatorics). Built on [SPECTER2](https://huggingface.co/allenai/specter2_base) and trained using knowledge-graph-guided contrastive learning.

## Performance

Benchmarked on mathematical paper retrieval (108 queries, 4,794 paper chunks):

| Model | MRR | NDCG@10 |
|-------|-----|---------|
| **math-embed (this model)** | **0.816** | **0.736** |
| OpenAI text-embedding-3-small | 0.461 | 0.324 |
| SPECTER2 (proximity adapter) | 0.360 | 0.225 |
| SciNCL | 0.306 | 0.205 |

## Usage

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("RobBobin/math-embed")

# Embed queries and documents
queries = ["Kostka polynomials", "representation theory of symmetric groups"]
docs = ["We study the combinatorial properties of Kostka numbers..."]

query_embs = model.encode(queries)
doc_embs = model.encode(docs)
```
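Retrieval then reduces to ranking document embeddings by cosine similarity against the query embedding. A minimal NumPy sketch, with small placeholder vectors standing in for `model.encode(...)` output:

```python
import numpy as np

def rank_by_cosine(query_emb, doc_embs):
    """Return document indices sorted by descending cosine similarity to the query."""
    q = query_emb / np.linalg.norm(query_emb)
    d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    scores = d @ q
    return np.argsort(-scores), scores

# Illustrative vectors in place of real 768-dim embeddings
query_emb = np.array([1.0, 0.0, 0.0])
doc_embs = np.array([[0.9, 0.1, 0.0], [0.0, 1.0, 0.0]])
order, scores = rank_by_cosine(query_emb, doc_embs)
```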

### Matryoshka dimensions

Trained with Matryoshka Representation Learning, so you can truncate embeddings to smaller dimensions (512, 256, 128) with graceful degradation:

```python
import numpy as np

# Use 256-dim embeddings for faster retrieval; re-normalize after
# truncation so cosine/dot-product scores remain comparable
embs_256 = model.encode(texts)[:, :256]
embs_256 /= np.linalg.norm(embs_256, axis=1, keepdims=True)
```

## Training

### Method
- **Loss**: MultipleNegativesRankingLoss + MatryoshkaLoss
- **Training data**: 22,609 (anchor, positive) pairs generated from a knowledge graph of mathematical concepts
  - **Direct pairs**: concept name/description → chunks from that concept's source papers
  - **Edge pairs**: cross-concept pairs from knowledge graph edges (e.g., "generalizes", "extends")
- **Base model**: `allenai/specter2_base` (SciBERT pre-trained on 6M citation triplets)

### Configuration
- Epochs: 3
- Batch size: 8 (effective 32 with gradient accumulation)
- Learning rate: 2e-5
- Max sequence length: 256 tokens
- Matryoshka dims: [768, 512, 256, 128]
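
The loss combination above can be illustrated in NumPy. This is a simplified sketch of the underlying math, not the sentence-transformers implementation: MultipleNegativesRankingLoss is an in-batch cross-entropy over pairwise similarities (other positives in the batch act as negatives), and MatryoshkaLoss averages that loss over truncated embedding prefixes.

```python
import numpy as np

def mnr_loss(anchors, positives, scale=20.0):
    """In-batch cross-entropy: anchor i should score highest with positive i;
    the other positives in the batch serve as negatives."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    sims = scale * (a @ p.T)                       # (batch, batch) scaled cosine similarities
    sims -= sims.max(axis=1, keepdims=True)        # numerical stability
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))            # diagonal entries are the true pairs

def matryoshka_mnr_loss(anchors, positives, dims=(768, 512, 256, 128)):
    """Average the ranking loss over truncated embedding prefixes."""
    return np.mean([mnr_loss(anchors[:, :d], positives[:, :d]) for d in dims])
```

In sentence-transformers, the same composition is expressed by wrapping `MultipleNegativesRankingLoss` in `MatryoshkaLoss` with `matryoshka_dims=[768, 512, 256, 128]`.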

### Model lineage
```
BERT (Google, 110M params)
  └─ SciBERT (Allen AI, pretrained from scratch on scientific text)
      └─ SPECTER2 base (Allen AI, + 6M citation triplets)
          └─ math-embed (this model, + KG-derived concept-chunk pairs)
```

## Approach

The knowledge graph was constructed by an LLM (GPT-4o-mini) from 75 mathematical research papers, identifying 559 concepts and 486 relationships. This graph provides structured ground truth: each concept maps to specific papers, and those papers' chunks serve as positive training examples.

This is a form of **knowledge distillation**: a large language model's understanding of mathematical relationships is distilled into a small, fast embedding model suitable for retrieval.
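
The pair-generation step can be sketched as follows. The dictionary schema, sample entries, and `build_pairs` helper here are hypothetical illustrations, not the actual pipeline's data format:

```python
# Hypothetical knowledge-graph fragment; the real graph has 559 concepts
concepts = {
    "Kostka polynomials": {
        "description": "q-analogues of Kostka numbers.",
        "papers": ["paper_17"],
    },
    "Hall-Littlewood polynomials": {
        "description": "Symmetric functions interpolating between monomial and Schur bases.",
        "papers": ["paper_03"],
    },
}
edges = [("Kostka polynomials", "generalizes", "Hall-Littlewood polynomials")]
chunks = {
    "paper_17": ["We study the combinatorial properties of Kostka numbers..."],
    "paper_03": ["Hall-Littlewood polynomials form a basis..."],
}

def build_pairs(concepts, edges, chunks):
    """Generate (anchor, positive) training pairs from the knowledge graph."""
    pairs = []
    # Direct pairs: concept name/description -> chunks of its source papers
    for name, info in concepts.items():
        anchor = f"{name}: {info['description']}"
        for paper_id in info["papers"]:
            pairs.extend((anchor, chunk) for chunk in chunks[paper_id])
    # Edge pairs: chunks of related concepts serve as positives for each other
    for src, _relation, dst in edges:
        for ps in concepts[src]["papers"]:
            for pd in concepts[dst]["papers"]:
                pairs.extend((cs, cd) for cs in chunks[ps] for cd in chunks[pd])
    return pairs

pairs = build_pairs(concepts, edges, chunks)
```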

## Limitations

- Trained specifically on combinatorics papers (symmetric functions, representation theory, partition identities, algebraic combinatorics)
- May not generalize well to other areas of mathematics or other scientific domains without additional fine-tuning
- Maximum sequence length of 256 tokens; longer inputs are truncated

## Citation

See the accompanying paper: [*Knowledge-Graph-Guided Fine-Tuning of Embedding Models for Mathematical Document Retrieval*](https://huggingface.co/RobBobin/math-embed/blob/main/paper/math_embeddings.pdf)