\documentclass[11pt,a4paper]{article}
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\usepackage{amsmath,amssymb,amsthm}
\usepackage{booktabs}
\usepackage{graphicx}
\usepackage{hyperref}
\usepackage[margin=1in]{geometry}
\usepackage{enumitem}
\usepackage{xcolor}
\usepackage{algorithm}
\usepackage{algpseudocode}
\hypersetup{
colorlinks=true,
linkcolor=blue!70!black,
citecolor=green!50!black,
urlcolor=blue!70!black,
}
\title{Knowledge-Graph-Guided Fine-Tuning of Embedding Models\\
for Mathematical Document Retrieval}
\author{Robin Langer\thanks{The author thanks Claude (Anthropic) for assistance with code development and manuscript preparation.}}
\date{}
\begin{document}
\maketitle
\begin{abstract}
We present a method for improving semantic search over mathematical research
papers by fine-tuning embedding models using contrastive learning, guided by
a knowledge graph extracted from the corpus. General-purpose embedding models
(e.g., OpenAI's \texttt{text-embedding-3-small}) and even scientific embedding
models (SPECTER2, SciNCL) perform poorly on mathematical retrieval tasks because
they lack understanding of the semantic relationships between mathematical
concepts. Our approach exploits an existing knowledge graph --- whose nodes are
mathematical concepts and whose edges encode relationships such as
\emph{generalizes}, \emph{proves}, and \emph{is\_instance\_of} --- to
automatically generate training data for contrastive fine-tuning. We benchmark
baseline models against our fine-tuned model on a retrieval task over 4,794
paper chunks spanning 75 papers in algebraic combinatorics, and demonstrate
that domain-specific fine-tuning significantly outperforms all baselines.
The method is general: given any corpus of mathematical papers and a
knowledge graph over their concepts, the same pipeline produces a
domain-adapted embedding model.
\end{abstract}
\section{Introduction}
The increasing volume of mathematical literature makes automated retrieval
tools indispensable for researchers. A common approach is
\emph{retrieval-augmented generation} (RAG): chunk papers into passages, embed
them in a vector space, and retrieve relevant passages via nearest-neighbor
search over embeddings. The quality of retrieval depends critically on the
embedding model's ability to capture \emph{mathematical semantic similarity}
--- the idea that a query like ``Rogers--Ramanujan identities'' should retrieve
not only passages containing that exact phrase but also passages discussing
Bailey's lemma, $q$-series transformations, and partition identities.
General-purpose embedding models are trained on broad web text and lack this
kind of domain knowledge. Scientific embedding models such as SPECTER2
\cite{specter2} and SciNCL \cite{scincl} are trained on citation graphs from
Semantic Scholar, but mathematics is underrepresented in their training data,
and they are optimized for \emph{paper-to-paper} similarity rather than
\emph{concept-to-passage} retrieval.
We address this gap by fine-tuning an embedding model specifically for
mathematical concept retrieval. Our key insight is that a \textbf{knowledge
graph} (KG) extracted from the corpus provides exactly the supervision signal
needed for contrastive learning:
\begin{itemize}[nosep]
\item Each KG concept (e.g., ``Macdonald polynomials'') maps to specific
papers, and hence to specific text chunks. These form
\emph{positive pairs} for contrastive training.
\item KG edges (e.g., ``Bailey's lemma \emph{generalizes}
Rogers--Ramanujan identities'') provide \emph{cross-concept
positives} that teach the model about mathematical relationships.
\item In-batch negatives from unrelated concepts provide the contrastive
signal automatically.
\end{itemize}
This paper makes the following contributions:
\begin{enumerate}[nosep]
\item A benchmark comparing general-purpose and scientific embedding
models on mathematical concept retrieval (Section~\ref{sec:benchmark}).
\item A method for automatically generating contrastive training data from
a knowledge graph (Section~\ref{sec:training-data}).
\item A fine-tuned embedding model that outperforms all baselines on our
benchmark (Section~\ref{sec:finetuning}).
\item An open-source pipeline\footnote{Code available at
\url{https://github.com/RaggedR/embeddings}. Model available at
\url{https://huggingface.co/RobBobin/math-embed}.} that can be applied to any
mathematical corpus with an associated knowledge graph.
\end{enumerate}
\section{Related Work}
\paragraph{Scientific document embeddings.}
SPECTER \cite{specter} introduced citation-based contrastive learning for
scientific document embeddings, training on (paper, cited paper, non-cited
paper) triplets. SPECTER2 \cite{specter2} extended this to 6 million citation
triplets across 23 fields of study and introduced task-specific adapters
(proximity, classification, regression). SciNCL \cite{scincl} improved on
SPECTER by using citation graph \emph{neighborhood} sampling for harder
negatives. All three models use SciBERT \cite{scibert} as their backbone and
produce 768-dimensional embeddings.
\paragraph{Mathematics-specific models.}
MathBERT \cite{mathbert} pre-trained BERT on mathematical curricula and arXiv
abstracts, but only with masked language modeling --- it was not contrastively
trained for retrieval. To our knowledge, no widely adopted embedding model is
trained specifically for mathematical semantic similarity.
\paragraph{Contrastive fine-tuning.}
The sentence-transformers framework \cite{sbert} provides
\texttt{MultipleNegativesRankingLoss} (MNRL), which treats all other examples
in a batch as negatives. Matryoshka Representation Learning \cite{matryoshka}
trains embeddings so that any prefix of the full vector is itself a useful
embedding, enabling flexible dimensionality--quality tradeoffs at inference.
\section{Data}
\label{sec:data}
\subsection{Corpus}
Our corpus consists of 75 research papers in algebraic combinatorics,
$q$-series, and related areas, sourced from arXiv. Papers are chunked into
passages of up to 1,500 characters with 200-character overlap, yielding
\textbf{4,794 chunks}. The chunks are stored in a ChromaDB vector database
with embeddings from OpenAI's \texttt{text-embedding-3-small} (1536-dim).
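The character-window chunking scheme can be sketched as follows (a minimal illustration of the stated parameters; the actual pipeline's handling of document boundaries may differ, and the function name is ours):

```python
def chunk_text(text, max_chars=1500, overlap=200):
    """Split text into passages of at most max_chars characters, with the
    final `overlap` characters of each passage repeated at the next one."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap  # step back to create the overlap
    return chunks
```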
\subsection{Knowledge graph}
A knowledge graph was constructed by having GPT-4o-mini extract concepts and
relationships from representative chunks (first two and last two) of each
paper \cite{kg-extraction}. After normalization and deduplication, the KG
contains:
\begin{itemize}[nosep]
\item \textbf{559 concepts} (218 objects, 92 theorems, 77 definitions,
56 techniques, 28 persons, 26 formulas, 25 identities, 11
conjectures, and others)
\item \textbf{486 edges} with typed relationships (\emph{related\_to}:
110, \emph{uses}: 78, \emph{generalizes}: 54,
\emph{is\_instance\_of}: 45, \emph{implies}: 40, \emph{defines}: 39,
and others)
\item Coverage of all 75 papers
\end{itemize}
\section{Benchmark}
\label{sec:benchmark}
\subsection{Ground truth construction}
We construct a retrieval benchmark from the KG. For each concept $c$ with at
least $\text{min\_degree} = 2$ matched papers in the corpus:
\begin{itemize}[nosep]
\item \textbf{Query}: the concept's display name (e.g., ``Rogers--Ramanujan
identities'')
\item \textbf{Relevant documents}: all chunks from the concept's source
papers
\end{itemize}
This yields \textbf{108 queries}. The ground truth is approximate --- not
every chunk in a relevant paper directly discusses the concept --- but this
bias is consistent across models, making relative comparisons valid.
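The construction amounts to a filter-and-union over KG paper matches; a sketch (the dictionary shapes are illustrative, not the pipeline's actual schema):

```python
def build_benchmark(concepts, concept_papers, paper_chunks, min_degree=2):
    """Queries are concept display names; the relevant set for a query is
    every chunk of the concept's matched papers (min_degree filter applied)."""
    benchmark = []
    for cid, name in concepts.items():
        papers = concept_papers.get(cid, [])
        if len(papers) < min_degree:
            continue  # skip concepts matched to fewer than min_degree papers
        relevant = {ch for p in papers for ch in paper_chunks.get(p, [])}
        benchmark.append({"query": name, "relevant": relevant})
    return benchmark
```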
\subsection{Metrics}
We report:
\begin{itemize}[nosep]
\item \textbf{MRR} (Mean Reciprocal Rank): the average inverse rank of the
first relevant result.
\item \textbf{NDCG@$k$} (Normalized Discounted Cumulative Gain): measures
ranking quality with position-dependent discounting.
\item \textbf{Recall@$k$}: fraction of relevant documents retrieved in the
top $k$. Note that Recall@$k$ appears low because relevant sets are
large (often 100+ chunks per concept); MRR and NDCG are the
meaningful comparison metrics.
\end{itemize}
All metrics are computed using a Rust implementation with rayon parallelism
for batch kNN and metric aggregation \cite{rust-metrics}.
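For reference, the three metric definitions can be sketched in a few lines of plain Python (the benchmark itself uses the parallel Rust implementation cited above; binary relevance is assumed):

```python
import math

def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant set appearing in the top-k results."""
    return sum(1 for d in ranked[:k] if d in relevant) / len(relevant)

def reciprocal_rank(ranked, relevant):
    """Inverse rank of the first relevant result (0 if none retrieved)."""
    for i, d in enumerate(ranked, start=1):
        if d in relevant:
            return 1.0 / i
    return 0.0

def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance NDCG: DCG of the ranking over DCG of an ideal one."""
    dcg = sum(1.0 / math.log2(i + 1)
              for i, d in enumerate(ranked[:k], start=1) if d in relevant)
    ideal = sum(1.0 / math.log2(i + 1)
                for i in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal > 0 else 0.0
```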
\subsection{Baseline results}
\begin{table}[h]
\centering
\caption{Baseline embedding model comparison on mathematical concept retrieval.
All models evaluated on 108 queries over 4,794 chunks.}
\label{tab:baselines}
\begin{tabular}{lcccccc}
\toprule
Model & Dim & R@5 & R@10 & R@20 & MRR & NDCG@10 \\
\midrule
\texttt{openai-small} & 1536 & 0.010 & 0.019 & 0.037 & \textbf{0.461} & \textbf{0.324} \\
SPECTER2 (proximity) & 768 & 0.007 & 0.013 & 0.024 & 0.360 & 0.225 \\
SciNCL & 768 & 0.006 & 0.012 & 0.024 & 0.306 & 0.205 \\
\bottomrule
\end{tabular}
\end{table}
The general-purpose OpenAI model outperforms both scientific models by a wide
margin (28\% higher MRR than SPECTER2, 51\% higher than SciNCL). This is
notable because SPECTER2 was trained on 6 million scientific citation triplets
--- yet it underperforms a model with no scientific specialization. We
attribute this to two factors:
\begin{enumerate}[nosep]
\item \textbf{Dimensionality}: OpenAI's 1536-dim space has more capacity
than the 768-dim BERT-based models.
\item \textbf{Task mismatch}: SPECTER2 and SciNCL were trained for
paper-to-paper similarity (title + abstract), not concept-to-chunk
retrieval. A query like ``Rogers--Ramanujan identities'' is not a
paper title --- it is a mathematical concept name, and retrieving
relevant passages requires understanding what that concept means.
\end{enumerate}
\section{Training Data from Knowledge Graphs}
\label{sec:training-data}
We generate contrastive training data automatically from the KG and corpus.
\subsection{Direct pairs}
For each concept $c$ with papers $P_1, \ldots, P_m$ in the KG, and each
paper $P_j$ with chunks $\{d_{j,1}, \ldots, d_{j,n_j}\}$ in the corpus:
\begin{align}
\text{Pairs}_{\text{name}}(c) &= \{(\texttt{name}(c),\; d_{j,k}) :
j \in [m],\; k \in [n_j]\} \\
\text{Pairs}_{\text{desc}}(c) &= \{(\texttt{desc}(c),\; d_{j,k}) :
j \in [m],\; k \in [n_j]\}
\end{align}
Using both the concept name and its description as anchors provides anchor
diversity: short anchors (e.g., ``Macdonald polynomials'') train exact-match
retrieval, while longer descriptions (e.g., ``A family of orthogonal
symmetric polynomials generalizing Schur functions'') train paraphrase
retrieval.
We cap at 20 chunks per concept to prevent over-representation of
high-degree concepts.
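A sketch of the direct-pair generation, including the per-concept cap (field names are illustrative, not the pipeline's actual schema):

```python
def direct_pairs(concept, chunks_by_paper, cap=20):
    """Generate (anchor, chunk) positives for one concept, using both its
    display name and its description as anchors, capped per concept."""
    chunks = [d for p in concept["papers"]
              for d in chunks_by_paper.get(p, [])]
    chunks = chunks[:cap]  # cap chunks per concept (20 in our setup)
    pairs = [(concept["name"], d) for d in chunks]
    if concept.get("description"):  # description anchors add paraphrase signal
        pairs += [(concept["description"], d) for d in chunks]
    return pairs
```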
\subsection{Edge pairs}
For each edge $(c_1, c_2, r)$ in the KG with relation $r$ (e.g.,
\emph{generalizes}, \emph{uses}):
\begin{equation}
\text{Pairs}_{\text{edge}}(c_1, c_2) = \{(\texttt{name}(c_1),\; d) :
d \in \text{chunks}(c_2)\} \cup \{(\texttt{name}(c_2),\; d) :
d \in \text{chunks}(c_1)\}
\end{equation}
These cross-concept pairs teach the model that mathematically related concepts
should embed nearby. For example, if ``Bailey's lemma'' \emph{generalizes}
``Rogers--Ramanujan identities,'' then chunks about Rogers--Ramanujan should
be somewhat relevant to queries about Bailey's lemma.
We cap at 5 chunks per edge direction to prevent edge pairs from dominating
the dataset.
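Edge-pair generation mirrors the equation above, with the per-direction cap (again with illustrative data shapes):

```python
def edge_pairs(c1, c2, chunks_by_concept, cap=5):
    """Cross-concept positives for a KG edge (c1, c2): each concept's name
    is paired with the *other* concept's chunks, capped per direction."""
    pairs = [(c1["name"], d) for d in chunks_by_concept[c2["id"]][:cap]]
    pairs += [(c2["name"], d) for d in chunks_by_concept[c1["id"]][:cap]]
    return pairs
```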
\subsection{Dataset statistics}
\begin{table}[h]
\centering
\caption{Training dataset statistics.}
\label{tab:dataset}
\begin{tabular}{lr}
\toprule
Direct pairs (concept $\to$ chunk) & 21,544 \\
Edge pairs (cross-concept) & 4,855 \\
Total unique pairs & 25,121 \\
Training set (90\%) & 22,609 \\
Validation set (10\%) & 2,512 \\
Unique anchors & 1,114 \\
\bottomrule
\end{tabular}
\end{table}
\section{Fine-Tuning}
\label{sec:finetuning}
\subsection{Method}
We fine-tune the SPECTER2 base model (\texttt{allenai/specter2\_base},
768-dim, SciBERT backbone) using the sentence-transformers framework
\cite{sbert}. Despite SPECTER2's poor off-the-shelf performance on our
benchmark, its pre-training on 6 million scientific citation triplets provides
a strong initialization for mathematical text --- the model already understands
scientific language structure, and we teach it mathematical concept semantics
on top.
\paragraph{Loss function.}
We use \texttt{MultipleNegativesRankingLoss} (MNRL) wrapped in
\texttt{MatryoshkaLoss}. MNRL treats all other examples in a batch as
negatives, providing $B(B-1)$ negative comparisons per batch of size $B$
without explicit negative mining. MatryoshkaLoss computes the same contrastive
loss at multiple embedding truncation points (768, 512, 256, 128 dimensions),
training the model to frontload important information into the first
dimensions.
\paragraph{Training details.}
\begin{itemize}[nosep]
\item Micro-batch size: 8, with gradient accumulation over 4 steps
(effective batch size 32, yielding 56 in-batch negative comparisons
per micro-batch)
\item Max sequence length: 256 tokens (truncating longer chunks)
\item Learning rate: $2 \times 10^{-5}$ with 10\% linear warmup
\item Epochs: 3 (2,118 optimization steps)
\item Duplicate-free batch sampling to maximize negative diversity
\item Final model selected after epoch 3 (training loss converged
from $\sim$11 to $\sim$5)
\item Hardware: Apple M-series GPU (MPS backend), $\sim$4 hours wall time
\end{itemize}
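The setup above corresponds roughly to the following sketch using the sentence-transformers v3 trainer API (the output path and the two example pairs are placeholders; the real dataset has $\sim$22.6k anchor--positive rows):

```python
from datasets import Dataset
from sentence_transformers import (SentenceTransformer,
                                   SentenceTransformerTrainer,
                                   SentenceTransformerTrainingArguments)
from sentence_transformers.losses import (MultipleNegativesRankingLoss,
                                          MatryoshkaLoss)
from sentence_transformers.training_args import BatchSamplers

model = SentenceTransformer("allenai/specter2_base")

# Placeholder (anchor, positive) pairs standing in for the KG-derived dataset.
train_ds = Dataset.from_dict({
    "anchor": ["Macdonald polynomials", "Bailey's lemma"],
    "positive": ["...chunk text about Macdonald polynomials...",
                 "...chunk text about Bailey's lemma..."],
})

# MNRL provides in-batch negatives; MatryoshkaLoss applies it at truncations.
base_loss = MultipleNegativesRankingLoss(model)
loss = MatryoshkaLoss(model, base_loss, matryoshka_dims=[768, 512, 256, 128])

args = SentenceTransformerTrainingArguments(
    output_dir="math-embed",            # placeholder path
    num_train_epochs=3,
    per_device_train_batch_size=8,      # micro-batch size
    gradient_accumulation_steps=4,      # effective batch size 32
    learning_rate=2e-5,
    warmup_ratio=0.1,
    batch_sampler=BatchSamplers.NO_DUPLICATES,  # duplicate-free batches
)

trainer = SentenceTransformerTrainer(model=model, args=args,
                                     train_dataset=train_ds, loss=loss)
trainer.train()
```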
\subsection{Results}
\begin{table}[h]
\centering
\caption{Final comparison including fine-tuned model. All models evaluated
on 108 queries over 4,794 chunks. Best results in bold.}
\label{tab:final}
\begin{tabular}{lcccccc}
\toprule
Model & Dim & R@5 & R@10 & R@20 & MRR & NDCG@10 \\
\midrule
\texttt{openai-small} & 1536 & 0.010 & 0.019 & 0.037 & 0.461 & 0.324 \\
SPECTER2 (proximity) & 768 & 0.007 & 0.013 & 0.024 & 0.360 & 0.225 \\
SciNCL & 768 & 0.006 & 0.012 & 0.024 & 0.306 & 0.205 \\
\midrule
Math-Embed (ours) & 768 & \textbf{0.030} & \textbf{0.058} & \textbf{0.111} & \textbf{0.816} & \textbf{0.736} \\
\bottomrule
\end{tabular}
\end{table}
Our fine-tuned model outperforms all baselines by a wide margin.
MRR improves from 0.461 (OpenAI) to \textbf{0.816} --- a 77\% relative
improvement, meaning the first relevant result now appears on average at
rank $\sim$1.2 rather than rank $\sim$2.2. NDCG@10 more than doubles from
0.324 to 0.736, and Recall@20 triples from 0.037 to 0.111.
Remarkably, the fine-tuned model uses half the embedding dimensions (768
vs.\ 1536) of the OpenAI model yet dramatically outperforms it. The same
base model (SPECTER2) that scored worst among baselines (MRR 0.360) becomes
the best performer after fine-tuning --- a 127\% improvement from the same
architecture with no additional parameters, demonstrating that the
knowledge-graph-derived training signal is highly effective.
\section{Discussion}
\subsection{Why general-purpose models fail at math}
The poor performance of SPECTER2 and SciNCL --- models explicitly trained on
scientific literature --- highlights that \emph{scientific} training is not
the same as \emph{mathematical} training. These models learn paper-level
similarity from citation patterns: ``paper A cites paper B, so they should
embed nearby.'' But mathematical retrieval requires a different kind of
similarity: understanding that the text ``$\sum_{n=0}^{\infty}
\frac{q^{n^2}}{(q;q)_n}$'' is about the Rogers--Ramanujan identities, even
though it contains no occurrence of that phrase.
Standard tokenizers (BERT WordPiece) fragment mathematical notation into
meaningless subwords. Fine-tuning cannot fix the tokenizer, but it can teach
the model that certain patterns of subword tokens, when they appear together,
carry specific mathematical meaning.
\subsection{Knowledge graphs as supervision}
Our approach requires a knowledge graph, which itself requires an LLM
extraction step (GPT-4o-mini in our case). This may seem circular --- we use
an LLM to generate training data for a different model. But the key insight is
that these are \emph{complementary capabilities}:
\begin{itemize}[nosep]
\item The LLM excels at \emph{reading individual passages} and extracting
structured information (concepts, relationships), but is too slow
and expensive for real-time retrieval over thousands of chunks.
\item The embedding model excels at \emph{fast similarity search} over
large corpora, but needs training data to learn domain-specific
semantics.
\end{itemize}
The KG is a one-time cost that distills the LLM's understanding into a
reusable supervision signal.
\subsection{Generalizability}
The pipeline is not specific to algebraic combinatorics. Given:
\begin{enumerate}[nosep]
\item A corpus of mathematical papers (any subfield)
\item A knowledge graph over their concepts (extractable by LLM)
\end{enumerate}
the same code produces a domain-adapted embedding model. The fine-tuned model
should generalize to new papers in the same mathematical area, since it learns
\emph{concept semantics} rather than memorizing specific passages.
\section{Conclusion}
We demonstrated that general-purpose and scientific embedding models perform
poorly on mathematical concept retrieval, and presented a pipeline that
automatically generates contrastive training data from a knowledge graph to
fine-tune a domain-specific embedding model. Our approach requires no manual
annotation --- the knowledge graph provides the supervision signal --- and
produces a portable model that can be deployed in any RAG system for
mathematical literature.
Future work includes: (1) scaling to larger mathematical corpora spanning
multiple subfields, (2) incorporating mathematical notation awareness into
the tokenizer, and (3) exploring whether the fine-tuned model's understanding
of mathematical relationships transfers across subfields.
\begin{thebibliography}{10}
\bibitem{specter}
A.~Cohan, S.~Feldman, I.~Beltagy, D.~Downey, and D.~S.~Weld,
``SPECTER: Document-level representation learning using citation-informed
transformers,'' in \emph{Proc.\ ACL}, 2020.
\bibitem{specter2}
A.~Singh, M.~D'Arcy, A.~Cohan, D.~Downey, and S.~Feldman,
``SciRepEval: A multi-format benchmark for scientific document
representations,'' in \emph{Proc.\ EMNLP}, 2023.
\bibitem{scincl}
M.~Ostendorff, N.~Rethmeier, I.~Augenstein, B.~Gipp, and G.~Rehm,
``Neighborhood contrastive learning for scientific document
representations with citation embeddings,'' in \emph{Proc.\ EMNLP}, 2022.
\bibitem{scibert}
I.~Beltagy, K.~Lo, and A.~Cohan,
``SciBERT: A pretrained language model for scientific text,'' in
\emph{Proc.\ EMNLP}, 2019.
\bibitem{mathbert}
S.~Peng, K.~Yuan, L.~Gao, and Z.~Tang,
``MathBERT: A pre-trained model for mathematical formula understanding,''
\emph{arXiv:2105.00377}, 2021.
\bibitem{sbert}
N.~Reimers and I.~Gurevych,
``Sentence-BERT: Sentence embeddings using Siamese BERT-networks,'' in
\emph{Proc.\ EMNLP}, 2019.
\bibitem{matryoshka}
A.~Kusupati, G.~Bhatt, A.~Rege, M.~Wallingford, A.~Sinha, V.~Ramanujan,
W.~Howard-Snyder, K.~Chen, S.~Kakade, P.~Jain, and A.~Farhadi,
``Matryoshka representation learning,'' in \emph{Proc.\ NeurIPS}, 2022.
\bibitem{kg-extraction}
Knowledge graph extraction via LLM-based concept and relationship
identification from scientific text, internal methodology.
\bibitem{rust-metrics}
Custom Rust implementation of batch kNN and IR metrics (Recall@$k$, MRR,
NDCG@$k$) with rayon parallelism and PyO3 Python bindings.
\end{thebibliography}
\end{document}