| \documentclass[11pt,a4paper]{article} |
|
|
| \usepackage[utf8]{inputenc} |
| \usepackage[T1]{fontenc} |
| \usepackage{amsmath,amssymb,amsthm} |
| \usepackage{booktabs} |
| \usepackage{graphicx} |
| \usepackage{hyperref} |
| \usepackage[margin=1in]{geometry} |
| \usepackage{enumitem} |
| \usepackage{xcolor} |
| \usepackage{algorithm} |
| \usepackage{algpseudocode} |
|
|
| \hypersetup{ |
| colorlinks=true, |
| linkcolor=blue!70!black, |
| citecolor=green!50!black, |
| urlcolor=blue!70!black, |
| } |
|
|
| \title{Knowledge-Graph-Guided Fine-Tuning of Embedding Models\\ |
| for Mathematical Document Retrieval} |
| \author{Robin Langer\thanks{The author thanks Claude (Anthropic) for assistance with code development and manuscript preparation.}} |
| \date{} |
|
|
| \begin{document} |
|
|
| \maketitle |
|
|
| \begin{abstract} |
| We present a method for improving semantic search over mathematical research |
| papers by fine-tuning embedding models using contrastive learning, guided by |
| a knowledge graph extracted from the corpus. General-purpose embedding models |
| (e.g., OpenAI's \texttt{text-embedding-3-small}) and even scientific embedding |
| models (SPECTER2, SciNCL) perform poorly on mathematical retrieval tasks because |
| they lack understanding of the semantic relationships between mathematical |
| concepts. Our approach exploits an existing knowledge graph --- whose nodes are |
| mathematical concepts and whose edges encode relationships such as |
| \emph{generalizes}, \emph{proves}, and \emph{is\_instance\_of} --- to |
| automatically generate training data for contrastive fine-tuning. We benchmark |
| baseline models against our fine-tuned model on a retrieval task over 4,794 |
| paper chunks spanning 75 papers in algebraic combinatorics, and demonstrate |
| that domain-specific fine-tuning significantly outperforms all baselines. |
| The method is general: given any corpus of mathematical papers and a |
| knowledge graph over their concepts, the same pipeline produces a |
| domain-adapted embedding model. |
| \end{abstract} |
|
|
|
|
| \section{Introduction} |
|
|
| The increasing volume of mathematical literature makes automated retrieval |
| tools indispensable for researchers. A common approach is |
| \emph{retrieval-augmented generation} (RAG): chunk papers into passages, embed |
| them in a vector space, and retrieve relevant passages via nearest-neighbor |
| search over embeddings. The quality of retrieval depends critically on the |
| embedding model's ability to capture \emph{mathematical semantic similarity} |
| --- the idea that a query like ``Rogers--Ramanujan identities'' should retrieve |
| not only passages containing that exact phrase but also passages discussing |
| Bailey's lemma, $q$-series transformations, and partition identities. |
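
The nearest-neighbor step in such a pipeline is conceptually simple; the
following is a minimal pure-Python sketch of cosine-similarity retrieval
(illustrative only --- a production system would use a vector database):

```python
import math

def cosine(u, v):
    # Cosine similarity between two dense vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def retrieve(query_vec, chunk_vecs, k=3):
    # Return indices of the k chunks most similar to the query embedding.
    ranked = sorted(range(len(chunk_vecs)),
                    key=lambda i: cosine(query_vec, chunk_vecs[i]),
                    reverse=True)
    return ranked[:k]
```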
|
|
| General-purpose embedding models are trained on broad web text and lack this |
| kind of domain knowledge. Scientific embedding models such as SPECTER2 |
| \cite{specter2} and SciNCL \cite{scincl} are trained on citation graphs from |
| Semantic Scholar, but mathematics is underrepresented in their training data, |
| and they are optimized for \emph{paper-to-paper} similarity rather than |
| \emph{concept-to-passage} retrieval. |
|
|
| We address this gap by fine-tuning an embedding model specifically for |
| mathematical concept retrieval. Our key insight is that a \textbf{knowledge |
| graph} (KG) extracted from the corpus provides exactly the supervision signal |
| needed for contrastive learning: |
| \begin{itemize}[nosep] |
| \item Each KG concept (e.g., ``Macdonald polynomials'') maps to specific |
| papers, and hence to specific text chunks. These form |
| \emph{positive pairs} for contrastive training. |
| \item KG edges (e.g., ``Bailey's lemma \emph{generalizes} |
| Rogers--Ramanujan identities'') provide \emph{cross-concept |
| positives} that teach the model about mathematical relationships. |
| \item In-batch negatives from unrelated concepts provide the contrastive |
| signal automatically. |
| \end{itemize} |
|
|
| This paper makes the following contributions: |
| \begin{enumerate}[nosep] |
| \item A benchmark comparing general-purpose and scientific embedding |
| models on mathematical concept retrieval (Section~\ref{sec:benchmark}). |
| \item A method for automatically generating contrastive training data from |
| a knowledge graph (Section~\ref{sec:training-data}). |
| \item A fine-tuned embedding model that outperforms all baselines on our |
| benchmark (Section~\ref{sec:finetuning}). |
| \item An open-source pipeline\footnote{Code available at |
| \url{https://github.com/RaggedR/embeddings}. Model available at |
| \url{https://huggingface.co/RobBobin/math-embed}.} that can be applied to any |
| mathematical corpus with an associated knowledge graph. |
| \end{enumerate} |
|
|
|
|
| \section{Related Work} |
|
|
| \paragraph{Scientific document embeddings.} |
| SPECTER \cite{specter} introduced citation-based contrastive learning for |
| scientific document embeddings, training on (paper, cited paper, non-cited |
| paper) triplets. SPECTER2 \cite{specter2} extended this to 6 million citation |
| triplets across 23 fields of study and introduced task-specific adapters |
| (proximity, classification, regression). SciNCL \cite{scincl} improved on |
| SPECTER by using citation graph \emph{neighborhood} sampling for harder |
| negatives. All three models use SciBERT \cite{scibert} as their backbone and |
| produce 768-dimensional embeddings. |
|
|
| \paragraph{Mathematics-specific models.} |
| MathBERT \cite{mathbert} pre-trained BERT on mathematical curricula and arXiv |
| abstracts, but only with masked language modeling --- it was not contrastively |
trained for retrieval. To our knowledge, no widely adopted embedding model is
trained specifically for mathematical semantic similarity.
|
|
| \paragraph{Contrastive fine-tuning.} |
| The sentence-transformers framework \cite{sbert} provides |
| \texttt{MultipleNegativesRankingLoss} (MNRL), which treats all other examples |
| in a batch as negatives. Matryoshka Representation Learning \cite{matryoshka} |
| trains embeddings so that any prefix of the full vector is itself a useful |
| embedding, enabling flexible dimensionality--quality tradeoffs at inference. |
|
|
|
|
| \section{Data} |
| \label{sec:data} |
|
|
| \subsection{Corpus} |
|
|
| Our corpus consists of 75 research papers in algebraic combinatorics, |
| $q$-series, and related areas, sourced from arXiv. Papers are chunked into |
| passages of up to 1,500 characters with 200-character overlap, yielding |
| \textbf{4,794 chunks}. The chunks are stored in a ChromaDB vector database |
| with embeddings from OpenAI's \texttt{text-embedding-3-small} (1536-dim). |
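
The sliding-window chunking can be sketched as follows (a simplification of
the actual pipeline, which may additionally respect sentence or paragraph
boundaries):

```python
def chunk_text(text, size=1500, overlap=200):
    # Slide a window of `size` characters over the text, carrying
    # `overlap` characters between consecutive chunks.
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + size]
        if chunk:
            chunks.append(chunk)
        if start + size >= len(text):
            break
    return chunks
```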
|
|
| \subsection{Knowledge graph} |
|
|
| A knowledge graph was constructed by having GPT-4o-mini extract concepts and |
| relationships from representative chunks (first two and last two) of each |
| paper \cite{kg-extraction}. After normalization and deduplication, the KG |
| contains: |
| \begin{itemize}[nosep] |
| \item \textbf{559 concepts} (218 objects, 92 theorems, 77 definitions, |
| 56 techniques, 28 persons, 26 formulas, 25 identities, 11 |
| conjectures, and others) |
| \item \textbf{486 edges} with typed relationships (\emph{related\_to}: |
| 110, \emph{uses}: 78, \emph{generalizes}: 54, |
| \emph{is\_instance\_of}: 45, \emph{implies}: 40, \emph{defines}: 39, |
| and others) |
| \item Coverage of all 75 papers |
| \end{itemize} |
|
|
|
|
| \section{Benchmark} |
| \label{sec:benchmark} |
|
|
| \subsection{Ground truth construction} |
|
|
| We construct a retrieval benchmark from the KG. For each concept $c$ with at |
| least $\text{min\_degree} = 2$ matched papers in the corpus: |
| \begin{itemize}[nosep] |
| \item \textbf{Query}: the concept's display name (e.g., ``Rogers--Ramanujan |
| identities'') |
| \item \textbf{Relevant documents}: all chunks from the concept's source |
| papers |
| \end{itemize} |
|
|
This yields \textbf{108 queries}. The ground truth is approximate --- not
every chunk in a relevant paper directly discusses the concept --- but the
approximation affects all models equally, so relative comparisons remain valid.
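
Concretely, the construction is a join between KG concepts and the chunk
index; a sketch assuming plain-dict representations (the field names are
illustrative, not our exact schema):

```python
def build_benchmark(concepts, chunks_by_paper, min_degree=2):
    # concepts: {concept_name: [paper_ids]}
    # chunks_by_paper: {paper_id: [chunk_ids]}
    # A concept becomes a query if at least `min_degree` of its papers
    # appear in the corpus; its relevant set is every chunk of those papers.
    benchmark = {}
    for name, papers in concepts.items():
        matched = [p for p in papers if p in chunks_by_paper]
        if len(matched) >= min_degree:
            relevant = {cid for p in matched for cid in chunks_by_paper[p]}
            benchmark[name] = relevant
    return benchmark
```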
|
|
| \subsection{Metrics} |
|
|
| We report: |
| \begin{itemize}[nosep] |
| \item \textbf{MRR} (Mean Reciprocal Rank): the average inverse rank of the |
| first relevant result. |
| \item \textbf{NDCG@$k$} (Normalized Discounted Cumulative Gain): measures |
| ranking quality with position-dependent discounting. |
| \item \textbf{Recall@$k$}: fraction of relevant documents retrieved in the |
| top $k$. Note that Recall@$k$ appears low because relevant sets are |
| large (often 100+ chunks per concept); MRR and NDCG are the |
| meaningful comparison metrics. |
| \end{itemize} |
|
|
| All metrics are computed using a Rust implementation with rayon parallelism |
| for batch kNN and metric aggregation \cite{rust-metrics}. |
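
For reference, single-query versions of the three metrics in Python (the
benchmark itself uses the cited Rust implementation; binary relevance is
assumed, matching our ground truth):

```python
import math

def reciprocal_rank(ranked, relevant):
    # 1 / rank of the first relevant document; 0 if none retrieved.
    for i, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / i
    return 0.0

def recall_at_k(ranked, relevant, k):
    # Fraction of all relevant documents appearing in the top k.
    return len(set(ranked[:k]) & relevant) / len(relevant)

def ndcg_at_k(ranked, relevant, k):
    # Binary-relevance NDCG with the usual log2 position discount.
    dcg = sum(1.0 / math.log2(i + 1)
              for i, doc in enumerate(ranked[:k], start=1) if doc in relevant)
    ideal = sum(1.0 / math.log2(i + 1)
                for i in range(1, min(k, len(relevant)) + 1))
    return dcg / ideal if ideal > 0 else 0.0
```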
|
|
| \subsection{Baseline results} |
|
|
| \begin{table}[h] |
| \centering |
| \caption{Baseline embedding model comparison on mathematical concept retrieval. |
| All models evaluated on 108 queries over 4,794 chunks.} |
| \label{tab:baselines} |
| \begin{tabular}{lcccccc} |
| \toprule |
| Model & Dim & R@5 & R@10 & R@20 & MRR & NDCG@10 \\ |
| \midrule |
| \texttt{openai-small} & 1536 & 0.010 & 0.019 & 0.037 & \textbf{0.461} & \textbf{0.324} \\ |
| SPECTER2 (proximity) & 768 & 0.007 & 0.013 & 0.024 & 0.360 & 0.225 \\ |
| SciNCL & 768 & 0.006 & 0.012 & 0.024 & 0.306 & 0.205 \\ |
| \bottomrule |
| \end{tabular} |
| \end{table} |
|
|
| The general-purpose OpenAI model outperforms both scientific models by a wide |
| margin (28\% higher MRR than SPECTER2, 51\% higher than SciNCL). This is |
| notable because SPECTER2 was trained on 6 million scientific citation triplets |
| --- yet it underperforms a model with no scientific specialization. We |
| attribute this to two factors: |
| \begin{enumerate}[nosep] |
| \item \textbf{Dimensionality}: OpenAI's 1536-dim space has more capacity |
| than the 768-dim BERT-based models. |
| \item \textbf{Task mismatch}: SPECTER2 and SciNCL were trained for |
| paper-to-paper similarity (title + abstract), not concept-to-chunk |
| retrieval. A query like ``Rogers--Ramanujan identities'' is not a |
| paper title --- it is a mathematical concept name, and retrieving |
| relevant passages requires understanding what that concept means. |
| \end{enumerate} |
|
|
|
|
| \section{Training Data from Knowledge Graphs} |
| \label{sec:training-data} |
|
|
| We generate contrastive training data automatically from the KG and corpus. |
|
|
| \subsection{Direct pairs} |
|
|
| For each concept $c$ with papers $P_1, \ldots, P_m$ in the KG, and each |
| paper $P_j$ with chunks $\{d_{j,1}, \ldots, d_{j,n_j}\}$ in the corpus: |
| \begin{align} |
| \text{Pairs}_{\text{name}}(c) &= \{(\texttt{name}(c),\; d_{j,k}) : |
| j \in [m],\; k \in [n_j]\} \\ |
| \text{Pairs}_{\text{desc}}(c) &= \{(\texttt{desc}(c),\; d_{j,k}) : |
| j \in [m],\; k \in [n_j]\} |
| \end{align} |
|
|
| Using both the concept name and its description as anchors provides anchor |
| diversity: short anchors (e.g., ``Macdonald polynomials'') train exact-match |
| retrieval, while longer descriptions (e.g., ``A family of orthogonal |
| symmetric polynomials generalizing Schur functions'') train paraphrase |
| retrieval. |
|
|
| We cap at 20 chunks per concept to prevent over-representation of |
| high-degree concepts. |
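
A sketch of direct-pair generation with the per-concept cap (the dict fields
are hypothetical stand-ins for our actual KG records):

```python
import itertools

def direct_pairs(concept, chunks_by_paper, cap=20):
    # concept: dict with "name", "description", "papers".
    # Emit (anchor, chunk) pairs for both anchor variants, drawing at
    # most `cap` chunks per concept to limit high-degree concepts.
    capped = itertools.islice(
        (d for p in concept["papers"] for d in chunks_by_paper.get(p, [])),
        cap)
    pairs = []
    for d in capped:
        pairs.append((concept["name"], d))
        if concept.get("description"):
            pairs.append((concept["description"], d))
    return pairs
```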
|
|
| \subsection{Edge pairs} |
|
|
| For each edge $(c_1, c_2, r)$ in the KG with relation $r$ (e.g., |
| \emph{generalizes}, \emph{uses}): |
| \begin{equation} |
| \text{Pairs}_{\text{edge}}(c_1, c_2) = \{(\texttt{name}(c_1),\; d) : |
| d \in \text{chunks}(c_2)\} \cup \{(\texttt{name}(c_2),\; d) : |
| d \in \text{chunks}(c_1)\} |
| \end{equation} |
|
|
| These cross-concept pairs teach the model that mathematically related concepts |
| should embed nearby. For example, if ``Bailey's lemma'' \emph{generalizes} |
| ``Rogers--Ramanujan identities,'' then chunks about Rogers--Ramanujan should |
| be somewhat relevant to queries about Bailey's lemma. |
|
|
| We cap at 5 chunks per edge direction to prevent edge pairs from dominating |
| the dataset. |
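
Edge-pair generation mirrors the equation above, with the per-direction cap
(again with illustrative data structures):

```python
def edge_pairs(c1, c2, chunks_of, cap=5):
    # c1, c2: concept names; chunks_of: {concept_name: [chunk_ids]}.
    # Pair each endpoint's name with up to `cap` chunks of the *other*
    # endpoint, in both directions.
    pairs = [(c1, d) for d in chunks_of.get(c2, [])[:cap]]
    pairs += [(c2, d) for d in chunks_of.get(c1, [])[:cap]]
    return pairs
```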
|
|
| \subsection{Dataset statistics} |
|
|
| \begin{table}[h] |
| \centering |
| \caption{Training dataset statistics.} |
| \label{tab:dataset} |
| \begin{tabular}{lr} |
| \toprule |
| Direct pairs (concept $\to$ chunk) & 21,544 \\ |
| Edge pairs (cross-concept) & 4,855 \\ |
| Total unique pairs & 25,121 \\ |
| Training set (90\%) & 22,609 \\ |
| Validation set (10\%) & 2,512 \\ |
| Unique anchors & 1,114 \\ |
| \bottomrule |
| \end{tabular} |
| \end{table} |
|
|
|
|
| \section{Fine-Tuning} |
| \label{sec:finetuning} |
|
|
| \subsection{Method} |
|
|
| We fine-tune the SPECTER2 base model (\texttt{allenai/specter2\_base}, |
| 768-dim, SciBERT backbone) using the sentence-transformers framework |
| \cite{sbert}. Despite SPECTER2's poor off-the-shelf performance on our |
| benchmark, its pre-training on 6 million scientific citation triplets provides |
| a strong initialization for mathematical text --- the model already understands |
| scientific language structure, and we teach it mathematical concept semantics |
| on top. |
|
|
| \paragraph{Loss function.} |
| We use \texttt{MultipleNegativesRankingLoss} (MNRL) wrapped in |
| \texttt{MatryoshkaLoss}. MNRL treats all other examples in a batch as |
| negatives, providing $B(B-1)$ negative comparisons per batch of size $B$ |
| without explicit negative mining. MatryoshkaLoss computes the same contrastive |
| loss at multiple embedding truncation points (768, 512, 256, 128 dimensions), |
| training the model to frontload important information into the first |
| dimensions. |
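
To make the loss concrete, here is a toy pure-Python version of MNRL for a
batch of anchor/positive embedding pairs: for anchor $i$, document $i$ is the
positive and every other document in the batch is a negative, and the loss is
cross-entropy over scaled cosine similarities. \texttt{MatryoshkaLoss} then
simply averages this same loss over truncated prefixes of the embeddings.
(The \texttt{scale} factor mimics the library's default similarity scaling;
this sketch is for illustration, not our training code.)

```python
import math

def mnrl_loss(anchors, positives, scale=20.0):
    # anchors[i] pairs with positives[i]; all positives[j], j != i,
    # serve as in-batch negatives.
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return dot / (math.sqrt(sum(a * a for a in u)) *
                      math.sqrt(sum(b * b for b in v)))
    total = 0.0
    for i, a in enumerate(anchors):
        scores = [scale * cos(a, p) for p in positives]
        log_z = math.log(sum(math.exp(s) for s in scores))
        total += log_z - scores[i]   # -log softmax of the true pair
    return total / len(anchors)
```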
|
|
| \paragraph{Training details.} |
| \begin{itemize}[nosep] |
| \item Micro-batch size: 8, with gradient accumulation over 4 steps |
| (effective batch size 32, yielding 56 in-batch negative comparisons |
| per micro-batch) |
| \item Max sequence length: 256 tokens (truncating longer chunks) |
| \item Learning rate: $2 \times 10^{-5}$ with 10\% linear warmup |
| \item Epochs: 3 (2,118 optimization steps) |
| \item Duplicate-free batch sampling to maximize negative diversity |
| \item Final model selected after epoch 3 (training loss converged |
| from $\sim$11 to $\sim$5) |
| \item Hardware: Apple M-series GPU (MPS backend), $\sim$4 hours wall time |
| \end{itemize} |
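
In the sentence-transformers framework (v3-style API), the configuration
above corresponds roughly to the following sketch; the output path and
dataset wiring are illustrative, not our exact training script:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import MultipleNegativesRankingLoss, MatryoshkaLoss
from sentence_transformers.training_args import (
    SentenceTransformerTrainingArguments,
    BatchSamplers,
)

model = SentenceTransformer("allenai/specter2_base")
model.max_seq_length = 256  # truncate longer chunks

base_loss = MultipleNegativesRankingLoss(model)
loss = MatryoshkaLoss(model, base_loss,
                      matryoshka_dims=[768, 512, 256, 128])

args = SentenceTransformerTrainingArguments(
    output_dir="math-embed",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,   # effective batch size 32
    learning_rate=2e-5,
    warmup_ratio=0.1,                # 10% linear warmup
    num_train_epochs=3,
    batch_sampler=BatchSamplers.NO_DUPLICATES,  # duplicate-free batches
)
# Training then runs via SentenceTransformerTrainer with the pair dataset.
```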
|
|
| \subsection{Results} |
|
|
| \begin{table}[h] |
| \centering |
| \caption{Final comparison including fine-tuned model. All models evaluated |
| on 108 queries over 4,794 chunks. Best results in bold.} |
| \label{tab:final} |
| \begin{tabular}{lcccccc} |
| \toprule |
| Model & Dim & R@5 & R@10 & R@20 & MRR & NDCG@10 \\ |
| \midrule |
| \texttt{openai-small} & 1536 & 0.010 & 0.019 & 0.037 & 0.461 & 0.324 \\ |
| SPECTER2 (proximity) & 768 & 0.007 & 0.013 & 0.024 & 0.360 & 0.225 \\ |
| SciNCL & 768 & 0.006 & 0.012 & 0.024 & 0.306 & 0.205 \\ |
| \midrule |
| Math-Embed (ours) & 768 & \textbf{0.030} & \textbf{0.058} & \textbf{0.111} & \textbf{0.816} & \textbf{0.736} \\ |
| \bottomrule |
| \end{tabular} |
| \end{table} |
|
|
| Our fine-tuned model outperforms all baselines by a wide margin. |
| MRR improves from 0.461 (OpenAI) to \textbf{0.816} --- a 77\% relative |
| improvement, meaning the first relevant result now appears on average at |
| rank $\sim$1.2 rather than rank $\sim$2.2. NDCG@10 more than doubles from |
| 0.324 to 0.736, and Recall@20 triples from 0.037 to 0.111. |
|
|
| Remarkably, the fine-tuned model uses half the embedding dimensions (768 |
| vs.\ 1536) of the OpenAI model yet dramatically outperforms it. The same |
| base model (SPECTER2) that scored worst among baselines (MRR 0.360) becomes |
| the best performer after fine-tuning --- a 127\% improvement from the same |
| architecture with no additional parameters, demonstrating that the |
| knowledge-graph-derived training signal is highly effective. |
|
|
|
|
| \section{Discussion} |
|
|
| \subsection{Why general-purpose models fail at math} |
|
|
| The poor performance of SPECTER2 and SciNCL --- models explicitly trained on |
| scientific literature --- highlights that \emph{scientific} training is not |
| the same as \emph{mathematical} training. These models learn paper-level |
| similarity from citation patterns: ``paper A cites paper B, so they should |
| embed nearby.'' But mathematical retrieval requires a different kind of |
| similarity: understanding that the text ``$\sum_{n=0}^{\infty} |
| \frac{q^{n^2}}{(q;q)_n}$'' is about the Rogers--Ramanujan identities, even |
| though it contains no occurrence of that phrase. |
|
|
| Standard tokenizers (BERT WordPiece) fragment mathematical notation into |
| meaningless subwords. Fine-tuning cannot fix the tokenizer, but it can teach |
| the model that certain patterns of subword tokens, when they appear together, |
| carry specific mathematical meaning. |
|
|
| \subsection{Knowledge graphs as supervision} |
|
|
| Our approach requires a knowledge graph, which itself requires an LLM |
| extraction step (GPT-4o-mini in our case). This may seem circular --- we use |
| an LLM to generate training data for a different model. But the key insight is |
| that these are \emph{complementary capabilities}: |
| \begin{itemize}[nosep] |
| \item The LLM excels at \emph{reading individual passages} and extracting |
| structured information (concepts, relationships), but is too slow |
| and expensive for real-time retrieval over thousands of chunks. |
| \item The embedding model excels at \emph{fast similarity search} over |
| large corpora, but needs training data to learn domain-specific |
| semantics. |
| \end{itemize} |
|
|
| The KG is a one-time cost that distills the LLM's understanding into a |
| reusable supervision signal. |
|
|
| \subsection{Generalizability} |
|
|
| The pipeline is not specific to algebraic combinatorics. Given: |
| \begin{enumerate}[nosep] |
| \item A corpus of mathematical papers (any subfield) |
| \item A knowledge graph over their concepts (extractable by LLM) |
| \end{enumerate} |
| the same code produces a domain-adapted embedding model. The fine-tuned model |
| should generalize to new papers in the same mathematical area, since it learns |
| \emph{concept semantics} rather than memorizing specific passages. |
|
|
|
|
| \section{Conclusion} |
|
|
| We demonstrated that general-purpose and scientific embedding models perform |
| poorly on mathematical concept retrieval, and presented a pipeline that |
| automatically generates contrastive training data from a knowledge graph to |
| fine-tune a domain-specific embedding model. Our approach requires no manual |
| annotation --- the knowledge graph provides the supervision signal --- and |
| produces a portable model that can be deployed in any RAG system for |
| mathematical literature. |
|
|
| Future work includes: (1) scaling to larger mathematical corpora spanning |
| multiple subfields, (2) incorporating mathematical notation awareness into |
| the tokenizer, and (3) exploring whether the fine-tuned model's understanding |
| of mathematical relationships transfers across subfields. |
|
|
|
|
| \begin{thebibliography}{10} |
|
|
| \bibitem{specter} |
| A.~Cohan, S.~Feldman, I.~Beltagy, D.~Downey, and D.~S.~Weld, |
| ``SPECTER: Document-level representation learning using citation-informed |
| transformers,'' in \emph{Proc.\ ACL}, 2020. |
|
|
| \bibitem{specter2} |
| A.~Singh, M.~D'Arcy, A.~Cohan, D.~Downey, and S.~Feldman, |
| ``SciRepEval: A multi-format benchmark for scientific document |
| representations,'' in \emph{Proc.\ EMNLP}, 2023. |
|
|
| \bibitem{scincl} |
| M.~Ostendorff, N.~Rethmeier, I.~Augenstein, B.~Gipp, and G.~Rehm, |
| ``Neighborhood contrastive learning for scientific document |
| representations with citation embeddings,'' in \emph{Proc.\ EMNLP}, 2022. |
|
|
| \bibitem{scibert} |
| I.~Beltagy, K.~Lo, and A.~Cohan, |
| ``SciBERT: A pretrained language model for scientific text,'' in |
| \emph{Proc.\ EMNLP}, 2019. |
|
|
| \bibitem{mathbert} |
| S.~Peng, K.~Yuan, L.~Gao, and Z.~Tang, |
| ``MathBERT: A pre-trained model for mathematical formula understanding,'' |
| \emph{arXiv:2105.00377}, 2021. |
|
|
| \bibitem{sbert} |
| N.~Reimers and I.~Gurevych, |
| ``Sentence-BERT: Sentence embeddings using Siamese BERT-networks,'' in |
| \emph{Proc.\ EMNLP}, 2019. |
|
|
| \bibitem{matryoshka} |
| A.~Kusupati, G.~Bhatt, A.~Rege, M.~Wallingford, A.~Sinha, V.~Ramanujan, |
| W.~Howard-Snyder, K.~Chen, S.~Kakade, P.~Jain, and A.~Farhadi, |
| ``Matryoshka representation learning,'' in \emph{Proc.\ NeurIPS}, 2022. |
|
|
| \bibitem{kg-extraction} |
| Knowledge graph extraction via LLM-based concept and relationship |
| identification from scientific text, internal methodology. |
|
|
| \bibitem{rust-metrics} |
| Custom Rust implementation of batch kNN and IR metrics (Recall@$k$, MRR, |
| NDCG@$k$) with rayon parallelism and PyO3 Python bindings. |
|
|
| \end{thebibliography} |
|
|
| \end{document} |
|
|