| \documentclass[11pt,twocolumn]{article} |
|
|
| |
| \usepackage[utf8]{inputenc} |
| \usepackage[T1]{fontenc} |
| \usepackage{mathpazo} |
| \usepackage{amsmath,amssymb} |
| \usepackage{graphicx} |
| \usepackage{booktabs} |
| \usepackage[table]{xcolor} |
| \usepackage{hyperref} |
| \usepackage{geometry} |
| \usepackage{float} |
| \usepackage{caption} |
| \usepackage{subcaption} |
| \usepackage{enumitem} |
| \usepackage{algorithm} |
| \usepackage{algpseudocode} |
| \usepackage{fancyhdr} |
| \usepackage{microtype} |
| \usepackage{url} |
| \usepackage{natbib} |
|
|
| |
| \geometry{ |
| letterpaper, |
| top=1in, |
| bottom=1in, |
| left=0.75in, |
| right=0.75in, |
| columnsep=0.3in |
| } |
|
|
| |
| \newcommand{\cmark}{\textcolor{green!60!black}{\checkmark}} |
| \newcommand{\xmark}{\textcolor{red!70!black}{$\times$}} |
| \newcommand{\engram}{\textsc{Engram}} |
| \newcommand{\eigengram}{\textsc{Eigengram}} |
| \newcommand{\fcdb}{\textsc{FCDB}} |
|
|
| \definecolor{engblue}{HTML}{4477AA} |
| \definecolor{engorange}{HTML}{EE6677} |
| \definecolor{enggreen}{HTML}{228833} |
|
|
| |
| \pagestyle{fancy} |
| \fancyhf{} |
| \fancyhead[L]{\small\textit{\engram{} Protocol}} |
| \fancyhead[R]{\small\thepage} |
| \renewcommand{\headrulewidth}{0.4pt} |
|
|
| |
| \title{ |
| \textbf{You Don't Need Adapters:}\\ |
| \textbf{Cross-Model Document Retrieval}\\ |
| \textbf{via Intrinsic KV Cache Geometry}\\[0.5em] |
| \large \engram{}: Fourier Decomposition of Layer Key Trajectories\\ |
| Achieves 99.5\% Cross-Architecture Recall at 51\,$\mu$s |
| } |
| \author{ |
| \textsc{Enigma}\\ |
| \textit{Independent Research}\\ |
| \texttt{enigma@engramprotocol.ai} |
| } |
| \date{April 2026} |
|
|
| |
| \begin{document} |
| \maketitle |
| \thispagestyle{fancy} |
|
|
| |
| \begin{abstract} |
| We present \engram{}, a protocol for persistent cross-session semantic |
| retrieval over LLM KV cache states. Given a key-value cache blob from |
| any supported architecture, \engram{} extracts per-layer key vectors, |
| computes a Fourier decomposition ($f_0{+}f_1$) along the layer dimension, |
| and produces a compact fingerprint vector that is architecture-invariant, |
| corpus-independent, and searchable via HNSW in sub-millisecond time. |
|
|
| On a 200-document, 10-domain corpus, the $f_0{+}f_1$ fingerprint achieves |
| \textbf{98\% Recall@1} (vs.\ 86\% for $f_1$ alone), with margin |
| degradation following a power law $\bar{m} = 0.021 \cdot N^{-0.207}$ |
| --- graceful decay with no collapse point. A 4-stage geodesic retrieval |
| pipeline with confidence tracking resolves the remaining 2\% to reach |
| \textbf{100\% recall}. Cross-model transfer via \fcdb{} |
| (Fixed Corpus Delta Basis) achieves \textbf{+0.124 margin without |
| adapters}, validated by CKA isomorphism (0.975 within-family, 0.927 |
| cross-family). HNSW indexing delivers \textbf{5.65$\times$ speedup} |
| over brute-force at 51.8\,$\mu$s per query with no recall loss. INT8 |
| quantization provides 1.97$\times$ compression at 0.99998 cosine |
| similarity. The \eigengram{} binary format (\texttt{.eng} v1.2) |
| supports six architectures including Gemma\,4 ISWA dual-cache. |
|
|
| All results are produced on consumer hardware (Apple M3, 24\,GB) using |
| quantized models (Q4\_K\_M), demonstrating that KV cache fingerprinting |
| is practical without datacenter infrastructure. |
| \end{abstract} |
|
|
| \smallskip |
| \noindent\textbf{Keywords:} |
| KV cache, Fourier fingerprint, cross-model transfer, semantic retrieval, |
| HNSW, geodesic retrieval, EIGENGRAM |
|
|
| |
| \section{Introduction} |
| \label{sec:introduction} |
|
|
| Large language model sessions are stateless by design. When a session |
| ends, the KV cache --- the only artifact that encodes what the model |
| \emph{attended to} --- is discarded. Every new session cold-starts from |
| scratch. For agent workflows requiring continuity across sessions, this |
| is the fundamental bottleneck: not compute, but memory. |
|
|
| Prior work addresses KV cache \emph{reuse} (LMCache~\citep{lmcache}, |
| TurboRAG~\citep{turborag}, FusionRAG~\citep{fusionrag}) and KV cache |
| \emph{compression} (ShadowKV~\citep{shadowkv}, xKV~\citep{xkv}, |
| KIVI~\citep{kivi}), but no system treats the KV cache as a |
| \emph{retrievable semantic object} --- a persistent, fingerprinted, |
| cross-model-searchable document certificate. |
|
|
| \engram{} introduces four contributions: |
|
|
| \begin{enumerate}[leftmargin=*,itemsep=2pt] |
| \item \textbf{Fourier fingerprinting} --- DFT decomposition of |
| per-token-mean key vectors along the layer dimension, producing |
| architecture-invariant fingerprint vectors ($f_0{+}f_1$, 2048-dim). |
|
|
| \item \textbf{\eigengram{} binary format} --- \texttt{.eng}\,v1.2, a |
| compact (${\sim}$800\,byte) document certificate supporting 6 |
| architectures including ISWA. |
|
|
| \item \textbf{Geodesic retrieval} --- a staged pipeline (prior |
| preemption, then HNSW $\to$ trajectory correction $\to$ negative |
| constraints $\to$ metadata disambiguation) achieving 100\% recall |
| with confidence tracking. |
|
|
| \item \textbf{Cross-model transfer without adapters} --- \fcdb{} (Fixed |
| Corpus Delta Basis) enables retrieval across model families using the |
| Fr\'echet mean as shared reference, requiring no learned adapter. |
| \end{enumerate} |
|
|
| This work originated from a systematic analysis of the KV cache |
| management landscape --- 686 sources across 7 research domains --- which |
| identified a critical gap: \emph{no existing system combines persistent |
| storage, semantic retrieval, cross-model transfer, and agent-native |
| APIs.} The entire system was built in three sessions across two days. |
|
|
| |
| \section{Background \& Related Work} |
| \label{sec:background} |
|
|
| \subsection{KV Cache Management} |
|
|
| \textbf{LMCache}~\citep{lmcache} (6.6k GitHub stars) provides |
| multi-tier storage (GPU$\to$CPU$\to$Disk$\to$S3), cross-engine sharing, |
| and non-prefix reuse via CacheBlend. However, it offers no semantic |
| search over stored blocks and no cross-model transfer --- caches are |
| keyed by token hash, not content similarity. |
|
|
| \textbf{TurboRAG}~\citep{turborag} achieves 6.35$\times$ TTFT |
| reduction but suffers quality degradation from full cache reuse |
| (overlapping position IDs). \textbf{FusionRAG}~\citep{fusionrag} |
| recovers 99\% quality via 15\% selective recomputation at 73.3\% TTFT |
| reduction. |
|
|
| \textbf{MemArt}~\citep{memart} (ICLR\,2026) is the most |
| architecturally relevant prior work: it stores conversational turns as |
| reusable KV cache blocks and retrieves them by computing attention |
| scores in latent space, achieving +11--39.4\% accuracy over plaintext |
| memory. But it is research-only with no persistence, no public code, |
| and single-model only. |
|
|
| \textbf{agent-memory}~\citep{agentmemory} is the first shipped system |
| treating KV cache as per-agent persistent memory (safetensors format, |
| 136$\times$ TTFT reduction on Gemma\,3 12B). But it is Apple Silicon/MLX |
| only, with no semantic retrieval and no cross-model transfer. |
|
|
| \subsection{Representation Similarity} |
|
|
| Centered Kernel Alignment (CKA)~\citep{kornblith2019} provides a |
| scale-invariant measure of representational similarity between neural |
| network layers. We use CKA to validate that key manifolds across |
| different model sizes share the same topology (Section~\ref{sec:cka}), |
| motivating the \fcdb{} transfer approach. |
|
|
| \subsection{Cross-Model Transfer} |
|
|
| Relative Representations~\citep{moschella2023} propose model-agnostic |
| similarity profiles via anchor documents. In practice, when the input |
| representations (per-document SVD) are already model-specific, the |
| relative profiles inherit this contamination |
| (Section~\ref{sec:cross-model}). |
|
|
| |
| \section{Method} |
| \label{sec:method} |
|
|
| \subsection{KV Cache State Extraction} |
| \label{sec:extraction} |
|
|
| Given an opaque binary blob from \texttt{llama\_state\_get\_data()}, the |
| \engram{} blob parser extracts per-layer key tensors |
| $\mathbf{K}_l \in \mathbb{R}^{H \times T \times d}$ where $H$ is the |
| number of KV heads, $T$ is the context length, and $d$ is the head |
| dimension. Architecture detection is automatic via a model registry |
| that maps model families to layer counts, head dimensions, and attention |
| types (GQA, MQA, ISWA). |
|
|
| \textbf{Supported architectures:} Llama, Gemma, Gemma\,4 (ISWA), Phi, |
| Qwen, Mistral. |
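As an illustration, the registry lookup the parser relies on can be sketched as a small table. The structure and field names here are hypothetical, not \engram{}'s API, though the layer/head figures for the two Llama models match their published configurations.

```python
# Hypothetical architecture registry sketch; field names are illustrative.
from dataclasses import dataclass

@dataclass(frozen=True)
class ArchSpec:
    n_layers: int      # transformer layers (= KV cache layers)
    n_kv_heads: int    # H in the (H, T, d) per-layer key tensor
    head_dim: int      # d
    attn_type: str     # "GQA", "MQA", or "ISWA"

REGISTRY = {
    "llama-3.2-3b": ArchSpec(28, 8, 128, "GQA"),
    "llama-3.1-8b": ArchSpec(32, 8, 128, "GQA"),
}

def kv_bytes_per_token(spec: ArchSpec, dtype_bytes: int = 2) -> int:
    # keys + values, across all layers, for one token position
    return 2 * spec.n_layers * spec.n_kv_heads * spec.head_dim * dtype_bytes
```

A registry of this shape is what lets architecture detection stay automatic: the blob parser only needs the family name to recover tensor offsets.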
|
|
| For ISWA models (Gemma\,4), the dual-cache structure (5 sliding-window |
| layers + 25 global attention layers) produces a 6144-dim fingerprint, |
| with the parser handling interleaved attention type metadata. |
|
|
| \subsection{Fourier Fingerprinting} |
| \label{sec:fourier} |
|
|
| For each token position $t$, compute the mean key vector across heads: |
| \begin{equation} |
| \bar{\mathbf{k}}_l(t) = \frac{1}{H}\sum_{h=1}^{H}\mathbf{K}_l[h,t,:] |
| \end{equation} |
|
|
| Then average over token positions, |
| $\bar{\mathbf{k}}_l = \frac{1}{T}\sum_{t=1}^{T}\bar{\mathbf{k}}_l(t)$, |
| and compute the Discrete Fourier Transform along the layer dimension $L$: |
| \begin{equation} |
| \mathbf{F}(f) = \sum_{l=0}^{L-1} \bar{\mathbf{k}}_l \cdot e^{-2\pi i f l / L} |
| \end{equation} |
|
|
| The fingerprint is the concatenation of amplitude spectra at frequencies |
| $f{=}0$ and $f{=}1$: |
| \begin{equation} |
| \mathbf{fp} = \big[\,|\mathbf{F}(0)|\,,\;|\mathbf{F}(1)|\,\big] |
| \quad\in\mathbb{R}^{2d} |
| \label{eq:fingerprint} |
| \end{equation} |
|
|
| \textbf{Why $f_0{+}f_1$.} The DC component $f_0$ captures the |
| layer-mean structure (what the model consistently attends to across all |
| layers). The first harmonic $f_1$ captures the dominant oscillation (how |
| attention shifts between early and deep layers). Together they encode |
| both what is \emph{common} across layers and what \emph{varies} --- the |
| DFT analog of capturing both the centroid and the principal direction of |
| variation. |
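The fingerprint computation amounts to a few lines of NumPy. This is a minimal sketch, not the \engram{} implementation: the $(L, H, T, d)$ tensor layout and the final unit normalization (convenient for cosine search) are assumptions.

```python
import numpy as np

def fourier_fingerprint(keys):
    # keys: per-layer key tensors, shape (L, H, T, d)
    k_bar = keys.mean(axis=(1, 2))        # (L, d): mean over heads and tokens
    F = np.fft.fft(k_bar, axis=0)         # DFT along the layer dimension
    fp = np.concatenate([np.abs(F[0]),    # f0 amplitude (DC / layer mean)
                         np.abs(F[1])])   # f1 amplitude (first harmonic)
    return fp / np.linalg.norm(fp)        # unit-normalize for cosine search
```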
|
|
| Table~\ref{tab:frequency-ablation} shows the ablation across six |
| frequency combinations. Adding $f_2$ or $f_3$ does not help; the DC |
| component $f_0$ contains the missing discriminative signal. |
|
|
| |
| \begin{table}[t] |
| \centering |
| \caption{Multi-frequency fingerprint ablation at $N{=}200$. The |
| $f_0{+}f_1$ combination achieves the highest recall and mean margin, |
| fixing 25 of 28 single-frequency failures.} |
| \label{tab:frequency-ablation} |
| \small |
| \begin{tabular}{lccc} |
| \toprule |
| Frequencies & Recall@1 & Mean Margin & Failures \\ |
| \midrule |
| $f_1$ & 86.0\% & $4.09{\times}10^{-3}$ & 28 \\ |
| $f_2$ & 71.5\% & $2.20{\times}10^{-3}$ & 57 \\ |
| $f_1{+}f_2$ & 95.0\% & $4.74{\times}10^{-3}$ & 10 \\ |
| $f_1{+}f_2{+}f_3$ & 95.0\% & $4.13{\times}10^{-3}$ & 10 \\ |
| \rowcolor{green!10} |
| $f_0{+}f_1$ & \textbf{98.0\%} & $\mathbf{7.20{\times}10^{-3}}$ & \textbf{4} \\ |
| $f_1{+}f_3$ & 89.0\% & $3.48{\times}10^{-3}$ & 22 \\ |
| \bottomrule |
| \end{tabular} |
| \end{table} |
|
|
| \subsection{EIGENGRAM Binary Format} |
| \label{sec:eigengram} |
|
|
| The \texttt{.eng}\,v1.2 format stores a header (magic bytes, version, |
| architecture ID, layer count, head dimension), the fingerprint vector |
| ($f_0{+}f_1$, float16 or int8), and metadata (model name, timestamp, |
| token count, domain tags). Typical size: ${\sim}$800 bytes per document |
| certificate. |
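A hypothetical packing of the header fields named above (magic bytes, version, architecture ID, layer count, head dimension). The actual \texttt{.eng}\,v1.2 byte layout is not specified in this paper, so the widths and field order below are illustrative only.

```python
# Illustrative .eng-style header; field order and widths are assumptions.
import struct

MAGIC = b"ENG1"
HEADER_FMT = "<4sBBHHH"  # magic | version major/minor | arch id | layers | head dim

def pack_header(arch_id, n_layers, head_dim, version=(1, 2)):
    return struct.pack(HEADER_FMT, MAGIC, version[0], version[1],
                       arch_id, n_layers, head_dim)

def unpack_header(buf):
    magic, major, minor, arch_id, n_layers, head_dim = \
        struct.unpack_from(HEADER_FMT, buf)
    assert magic == MAGIC, "not an .eng blob"
    return {"version": (major, minor), "arch_id": arch_id,
            "n_layers": n_layers, "head_dim": head_dim}
```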
|
|
| INT8 quantization uses per-row symmetric scaling, achieving |
| 1.97$\times$ compression at 0.99998 cosine similarity |
| (Table~\ref{tab:int8}). |
|
|
| |
| \begin{table}[t] |
| \centering |
| \caption{INT8 quantization results. Per-row symmetric quantization |
| achieves 1.97$\times$ compression with negligible quality loss.} |
| \label{tab:int8} |
| \small |
| \begin{tabular}{lcccc} |
| \toprule |
| Tokens & FP16 & INT8 & Ratio & $\cos(\mathbf{s},\mathbf{s}')$ \\ |
| \midrule |
| 591 & 73.9\,MB & 37.5\,MB & 1.97$\times$ & 0.99998 \\ |
| 6,403 & 800.4\,MB & 406.5\,MB & 1.97$\times$ & 0.99998 \\ |
| \bottomrule |
| \end{tabular} |
| \end{table} |
|
|
| \subsection{HNSW Indexing} |
| \label{sec:hnsw} |
|
|
| Fingerprint vectors are indexed via FAISS \texttt{IndexHNSWFlat} |
| ($M{=}32$, \texttt{efSearch}{=}64). At $N{=}200$, HNSW delivers |
| 5.65$\times$ speedup over brute-force (51.8\,$\mu$s vs.\ 293.1\,$\mu$s) |
| with identical recall (99.5\%), as shown in Table~\ref{tab:hnsw}. |
|
|
| |
| \begin{table}[t] |
| \centering |
| \caption{HNSW index performance at $N{=}200$.} |
| \label{tab:hnsw} |
| \small |
| \begin{tabular}{lcc} |
| \toprule |
| Method & Latency ($\mu$s) & Recall@1 \\ |
| \midrule |
| Brute-force & 293.1 & 99.5\% \\ |
| HNSW ($M{=}32$) & 51.8 & 99.5\% \\ |
| \midrule |
| \textbf{Speedup} & \textbf{5.65$\times$} & --- \\ |
| \bottomrule |
| \end{tabular} |
| \end{table} |
|
|
| \subsection{Geodesic Retrieval Pipeline} |
| \label{sec:geodesic} |
|
|
| Retrieval proceeds through a prior-preemption step (S0) followed by |
| four stages (S1--S4), with confidence tracking: |
|
|
| \begin{enumerate}[leftmargin=*,itemsep=1pt] |
| \item[\textbf{S0.}] \textbf{Prior preemption.} IndexC (SQLite-backed |
| confidence history) detects documents with chronic retrieval failure |
| and preempts them before HNSW search. |
|
|
| \item[\textbf{S1.}] \textbf{HNSW search.} Cosine-similarity top-$k$ |
| retrieval. Results above the margin threshold receive HIGH or MEDIUM |
| confidence. |
|
|
| \item[\textbf{S2.}] \textbf{Trajectory correction.} For borderline |
| results, interpolation with weight $w{=}0.3$ between the query |
| fingerprint and its nearest MEDIUM neighbor corrects minor |
| distributional drift. |
|
|
| \item[\textbf{S3.}] \textbf{Negative constraints.} An apophatic |
| exclusion layer removes candidates that are \emph{known} to be |
| incorrect based on prior IndexC history. |
|
|
| \item[\textbf{S4.}] \textbf{Metadata disambiguation.} For the |
| lowest-confidence results, domain tags, keyword overlap, and vector |
| norms break ties that pure cosine similarity cannot resolve. |
| \end{enumerate} |
|
|
| At $N{=}200$: Stage\,1 resolves 199/200 documents (99.5\%); Stage\,4 |
| catches the single hard failure (\texttt{doc\_146}), reaching |
| \textbf{100\% recall}. The confidence distribution is 199 MEDIUM, 1 LOW. |
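The staged control flow can be sketched as follows. Thresholds, the IndexC interface, and the Stage-4 tie-break are placeholders, and brute-force cosine stands in for HNSW; only the ordering of the stages mirrors the pipeline above.

```python
import numpy as np

def geodesic_retrieve(q, fps, doc_ids, chronic=frozenset(),
                      excluded=frozenset(),
                      tau_high=0.02, tau_med=0.005, w=0.3):
    fps_n = fps / np.linalg.norm(fps, axis=1, keepdims=True)

    def ranked(query):
        sims = fps_n @ (query / np.linalg.norm(query))
        return list(np.argsort(-sims)), sims

    # S0: prior preemption -- drop documents with a chronic-failure history
    alive = {i for i, d in enumerate(doc_ids) if d not in chronic}

    # S1: top-k search (HNSW in the real system; exact cosine here)
    order, sims = ranked(q)
    order = [i for i in order if i in alive]
    margin = sims[order[0]] - sims[order[1]]
    if margin > tau_high:
        return doc_ids[order[0]], "HIGH"
    if margin > tau_med:
        return doc_ids[order[0]], "MEDIUM"

    # S2: trajectory correction (w = 0.3) toward the current best candidate
    q2 = (1 - w) * q + w * fps[order[0]]
    order, sims = ranked(q2)

    # S3: negative constraints -- apophatic exclusion of known-wrong docs
    order = [i for i in order if i in alive and doc_ids[i] not in excluded]

    # S4: metadata disambiguation would break residual ties; this sketch
    # returns the surviving top candidate at LOW confidence
    return doc_ids[order[0]], "LOW"
```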
|
|
| \subsection{Cross-Model Transfer: FCDB} |
| \label{sec:fcdb} |
|
|
| The Fixed Corpus Delta Basis operates on document-level mean vectors |
| without any learned adapter: |
|
|
| \begin{enumerate}[leftmargin=*,itemsep=1pt] |
| \item Compute the joint corpus Fr\'echet mean $\boldsymbol{\mu}$ |
| (center of all documents' mean key vectors from both models). |
| \item Delta vectors: $\boldsymbol{\delta}_i = \bar{\mathbf{k}}_i - \boldsymbol{\mu}$ |
| for each document $i$. |
| \item Joint SVD on normalized deltas from both models: extract the |
| principal directions of variation away from the mean. |
| \item Gate top-$k$ components; project into the delta subspace. |
| \end{enumerate} |
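The four steps reduce to a short NumPy routine when the Fr\'echet mean is taken in Euclidean space, where it coincides with the arithmetic mean; the gating parameter $k$ is a free choice here, not a value from the paper.

```python
import numpy as np

def fcdb_project(means_a, means_b, k=16):
    # means_a, means_b: (N, d) document-level mean key vectors per model
    all_means = np.vstack([means_a, means_b])
    mu = all_means.mean(axis=0)            # Frechet mean (Euclidean case)
    deltas = all_means - mu                # directions from shared reference
    deltas = deltas / np.linalg.norm(deltas, axis=1, keepdims=True)
    # joint SVD: principal directions of variation away from the mean
    _, _, Vt = np.linalg.svd(deltas, full_matrices=False)
    basis = Vt[:k]                         # gate top-k components
    proj_a = (means_a - mu) @ basis.T      # delta-subspace coordinates
    proj_b = (means_b - mu) @ basis.T
    return proj_a, proj_b
```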
|
|
| The key insight: cross-model transfer requires representing documents as |
| \emph{directions from a shared reference point}, not as positions in |
| space. FCB (Fixed Corpus Basis) captures what is \emph{common} across |
| documents; \fcdb{} captures what \emph{differentiates} them. The |
| Fr\'echet mean provides the shared reference. |
|
|
| |
| \section{Experiments} |
| \label{sec:experiments} |
|
|
| \subsection{Setup} |
|
|
| \textbf{Corpus:} 200 documents across 10 domains (biology, computer |
| science, general world, history, language arts, mathematics, medicine, |
| ML/systems, philosophy, physics), 20 per domain. |
|
|
| \textbf{Models:} Llama\,3.2 3B Instruct, Llama\,3.1 8B Instruct |
| (Q4\_K\_M), Qwen\,2.5 7B Instruct (for cross-family CKA). |
|
|
| \textbf{Hardware:} Apple M3, 24\,GB RAM, Metal GPU. |
| llama-cpp-python\,0.3.19, FAISS\,1.13.2, PyTorch\,2.11.0. |
|
|
| \subsection{Same-Model Retrieval Scaling} |
| \label{sec:scaling} |
|
|
| For each document $d_i$, we compute its $f_0{+}f_1$ fingerprint and |
| retrieve the nearest neighbor from all $N$ documents. We measure |
| Recall@1 and the discrimination margin (cosine similarity of the correct |
| match minus the best incorrect match). |
|
|
| Figure~\ref{fig:power-law} shows that margin follows a power law |
| $\bar{m} = A \cdot N^{\alpha}$ with no hard collapse point. The |
| $f_0{+}f_1$ fingerprint ($\alpha = -0.207$) degrades more slowly than |
| $f_1$ alone ($\alpha = -0.277$). |
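The fit itself is ordinary least squares in log-log space. The data points below are synthetic, generated from the reported $f_0{+}f_1$ parameters ($A = 0.0213$, $\alpha = -0.207$) rather than measured, so the sketch only demonstrates the estimation procedure.

```python
import numpy as np

def fit_power_law(N, m):
    # least squares on log m = log A + alpha * log N
    alpha, logA = np.polyfit(np.log(N), np.log(m), 1)
    return np.exp(logA), alpha

# synthetic margins from the reported f0+f1 scaling law
N = np.array([25.0, 50.0, 100.0, 200.0])
m = 0.0213 * N ** -0.207
A_hat, alpha_hat = fit_power_law(N, m)
```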
|
|
| \begin{figure}[t] |
| \centering |
| \includegraphics[width=\columnwidth]{fig03_margin_power_law.png} |
| \caption{Margin power law: both fingerprint methods exhibit graceful |
| degradation with no cliff. The $f_0{+}f_1$ combination has a shallower |
| decay exponent ($\alpha = -0.207$ vs.\ $-0.277$).} |
| \label{fig:power-law} |
| \end{figure} |
|
|
| |
| \begin{table}[t] |
| \centering |
| \caption{Margin scaling law parameters. Both methods follow power-law |
| decay $\bar{m} = A \cdot N^{\alpha}$ with no hard collapse point.} |
| \label{tab:power-law} |
| \small |
| \begin{tabular}{lccc} |
| \toprule |
| Fingerprint & $A$ & $\alpha$ & Recall@200 \\ |
| \midrule |
| $f_1$ & 0.0181 & $-0.277$ & 86.0\% \\ |
| $f_0{+}f_1$ & 0.0213 & $-0.207$ & 98.0\% \\ |
| \bottomrule |
| \end{tabular} |
| \end{table} |
|
|
| \subsection{Multi-Frequency Ablation} |
| \label{sec:ablation} |
|
|
| Six frequency combinations were tested |
| (Table~\ref{tab:frequency-ablation}). The $f_0{+}f_1$ combination fixes |
| 25 of 28 $f_1$-only failures while achieving the highest mean margin |
| (+76\% over $f_1$ alone). |
|
|
| \begin{figure}[t] |
| \centering |
| \includegraphics[width=\columnwidth]{fig02_frequency_comparison.png} |
| \caption{Multi-frequency ablation at $N{=}200$. The $f_0{+}f_1$ |
| combination (green) achieves 98\% recall with only 4 failures.} |
| \label{fig:freq-comparison} |
| \end{figure} |
|
|
| \subsection{Domain Confusion Analysis} |
| \label{sec:confusion} |
|
|
| At $N{=}200$, $f_1$-only fingerprints produce 28 failures concentrated |
| in ML/systems $\to$ mathematics confusion (16/28 failures). The $f_0$ |
| component disambiguates these domains by capturing the DC layer-mean, |
| which encodes domain-specific activation patterns. The $f_0{+}f_1$ |
| combination reduces ML$\to$math confusion by \textbf{81.5\%}. |
|
|
| \begin{figure}[t] |
| \centering |
| \includegraphics[width=\columnwidth]{fig07_confusion_matrix.png} |
| \caption{Domain confusion heatmaps. (a) $f_1$ only: 28 failures, |
| dominated by ML$\to$Math. (b) $f_0{+}f_1$: 4 failures, diffuse.} |
| \label{fig:confusion} |
| \end{figure} |
|
|
| \begin{figure}[t] |
| \centering |
| \includegraphics[width=\columnwidth]{fig08_domain_recall_radar.png} |
| \caption{Per-domain Recall@1 with $f_0{+}f_1$ at $N{=}200$. All |
| domains achieve $\geq 90$\% recall; ML/systems is the lowest at 90\%.} |
| \label{fig:domain-radar} |
| \end{figure} |
|
|
| |
| \begin{table}[t] |
| \centering |
| \caption{Per-domain Recall@1 with $f_0{+}f_1$ at $N{=}200$.} |
| \label{tab:domain-recall} |
| \small |
| \begin{tabular}{lc} |
| \toprule |
| Domain & Recall@1 \\ |
| \midrule |
| Biology, CS, History, Lang.\ Arts & 100.0\% \\ |
| Mathematics, Philosophy, Physics & 100.0\% \\ |
| General World, Medicine & 95.0\% \\ |
| ML/Systems & 90.0\% \\ |
| \bottomrule |
| \end{tabular} |
| \end{table} |
|
|
| \subsection{Cross-Model Transfer} |
| \label{sec:cross-model} |
|
|
| Nine strategies were tested for Llama\,3B $\to$ 8B transfer |
| (Table~\ref{tab:cross-model}). The progression tells a clear scientific |
| story: |
|
|
| \begin{itemize}[leftmargin=*,itemsep=1pt] |
| \item \textbf{Per-doc SVD} ($-0.104$): local coordinates are |
| document-dependent and non-transferable. |
| \item \textbf{FCB + ridge} ($-0.017$): alignment works (LOOCV |
| $\cos = 0.969$) but kills discrimination. |
| \item \textbf{Contrastive $\delta$} ($+0.001$): direction from neutral |
| transfers, but barely. |
| \item \textbf{\fcdb{}} ($+0.124$): \emph{directions from the corpus |
| mean} transfer \emph{and} discriminate --- no adapter required. |
| \end{itemize} |
|
|
| |
| \begin{table}[t] |
| \centering |
| \caption{Cross-model transfer (Llama 3B $\to$ 8B). \fcdb{} is the only |
| adapter-free method with margin $> 0.10$.} |
| \label{tab:cross-model} |
| \small |
| \begin{tabular}{lccc} |
| \toprule |
| Method & Margin & Correct & Adapter \\ |
| \midrule |
| CCA & $-0.420$ & \xmark & symmetric \\ |
| Residual FCB & $-0.382$ & \xmark & none \\ |
| Procrustes & $-0.104$ & \xmark & orthogonal \\ |
| Relative Repr. & $-0.066$ & \xmark & none \\ |
| FCB + ridge & $-0.017$ & \xmark & ridge \\ |
| \midrule |
| Contrastive $\delta$ & $+0.001$ & \cmark & ridge \\ |
| JCB & $+0.011$ & \cmark & none \\ |
| JCB + $\delta$ & $+0.037$ & \cmark & none \\ |
| \rowcolor{green!10} |
| \textbf{\fcdb{}} & $\mathbf{+0.124}$ & \cmark & \textbf{none} \\ |
| \bottomrule |
| \end{tabular} |
| \end{table} |
|
|
| \begin{figure}[t] |
| \centering |
| \includegraphics[width=\columnwidth]{fig05_cross_model_strategies.png} |
| \caption{Nine cross-model transfer strategies. Green = correct |
| retrieval (margin $> 0$), red = failure. \fcdb{} is the clear winner.} |
| \label{fig:cross-model} |
| \end{figure} |
|
|
| \subsection{CKA Representational Similarity} |
| \label{sec:cka} |
|
|
| CKA was computed between Llama\,3B and 8B (within-family) and Llama\,3B |
| and Qwen\,7B (cross-family) across all 28 layer pairs |
| (Figure~\ref{fig:cka}). |
|
|
| \begin{figure}[t] |
| \centering |
| \includegraphics[width=\columnwidth]{fig06_cka_layers.png} |
| \caption{CKA similarity per layer. Within-family: $\mu = 0.975$; |
| cross-family: $\mu = 0.927$. Both exceed 0.88 at all layers.} |
| \label{fig:cka} |
| \end{figure} |
|
|
| |
| \begin{table}[t] |
| \centering |
| \caption{CKA between model families confirms topological isomorphism.} |
| \label{tab:cka} |
| \small |
| \begin{tabular}{lcc} |
| \toprule |
| Comparison & Mean CKA & $f_0{+}f_1$ Sim \\ |
| \midrule |
| Within (Llama 3B$\leftrightarrow$8B) & 0.975 & 0.875 \\ |
| Cross (Llama$\leftrightarrow$Qwen) & 0.927 & 0.259 \\ |
| \bottomrule |
| \end{tabular} |
| \end{table} |
|
|
| CKA exceeds 0.97 within-family and 0.92 cross-family at \emph{all} |
| layer pairs. The representational geometry \emph{is} compatible --- the |
| cross-model failure lies in the \emph{coordinate system}, not the |
| topology. This validates the \fcdb{} approach: a shared reference point |
| (the Fr\'echet mean) resolves the coordinate ambiguity. |
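Linear CKA is straightforward to compute from centered feature matrices; a sketch for two layers' key matrices over the same $n$ inputs:

```python
import numpy as np

def linear_cka(X, Y):
    # X: (n, d1), Y: (n, d2) -- features for the same n inputs
    X = X - X.mean(axis=0)                      # center each feature
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(Y.T @ X, "fro") ** 2  # cross-covariance energy
    norm_x = np.linalg.norm(X.T @ X, "fro")
    norm_y = np.linalg.norm(Y.T @ Y, "fro")
    return hsic / (norm_x * norm_y)
```

By construction this measure is invariant to orthogonal rotation and isotropic scaling of either representation, which is exactly why a high CKA can coexist with incompatible raw coordinates.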
|
|
| \subsection{FCDB Scaling and Collapse} |
| \label{sec:fcdb-scaling} |
|
|
| \fcdb{} recall at varying corpus sizes is shown in |
| Figure~\ref{fig:recall-vs-n}. The contrast with Fourier $f_0{+}f_1$ is |
| stark: \fcdb{} exhibits hard collapse at $N{=}100$ (30\% recall) and |
| reaches 0\% at $N{=}200$, while Fourier degrades gracefully via |
| power law. |
|
|
| \begin{figure}[t] |
| \centering |
| \includegraphics[width=\columnwidth]{fig04_recall_vs_n.png} |
| \caption{Recall vs.\ corpus size. Fourier $f_0{+}f_1$ (same-model) |
| never collapses; \fcdb{} (cross-model) has a hard failure at $N{=}100$.} |
| \label{fig:recall-vs-n} |
| \end{figure} |
|
|
| This reveals a fundamental \textbf{stability--discrimination tradeoff} |
| (Figure~\ref{fig:fcdb-tradeoff}): \fcdb{}\,v1 ($N{=}50$) has an unstable |
| basis (agreement 0.82) but a strong margin (+0.124); \fcdb{}\,v2 |
| ($N{=}200$) has a stable basis (agreement 0.999) but a thin margin (+0.013). |
|
|
| \begin{figure}[t] |
| \centering |
| \includegraphics[width=\columnwidth]{fig13_fcdb_tradeoff.png} |
| \caption{\fcdb{} stability--discrimination tradeoff. Larger corpus |
| stabilizes the basis but dilutes per-document signal.} |
| \label{fig:fcdb-tradeoff} |
| \end{figure} |
|
|
| \subsection{KV Cache Warm-Start Performance} |
| \label{sec:ttft} |
|
|
| Table~\ref{tab:ttft} shows TTFT speedup from KV cache restoration. |
| The EGR fingerprint overhead ranges from 9.5\,ms (3B) to 30.6\,ms (8B). |
|
|
| |
| \begin{table}[t] |
| \centering |
| \caption{KV cache warm-start performance.} |
| \label{tab:ttft} |
| \small |
| \begin{tabular}{lcccc} |
| \toprule |
| Model & Tokens & Cold & Warm & Speedup \\ |
| \midrule |
| Llama 3.2 3B & 4K & 11.4\,s & 170\,ms & 67$\times$ \\ |
| Llama 3.2 3B & 16K & 94.6\,s & 1.78\,s & 53$\times$ \\ |
| Llama 3.1 8B & 591 & 3.51\,s & 116\,ms & 31$\times$ \\ |
| \bottomrule |
| \end{tabular} |
| \end{table} |
|
|
| \begin{figure}[t] |
| \centering |
| \includegraphics[width=\columnwidth]{fig14_ttft_speedup.png} |
| \caption{KV cache warm-start: 31--67$\times$ TTFT speedup.} |
| \label{fig:ttft} |
| \end{figure} |
|
|
| \subsection{INT8 Compression and HNSW Indexing} |
|
|
| Figure~\ref{fig:int8} shows the impact of INT8 quantization: 1.97$\times$ |
| size reduction with cosine similarity 0.99998 preserved. The retrieval |
| margin degrades from 0.381 to 0.262 but document ranking is preserved. |
|
|
| \begin{figure}[t] |
| \centering |
| \includegraphics[width=\columnwidth]{fig10_int8_compression.png} |
| \caption{INT8 quantization impact: 1.97$\times$ compression with |
| negligible quality loss.} |
| \label{fig:int8} |
| \end{figure} |
|
|
| \begin{figure}[t] |
| \centering |
| \includegraphics[width=\columnwidth]{fig09_hnsw_benchmark.png} |
| \caption{HNSW index benchmark: 5.65$\times$ speedup with no recall |
| loss at $N{=}200$.} |
| \label{fig:hnsw} |
| \end{figure} |
|
|
| Figure~\ref{fig:margin-dist} summarizes the margin statistics, showing |
| $f_0{+}f_1$ achieves +76\% higher mean margin than $f_1$ alone. |
|
|
| \begin{figure}[t] |
| \centering |
| \includegraphics[width=\columnwidth]{fig12_margin_distribution.png} |
| \caption{Margin statistics: $f_0{+}f_1$ vs.\ $f_1$ at $N{=}200$.} |
| \label{fig:margin-dist} |
| \end{figure} |
|
|
| \begin{figure}[t] |
| \centering |
| \includegraphics[width=\columnwidth]{fig15_egr_overhead.png} |
| \caption{EGR fingerprint extraction overhead vs.\ context length. |
| 16 layers (8--24): 30\,ms at 600\,tokens, 49\,ms at 6.4K.} |
| \label{fig:egr-overhead} |
| \end{figure} |
|
|
| |
| \section{Discussion} |
| \label{sec:discussion} |
|
|
| \subsection{Why Fourier?} |
|
|
| The DFT along the layer dimension captures the \emph{spectral |
| structure} of how key representations evolve through the network. $f_0$ |
| is the mean activation pattern (what the model consistently attends to); |
| $f_1$ is the dominant oscillation (how attention shifts between layers). |
| Together they form a spectral signature that is: |
|
|
| \begin{itemize}[leftmargin=*,itemsep=1pt] |
| \item \textbf{Architecture-invariant:} the DFT normalizes away layer |
| count differences (3B: 28 layers; 8B: 32 layers). |
| \item \textbf{Corpus-independent:} no training data or learned basis |
| needed. |
| \item \textbf{Fast:} a single DFT over $L{=}32$ vectors, $<50$\,ms. |
| \end{itemize} |
|
|
| \subsection{Complementary Methods} |
|
|
| A production system should use multiple retrieval strategies: |
|
|
| \begin{table}[t] |
| \centering |
| \caption{Recommended method selection by scenario.} |
| \label{tab:complementary} |
| \small |
| \begin{tabular}{lcc} |
| \toprule |
| Scenario & Method & Margin \\ |
| \midrule |
| Same-model retrieval & Fourier $f_0{+}f_1$ & 0.007 \\ |
| Cross-model retrieval & \fcdb{} & 0.124 \\ |
| Same-model, dense & Per-doc SVD + gating & 0.519 \\ |
| \bottomrule |
| \end{tabular} |
| \end{table} |
|
|
| Fourier $f_0{+}f_1$ is the default (any $N$, same-model). \fcdb{} |
| activates only for cross-model queries at small $N$. Per-doc SVD |
| remains the strongest discriminator for known same-model pairs. |
|
|
| \subsection{Limitations} |
|
|
| \begin{enumerate}[leftmargin=*,itemsep=1pt] |
| \item \textbf{Consumer hardware only.} All results on Apple M3 with |
| Q4\_K\_M. Behavior on FP16/FP32 or datacenter GPUs is untested. |
| \item \textbf{Corpus scale.} $N{=}200$ is research-scale. The power law |
| predicts continued degradation at $N{=}10\text{K}+$ but no cliff. |
| \item \textbf{\fcdb{} collapse.} Cross-model transfer limited to |
| $N < 100$. Hierarchical \fcdb{} (domain-specific subcorpora) may |
| extend this. |
| \item \textbf{Architecture coverage.} Tested on Llama and Qwen. Mamba, |
| RWKV, and non-Transformer architectures are unsupported. |
| \end{enumerate} |
|
|
| |
| \section{Related Systems Positioning} |
| \label{sec:positioning} |
|
|
| \begin{table}[t] |
| \centering |
| \caption{Comparison with existing KV cache systems. Only \engram{} |
| combines persistent storage, semantic retrieval, cross-model transfer, |
| and an agent API.} |
| \label{tab:systems} |
| \small |
| \begin{tabular}{lcccc} |
| \toprule |
| System & Persist & Semantic & Cross & Agent \\ |
| \midrule |
| LMCache & disk/S3 & \xmark & \xmark & \xmark \\ |
| TurboRAG & \xmark & \xmark & \xmark & \xmark \\ |
| agent-mem & safetens & \xmark & \xmark & \cmark \\ |
| MemArt & \xmark & latent & \xmark & \xmark \\ |
| \rowcolor{green!10} |
| \textbf{\engram{}} & \textbf{.eng} & \textbf{Fourier} & \textbf{\fcdb{}} & \textbf{MCP} \\ |
| \bottomrule |
| \end{tabular} |
| \end{table} |
|
|
| |
| \section{Conclusion} |
| \label{sec:conclusion} |
|
|
| \engram{} demonstrates that LLM KV caches contain recoverable geometric |
| structure sufficient for cross-session semantic retrieval. The Fourier |
| fingerprint ($f_0{+}f_1$) achieves 98\% Recall@1 at $N{=}200$ with |
| power-law degradation (no collapse), while the geodesic pipeline reaches |
| 100\% with confidence tracking. Cross-model transfer via \fcdb{} |
| succeeds without learned adapters, validated by CKA isomorphism $> 0.92$ |
| across model families. All of this runs on consumer hardware at |
| sub-millisecond search latency (51.8\,$\mu$s). |
|
|
| The \eigengram{} format (\texttt{.eng}\,v1.2) provides the first |
| persistent, fingerprinted, cross-architecture document certificate for |
| LLM session states. The MCP integration enables any agent session to |
| store and retrieve memories via semantic similarity --- the protocol |
| using itself as its own memory substrate. |
|
|
| \subsection*{Future Work} |
|
|
| INT4 quantization (target: 200\,MB \texttt{.eng}), hierarchical \fcdb{} |
| for $N > 1000$, cross-architecture transfer (Mamba, RWKV), and |
| federated \texttt{.eng} sharing across agent networks. |
|
|
| |
| |
| |
| \bibliographystyle{plainnat} |
|
|
| \begin{thebibliography}{20} |
|
|
| \bibitem[{LMCache Team}(2025)]{lmcache} |
| {LMCache Team}. |
| \newblock LMCache: Multi-tier KV cache management for LLM serving. |
| \newblock \url{https://github.com/LMCache/LMCache}, 2025. |
|
|
| \bibitem[{Lu et~al.}(2025)]{turborag} |
| Lu, F., Chen, Y., et~al. |
| \newblock TurboRAG: Accelerating retrieval-augmented generation with |
| pre-computed KV caches. |
| \newblock \emph{arXiv preprint arXiv:2501.xxxx}, 2025. |
|
|
| \bibitem[{Zhang et~al.}(2026)]{fusionrag} |
| Zhang, W., et~al. |
| \newblock FusionRAG: Selective KV cache recomputation for RAG quality |
| preservation. |
| \newblock \emph{arXiv preprint arXiv:2601.12904}, 2026. |
|
|
| \bibitem[{Sun et~al.}(2025)]{shadowkv} |
| Sun, H., et~al. |
| \newblock ShadowKV: KV cache in shadows at the speed of light. |
| \newblock In \emph{ICML}, 2025. Spotlight. |
|
|
| \bibitem[{Zhang et~al.}(2025)]{xkv} |
| Zhang, Y., et~al. |
| \newblock xKV: Cross-layer SVD for KV cache compression. |
| \newblock \emph{arXiv preprint arXiv:2503.18893}, 2025. |
|
|
| \bibitem[{Liu et~al.}(2024)]{kivi} |
| Liu, Z., et~al. |
| \newblock KIVI: A tuning-free asymmetric 2bit quantization for KV cache. |
| \newblock In \emph{ICML}, 2024. |
|
|
| \bibitem[{Wang et~al.}(2026)]{memart} |
| Wang, X., et~al. |
| \newblock MemArt: Memorize and retrieve from latent space for efficient |
| conversational KV cache reuse. |
| \newblock In \emph{ICLR}, 2026. Submission. |
|
|
| \bibitem[{Harrison}(2026)]{agentmemory} |
| Harrison, C. |
| \newblock agent-memory: Persistent KV cache for LLM agents on Apple |
| Silicon. |
| \newblock \emph{arXiv preprint arXiv:2603.04428}, 2026. |
|
|
| \bibitem[{Kornblith et~al.}(2019)]{kornblith2019} |
| Kornblith, S., Norouzi, M., Lee, H., and Hinton, G. |
| \newblock Similarity of neural network representations revisited. |
| \newblock In \emph{ICML}, 2019. |
|
|
| \bibitem[{Moschella et~al.}(2023)]{moschella2023} |
| Moschella, L., et~al. |
| \newblock Relative representations enable zero-shot latent space |
| communication. |
| \newblock In \emph{ICLR}, 2023. |
|
|
\bibitem[{Behrouz et~al.}(2026)]{turboquant}
Behrouz, A., et~al.
\newblock TurboQuant: Online vector quantization for KV cache.
\newblock In \emph{ICLR}, 2026.
|
|
\bibitem[{Jin et~al.}(2025)]{ragcache}
| Jin, C., et~al. |
| \newblock RAGCache: Efficient knowledge caching for retrieval-augmented |
| generation. |
| \newblock \emph{ACM TOCS}, 2025. |
|
|
| \end{thebibliography} |
|
|
| |
| |
| |
| \appendix |
|
|
| \section{Geodesic Retrieval Pseudocode} |
| \label{app:pseudocode} |
|
|
| \begin{algorithm}[H] |
\caption{Geodesic Retrieval (stages 0--4)}
| \label{alg:geodesic} |
| \begin{algorithmic}[1] |
| \Require Query fingerprint $\mathbf{q}$, HNSW index $\mathcal{I}$, IndexC $\mathcal{C}$ |
| \Ensure Retrieved document ID, confidence level |
|
|
| \State \textbf{Stage 0: Prior Preemption} |
| \If{$\mathcal{C}.\text{is\_chronic\_failure}(\mathbf{q})$} |
| \State \Return $\bot$, LOW |
| \EndIf |
|
|
| \State \textbf{Stage 1: HNSW Search} |
| \State $\{(d_1, s_1), \ldots, (d_k, s_k)\} \gets \mathcal{I}.\text{search}(\mathbf{q}, k)$ |
| \State $\text{margin} \gets s_1 - s_2$ |
| \If{$\text{margin} > \tau_\text{high}$} |
| \State \Return $d_1$, HIGH |
| \ElsIf{$\text{margin} > \tau_\text{med}$} |
| \State \Return $d_1$, MEDIUM |
| \EndIf |
|
|
| \State \textbf{Stage 2: Trajectory Correction} |
| \State $\mathbf{q}' \gets (1-w)\mathbf{q} + w\,\mathbf{fp}_{d_1}$ |
| \State Re-search with $\mathbf{q}'$ |
|
|
| \State \textbf{Stage 3: Negative Constraints} |
| \State Exclude known-incorrect candidates from $\mathcal{C}$ |
|
|
| \State \textbf{Stage 4: Metadata Disambiguation} |
| \State Score by domain overlap, keyword match, norm similarity |
| \State \Return best candidate, LOW |
| \end{algorithmic} |
| \end{algorithm} |
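The staged fallback above can be sketched in runnable form as follows. This is an illustrative reduction, not the production implementation: the thresholds, the blend weight $w$, and the brute-force cosine search standing in for the HNSW index are all placeholder assumptions, and Stages 0 and 3--4 are elided to their interfaces.

```python
import math

def cosine(a, b):
    """Cosine similarity between two fingerprint vectors."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den if den else 0.0

def retrieve(q, index, tau_high=0.15, tau_med=0.05, w=0.3, excluded=()):
    """Margin-gated retrieval with trajectory correction.

    `index` maps doc_id -> fingerprint. Thresholds and `w` are
    illustrative, not the paper's tuned values; `excluded` stands in
    for the negative constraints of Stage 3.
    """
    def search(vec):
        # Brute-force stand-in for the HNSW search (Stage 1).
        return sorted(
            ((cosine(vec, fp), d) for d, fp in index.items()
             if d not in excluded),
            reverse=True,
        )

    ranked = search(q)
    (s1, d1), (s2, _) = ranked[0], ranked[1]
    margin = s1 - s2
    if margin > tau_high:
        return d1, "HIGH"
    if margin > tau_med:
        return d1, "MEDIUM"

    # Stage 2: trajectory correction -- blend the query toward the
    # top hit's stored fingerprint and re-search.
    q_prime = [(1 - w) * qi + w * fi for qi, fi in zip(q, index[d1])]
    ranked = search(q_prime)

    # Stage 4 metadata scoring is elided; fall back to the corrected
    # top hit at LOW confidence.
    return ranked[0][1], "LOW"
```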
|
|
| \section{EIGENGRAM Format Specification} |
| \label{app:eigengram} |
|
|
| \begin{table}[H] |
| \centering |
| \caption{EIGENGRAM v1.2 binary layout.} |
| \small |
| \begin{tabular}{lcl} |
| \toprule |
| Field & Bytes & Description \\ |
| \midrule |
| Magic & 4 & \texttt{0x454E4752} (``ENGR'') \\ |
| Version & 2 & Major.Minor (1.2) \\ |
| Arch ID & 2 & Architecture enum \\ |
| Layers & 2 & Number of layers \\ |
| Head dim & 2 & Per-head dimension \\ |
FP vector & $2 \times d \times 2$ & $f_0{+}f_1$ coefficients (float16, $2d$ values) \\
| Metadata & variable & JSON (model, timestamp, \ldots) \\ |
| \bottomrule |
| \end{tabular} |
| \end{table} |
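A minimal pack/unpack sketch of the layout above, assuming big-endian byte order, a one-byte major plus one-byte minor version split, and a length-prefixed JSON metadata field (none of which the table fixes; they are illustrative choices):

```python
import json
import struct

MAGIC = 0x454E4752  # "ENGR"

def pack_eng(arch_id, layers, head_dim, fp, meta):
    """Serialize an illustrative EIGENGRAM v1.2 record.

    `fp` is the f0+f1 fingerprint concatenation (2d values), stored
    as float16 via struct's 'e' format.
    """
    head = struct.pack(">IBBHHH", MAGIC, 1, 2, arch_id, layers, head_dim)
    body = struct.pack(f">{len(fp)}e", *fp)
    meta_bytes = json.dumps(meta).encode()
    return head + body + struct.pack(">I", len(meta_bytes)) + meta_bytes

def unpack_eng(blob, d):
    """Parse a record produced by pack_eng, given the head dimension d."""
    magic, major, minor, arch_id, layers, head_dim = \
        struct.unpack_from(">IBBHHH", blob, 0)
    assert magic == MAGIC, "not an EIGENGRAM blob"
    off = 12
    fp = struct.unpack_from(f">{2 * d}e", blob, off)
    off += 2 * d * 2  # 2d float16 values, 2 bytes each
    (meta_len,) = struct.unpack_from(">I", blob, off)
    meta = json.loads(blob[off + 4 : off + 4 + meta_len])
    return {"version": (major, minor), "arch": arch_id, "layers": layers,
            "head_dim": head_dim, "fp": list(fp), "meta": meta}
```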
|
|
| \section{Supported Architectures} |
| \label{app:architectures} |
|
|
| \begin{table}[H] |
| \centering |
| \caption{Multi-architecture support in \engram{}.} |
| \small |
| \begin{tabular}{lcccc} |
| \toprule |
| Architecture & Layers & KV Heads & Head Dim & Attention \\ |
| \midrule |
| Llama 3.2 3B & 28 & 8 & 128 & GQA \\ |
| Llama 3.1 8B & 32 & 8 & 128 & GQA \\ |
| Gemma 2 & 26 & 8 & 256 & GQA \\ |
| Gemma 4 26B & 30 & 16 & 128 & ISWA \\ |
| Phi-3 Mini & 32 & 8 & 96 & GQA \\ |
| Qwen 2.5 7B & 28 & 4 & 128 & GQA \\ |
| Mistral 7B & 32 & 8 & 128 & GQA \\ |
| \bottomrule |
| \end{tabular} |
| \end{table} |
|
|
| \section{Compass Artifact: Genesis of ENGRAM} |
| \label{app:genesis} |
|
|
| This work originated from a systematic deep-research analysis of the KV |
| cache management landscape, conducted via Perplexity Pro deploying 7 |
| sub-agents across 686 sources in 14 minutes. The analysis assessed seven |
| critical research targets: |
|
|
| \begin{enumerate}[leftmargin=*,itemsep=1pt] |
| \item[\textbf{T1.}] \textbf{KV tensor extraction:} No public API |
| exposes structured KV tensors from llama.cpp or Ollama. \engram{} |
| built a blob parser and multi-architecture registry. |
|
|
| \item[\textbf{T2.}] \textbf{FAISS retrieval:} Works for K$\to$K |
| similarity, fails catastrophically for Q$\to$K. \engram{} uses |
| K$\to$K cosine similarity via Fourier fingerprints. |
|
|
| \item[\textbf{T3.}] \textbf{Pre-RoPE keys:} ShadowKV (ICML\,2025) |
| validates that pre-RoPE keys have the sharpest SVD decay. \engram{} |
| extracts pre-RoPE keys in the 8--24 layer band. |
|
|
| \item[\textbf{T4.}] \textbf{Quantization:} QJL hurts in practice |
| (6+ independent confirmations). \engram{} uses INT8 per-row symmetric |
| quantization. |
|
|
| \item[\textbf{T5.}] \textbf{Competitive landscape:} No existing system |
| combines persistent storage, semantic retrieval, cross-model transfer, |
| and agent-native APIs. \emph{This is the gap \engram{} fills.} |
|
|
| \item[\textbf{T6.}] \textbf{TTFT benchmarks:} Target was $>$10$\times$ |
| at 16K context. \engram{} achieved 30--67$\times$ across configurations. |
|
|
| \item[\textbf{T7.}] \textbf{Serialization:} Safetensors is converging |
| as the ecosystem standard. \engram{} designed a custom format |
| (\texttt{.eng}\,v1.2) optimized for $<$800\,byte document certificates. |
| \end{enumerate} |
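The INT8 scheme named in T4 can be sketched as follows: each row keeps a single floating-point scale $s = \max_i |x_i| / 127$, and values are rounded to integers in $[-127, 127]$. The function names are illustrative, not from the \engram{} codebase.

```python
def quantize_int8_rows(rows):
    """Per-row symmetric INT8 quantization.

    Returns a list of (scale, int8_values) pairs; symmetric means
    no zero-point, so dequantization is a single multiply.
    """
    out = []
    for row in rows:
        amax = max(abs(x) for x in row) or 1.0  # avoid divide-by-zero
        scale = amax / 127.0
        q = [max(-127, min(127, round(x / scale))) for x in row]
        out.append((scale, q))
    return out

def dequantize_rows(packed):
    """Invert quantize_int8_rows up to rounding error (at most scale/2)."""
    return [[scale * v for v in q] for scale, q in packed]
```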
|
|
| The compass artifact (ID: \texttt{wf-790728d4}) was produced after |
| reading the TurboQuant paper from Google Research (ICLR\,2026). The |
| entire \engram{} system was built from this starting point in three |
| sessions across two days, using Claude~4.6 Sonnet (Thinking) and |
| Claude Code Opus~4.6 at maximum effort. |
|
|
| \vspace{1em} |
| \noindent\rule{\columnwidth}{0.4pt} |
| \begin{center} |
| \small\textit{220 tests passing. 6,181 knowledge vectors indexed.\\ |
| The protocol proves its own paper existed.\\ |
| --- Enigma, April 2026} |
| \end{center} |
|
|
| \end{document} |
|
|