\documentclass[11pt,a4paper]{article}
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\usepackage{amsmath,amssymb}
\usepackage{booktabs}
\usepackage{graphicx}
\usepackage{hyperref}
\usepackage{listings}
\usepackage{xcolor}
\usepackage[margin=1in]{geometry}
\usepackage{caption}
\usepackage{subcaption}
\usepackage{enumitem}
\usepackage{authblk}
\usepackage{multicol}
\usepackage{float}

\definecolor{codegreen}{rgb}{0,0.6,0}
\definecolor{codegray}{rgb}{0.5,0.5,0.5}
\definecolor{codepurple}{rgb}{0.58,0,0.82}
\definecolor{backcolour}{rgb}{0.95,0.95,0.95}

\lstdefinestyle{mystyle}{
    backgroundcolor=\color{backcolour},
    commentstyle=\color{codegreen},
    keywordstyle=\color{codepurple},
    numberstyle=\tiny\color{codegray},
    stringstyle=\color{codegreen},
    basicstyle=\ttfamily\small,
    breakatwhitespace=false,
    breaklines=true,
    captionpos=b,
    keepspaces=true,
    numbers=left,
    numbersep=5pt,
    showspaces=false,
    showstringspaces=false,
    showtabs=false,
    tabsize=2,
    frame=single
}
\lstset{style=mystyle}
\title{JIT LoRA: Real-Time Conversational Knowledge Injection\\on Apple Silicon via MLX}

\author[1]{E. Elbaz}
\affil[1]{Independent Research}

\date{March 2026}

\begin{document}

\maketitle

\begin{abstract}
We present a system for just-in-time (JIT) LoRA training that modifies a running language model's weights mid-conversation on consumer Apple Silicon hardware. Using MLX-native autograd~\cite{mlx2023} for gradient-based LoRA~\cite{hu2021lora} adaptation, the system---J.A.R.V.I.S., a voice-enabled AI assistant---updates its own weights after every response via background backpropagation. We validate on three evaluation tracks: (1)~a controlled fictional-fact experiment achieving 4/4 recall ($n=4$), (2)~a cross-domain scaling test with 41 interlocked facts achieving 69\% direct recall with 50\% multi-hop reasoning, and (3)~a statistically rigorous evaluation against \textbf{35 real-world facts} the model verifiably did not know, yielding \textbf{58.1\% recall} (95\% Wilson CI: [48.5\%, 67.1\%], $n=105$ pooled across 3 independent trials) with \textbf{100\% general knowledge preservation} (CI: [94.0\%, 100.0\%], $n=60$). Training completes in 70 seconds for 35 facts on a 2B-parameter model. Per-category analysis reveals strong performance on structurally distinctive facts (Sports 88.9\%, Awards 85.7\%, Weather 80.0\%) with systematic failure on structurally homogeneous facts (Deaths 18.2\%), establishing both the capabilities and limits of JIT LoRA on small models.
\end{abstract}

\section{Introduction}
Can a language model update its own weights \emph{while you're still reading its reply}? We investigate whether real-time LoRA weight updates during conversation can achieve reliable fact recall on consumer Apple Silicon hardware, without catastrophic forgetting~\cite{mccloskey1989catastrophic} of existing knowledge.

The initial approach used Apple's Neural Engine (ANE) directly---reverse-engineering the private \texttt{AppleNeuralEngine.framework} via the open-source ANE bridge~\cite{ane_bridge}. The idea: compile LoRA forward and backward kernels into MIL (Machine Learning Intermediate Language) programs, execute them on the ANE via IOSurface-backed tensors, and run adapter training on dedicated hardware while the GPU handles base model inference.

The ANE path produced working forward kernels (\texttt{ane\_mil\_lora.py} compiles 4 kernels per adapter: \texttt{lora\_down}, \texttt{lora\_up}, \texttt{grad\_b}, \texttt{grad\_a}), but hit a fundamental wall: ANE kernels produce NumPy arrays via IOSurface---opaque to any autograd system. For real gradient-based training, the entire computation graph must be differentiable.

The solution: MLX~\cite{mlx2023}. Apple's array framework provides native autograd (\texttt{nn.value\_and\_grad}) that runs on Apple Silicon's unified memory. The base model runs on GPU, LoRA~\cite{hu2021lora} adapters inject differentiable rank-decomposition layers, and \texttt{optim.Adam} updates weights through real backpropagation. The ANE kernels remain in the codebase for a future hybrid inference path (Section~\ref{sec:future}), but the training loop is pure MLX.

\section{Related Work}

\paragraph{LoRA and parameter-efficient fine-tuning.} LoRA~\cite{hu2021lora} injects trainable low-rank matrices into frozen pretrained weights, reducing trainable parameters by orders of magnitude. QLoRA~\cite{dettmers2023qlora} extends this to quantized models. Both target offline fine-tuning on large datasets over thousands of steps; our work applies LoRA in a real-time, few-shot regime (48--220 steps) during live conversation.

\paragraph{Catastrophic forgetting and continual learning.} Neural networks famously overwrite prior knowledge when trained on new data~\cite{mccloskey1989catastrophic}. Elastic Weight Consolidation~\cite{kirkpatrick2017overcoming} penalizes changes to important weights; experience replay~\cite{rolnick2019experience} interleaves old data during training. We adopt experience replay: $\geq$33\% of each training batch consists of general-knowledge Q\&A pairs, which we find sufficient to eliminate catastrophic forgetting entirely (Section~\ref{sec:ablation-reg}).

\paragraph{On-device and edge training.} MLX~\cite{mlx2023} provides a NumPy-like API with automatic differentiation on Apple Silicon's unified memory architecture. While most on-device ML work focuses on inference (quantization, pruning), we use MLX for full gradient-based training at interactive speeds.

\paragraph{Retrieval-augmented generation.} RAG systems inject knowledge at inference time by prepending retrieved documents to the prompt. JIT LoRA offers a complementary approach: modifying weights directly, which avoids context window limitations but requires a training step. The two approaches are not mutually exclusive.

\paragraph{Hybrid architectures.} Qwen3.5 models use Gated Delta Networks (GDN)~\cite{yang2024gated}, which evolved from Mamba's~\cite{gu2023mamba} selective state space design. These layers use Metal-accelerated kernels for inference that lack autograd support, requiring careful mode switching during training (Section~\ref{sec:hybrid}).

\section{The System}

J.A.R.V.I.S. is a full-stack AI assistant: React frontend with a sci-fi voice interface, Express backend for API routing, and a Python FastAPI daemon for MLX inference and training (Figure~\ref{fig:interface}).

\paragraph{Hardware.} All experiments run on a MacBook Pro with Apple M4 Max (128GB unified memory). The 2B model (Qwen3.5-2B-Base) occupies approximately 4GB in bfloat16.

\begin{figure}[H]
\centering
\includegraphics[width=0.85\textwidth]{figures/jarvis-interface.png}
\caption{J.A.R.V.I.S. main interface. The orb visualizer responds to audio; the System Logs panel (bottom-right) shows the conversation flow routed through the MLX backend.}
\label{fig:interface}
\end{figure}

The training loop fires after each conversation turn:

\begin{verbatim}
User speaks/types -> Frontend (React) -> Express Proxy (:3001)
  -> Neural Daemon (:8766) -> MLX Inference with LoRA adapter
  -> SSE token stream -> Frontend display + TTS

[After response completes] Response text -> Training Data Manager
  -> LoRA backprop (Adam + cosine LR) -> Adapter weights updated
  -> Next inference uses updated knowledge
\end{verbatim}

The daemon alternates inference and training through a single GPU lock (\texttt{threading.Lock}). After each response, the \texttt{auto\_train} system queues a background training cycle. The next query uses the updated adapter---no model reload, no restart. Training and inference do not run simultaneously; the GPU lock serializes access.
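The serialization pattern is a plain mutual-exclusion lock around every GPU operation. The following is an illustrative sketch, not the daemon's actual code; all names (\texttt{generate\_response}, \texttt{background\_train}) are ours:

```python
import threading

gpu_lock = threading.Lock()  # a single lock serializes all GPU access

def generate_response(prompt):
    # Inference holds the lock for the duration of decoding.
    with gpu_lock:
        return f"response to: {prompt}"

def background_train(pairs, log):
    # The auto-train cycle fires after the response completes;
    # it blocks here until inference has released the GPU.
    with gpu_lock:
        log.append(f"trained on {len(pairs)} pairs")

log = []
reply = generate_response("What is my neighbor's cat named?")
worker = threading.Thread(target=background_train, args=([("q", "a")], log))
worker.start()
worker.join()
```

Because the training thread cannot acquire the lock until decoding finishes, the user never observes a stalled response; the adapter update simply lands before the next query.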

\subsection{LoRA Architecture}

Rank-32 LoRA~\cite{hu2021lora} adapters inject into four projection matrices per layer:
\begin{equation}
y = x W_{\text{base}} + (x A B) \cdot \frac{\alpha}{r}, \quad A \in \mathbb{R}^{d \times 32}, \; B \in \mathbb{R}^{32 \times d}
\end{equation}
with $B$ initialized to zeros (model behavior unchanged until training begins). Targets: $W_q, W_v, W_{\text{out}}, W_{\text{down}}$ across all 24 layers, yielding 10.3M trainable parameters (0.54\% of 1.9B total).
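The zero-initialization property can be checked directly: with $B=0$ the adapter branch contributes nothing, so the adapted forward pass equals the base forward pass exactly. A minimal NumPy sketch (dimensions as in the equation above; all names ours):

```python
import numpy as np

d, r, alpha = 16, 32, 32  # hidden size, LoRA rank, scaling (alpha/r = 1.0)
rng = np.random.default_rng(0)

W_base = rng.normal(size=(d, d))          # frozen pretrained weight
A = rng.normal(scale=0.01, size=(d, r))   # LoRA down-projection
B = np.zeros((r, d))                      # zero-init: adapter starts as a no-op

def lora_forward(x):
    # y = x W_base + (x A B) * alpha / r
    return x @ W_base + (x @ A @ B) * (alpha / r)

x = rng.normal(size=(1, d))
# Before any training step, the adapted output matches the base output.
assert np.allclose(lora_forward(x), x @ W_base)
```

Once training updates $B$ away from zero, the same forward pass begins to diverge from the base model, which is exactly the injection mechanism.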

\subsection{Hybrid Architecture: Gated Delta Network Layers}
\label{sec:hybrid}

Qwen3.5 models use Gated Delta Networks (GDN)~\cite{yang2024gated} for linear attention layers, with Metal-accelerated kernels that lack VJP (vector-Jacobian product) support. The key insight from the \texttt{mlx-lm} source:

\begin{lstlisting}[language=Python, numbers=none]
# qwen3_5.py line 181: use_kernel = not self.training
# model.train() -> pure MLX ops (differentiable, for backprop)
# model.eval() -> Metal kernels (fast, for inference)
\end{lstlisting}

We hoist mode switching to cycle boundaries---\texttt{model.train()} once before the training loop, \texttt{model.eval()} once after---rather than per-step.

\section{Experiment 1: Controlled Validation (Fictional Facts)}

We first validate the system on 4 completely fictional facts with zero overlap to any pretraining data:

\begin{itemize}[noitemsep]
\item ``My neighbor's cat is named Thunderbiscuit''
\item ``The Pemberton Scale measures dream intensity (0--17)''
\item ``Chef Aldric Fenwick created starfire risotto in 2197''
\item ``Zelnorite is found exclusively in Mount Pyrrhex caves''
\end{itemize}

Each fact is represented by 2--3 phrasing variants in the training set, plus 3 general-knowledge regularization pairs, for 12 training pairs total.

\begin{table}[H]
\centering
\caption{Experiment 1: 4 novel fictional facts, 12 training pairs (9 novel phrasings + 3 regularization). Single run, $n=4$.}
\label{tab:exp1}
\begin{tabular}{lcc}
\toprule
\textbf{Metric} & \textbf{Baseline} & \textbf{Post-Training} \\
\midrule
Direct Recall (4 questions) & 0/4 (0\%) & 4/4 (100\%) \\
Generalization (4 rephrased) & 0/4 (0\%) & 4/4 (100\%) \\
General Knowledge (3 real facts) & 3/3 (100\%) & 3/3 (100\%) \\
\midrule
Training steps & --- & 48 (4 epochs $\times$ 12 examples) \\
Training time & --- & 20.2 seconds \\
Loss & --- & 2.83 $\rightarrow$ 0.14 \\
\bottomrule
\end{tabular}
\end{table}

\textbf{Caveat:} With $n=4$, this experiment establishes feasibility but is not statistically meaningful. The Wilson 95\% CI for 4/4 recall is [51.0\%, 100\%]. Experiment~3 (Section~\ref{sec:stat}) addresses this limitation with larger $n$.

\begin{figure}[H]
\centering
\includegraphics[width=0.85\textwidth]{figures/jarvis-post-training.png}
\caption{J.A.R.V.I.S. recalling a novel fact after JIT LoRA training. After 28 training steps (loss: 0.08), the model correctly answers ``What is my neighbor's cat named?'' with ``Thunderbiscuit''---a fact it hallucinated (``Whiskers'') before training.}
\label{fig:recall}
\end{figure}

\section{Experiment 2: Cross-Domain Scaling (41 Fictional Facts)}

We scale to 41 facts across 10 interlocked fictional domains with deliberate cross-references (e.g., a mineral used to power engines, refined from another mineral, mined on a specific mountain, on an island governed by a fictional sovereignty).

\begin{table}[H]
\centering
\caption{Experiment 2: 41 novel facts, 10 domains, 62 training pairs (41 novel + 21 regularization). Single run.}
\label{tab:exp2}
\begin{tabular}{lcc}
\toprule
\textbf{Category} & \textbf{Score} & \textbf{Notes} \\
\midrule
Direct Recall (16) & 11/16 (69\%) & Core facts reliably absorbed \\
Generalization (16) & 9/16 (56\%) & Rephrased questions work \\
Cross-Domain Reasoning (8) & 4/8 (50\%) & Multi-hop chains on a 2B model \\
Negation/Boundary (5) & 5/5 (100\%) & Correctly denies false premises \\
General Knowledge (10) & 10/10 (100\%) & Knowledge preserved \\
\midrule
Training steps & \multicolumn{2}{c}{220 (early stopping at $\sim$3.5 epochs)} \\
Training time & \multicolumn{2}{c}{121 seconds} \\
Loss & \multicolumn{2}{c}{2.97 $\rightarrow$ 0.69} \\
\bottomrule
\end{tabular}
\end{table}

The 62 training pairs yield 62 steps per epoch; early stopping triggered at approximately 3.5 effective epochs (220 total steps). Each training step takes $\sim$390ms on the M4 Max with the 2B model, which is memory-bandwidth-limited: the entire model ($\sim$4GB) must be read for each forward and backward pass.

\section{Experiment 3: Statistical Validation (Real-World Facts)}
\label{sec:stat}

Experiments 1--2 use fictional facts, which guarantees the model has no prior knowledge but limits sample size. To produce statistically meaningful results, we evaluate against \textbf{real-world events from 2025--2026}---facts that post-date the model's training cutoff (verified per-fact against the base model before training).

\subsection{Methodology}

\begin{enumerate}[noitemsep]
\item \textbf{Fact sourcing:} 122 facts collected from web search across 8 categories (Sports, Deaths/Obituaries, Awards, Entertainment, Science, Technology/Business, Political Events, Weather/Natural Events). Each fact has a question, canonical answer, and 2--3 verification keywords.
\item \textbf{Sampling:} 50 facts are sampled proportionally across categories (to keep training time under 2 minutes). Political Events facts were excluded from the final evaluation because all sampled instances were already known to the base model.
\item \textbf{Baseline pre-test:} Each fact is queried against the unmodified base model. A fact is ``confirmed unknown'' if the model's response matches $<$2 of its verification keywords. Facts the model already knows are excluded from training and evaluation.
\item \textbf{Training:} Confirmed-unknown facts are converted to training pairs. $\geq$33\% regularization pairs (general-knowledge Q\&A) are added. Training runs for at most 15 epochs with early stopping (loss $<$ 0.8 for 2 consecutive epochs).
\item \textbf{Post-test:} Each trained fact is queried again. General knowledge questions (20 standard questions, e.g., ``What is the capital of France?'') are tested for preservation.
\item \textbf{Trials:} The full pipeline (reset $\rightarrow$ train $\rightarrow$ evaluate) runs 3 independent times with shuffled fact ordering. Results are pooled for confidence interval computation.
\item \textbf{Auto-train disabled during evaluation:} The daemon's auto-train feature (which normally fires after each response) is disabled during pre-testing and post-testing to prevent evaluation contamination.
\end{enumerate}
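The baseline pre-test in step 3 reduces to a keyword count against the model's answer. A minimal sketch of that check (the helper name and example strings are ours; the $<$2 threshold is from the protocol above):

```python
def confirmed_unknown(response, keywords, threshold=2):
    """A fact is 'confirmed unknown' when the baseline response matches
    fewer than `threshold` of its verification keywords."""
    hits = sum(1 for kw in keywords if kw.lower() in response.lower())
    return hits < threshold

# Baseline hallucination matches 0 of 2 keywords -> fact enters training.
assert confirmed_unknown("He passed away in Los Angeles last spring.",
                         ["december 5", "2025"])
# Correct answer matches both keywords -> fact is excluded.
assert not confirmed_unknown("Frank Gehry died on December 5, 2025.",
                             ["december 5", "2025"])
```

The same check, run after training, scores the post-test; this is what makes the Deaths failure mode visible, since a right-category-wrong-date answer fails the keyword match.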

\subsection{Results}

From 50 candidate facts, 35 were confirmed unknown (15 already in the model's knowledge). Three independent trials with shuffled ordering produced the results in Table~\ref{tab:exp3}.

\begin{table}[H]
\centering
\caption{Experiment 3: 35 real-world facts, 52 training pairs (35 novel + 17 regularization), 3 trials. Qwen3.5-2B-Base on M4 Max.}
\label{tab:exp3}
\begin{tabular}{lccc}
\toprule
\textbf{Metric} & \textbf{Pooled} & \textbf{Per-Trial} & \textbf{95\% Wilson CI} \\
\midrule
\textbf{Recall} & 61/105 (58.1\%) & 65.7\%, 54.3\%, 54.3\% & [48.5\%, 67.1\%] \\
\textbf{General Knowledge} & 60/60 (100.0\%) & 100\%, 100\%, 100\% & [94.0\%, 100.0\%] \\
\midrule
Training time & \multicolumn{3}{c}{69.6s $\pm$ 1.2s (180 steps)} \\
Loss (mean $\pm$ sd) & \multicolumn{3}{c}{1.78 $\pm$ 0.43 $\rightarrow$ 0.36 $\pm$ 0.10} \\
Per-step time & \multicolumn{3}{c}{$\sim$390ms} \\
\bottomrule
\end{tabular}
\end{table}
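The intervals in Table~\ref{tab:exp3} follow from the standard Wilson score formula, which stays sensible at proportions near 0 or 1 (e.g., the 60/60 preservation result). A short self-contained check (function name ours):

```python
import math

def wilson_ci(k, n, z=1.96):
    """95% Wilson score interval for k successes in n Bernoulli trials."""
    p = k / n
    denom = 1 + z**2 / n
    center = p + z**2 / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (center - margin) / denom, (center + margin) / denom

lo, hi = wilson_ci(61, 105)  # pooled recall
print(round(lo * 100, 1), round(hi * 100, 1))  # 48.5 67.1
lo, hi = wilson_ci(60, 60)   # general-knowledge preservation
print(round(lo * 100, 1), round(hi * 100, 1))  # 94.0 100.0
```

Note that a naive normal-approximation interval for 60/60 would collapse to [100\%, 100\%]; the Wilson lower bound of 94.0\% is the honest statement of what three 20-question probes can establish.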

\subsection{Per-Category Analysis}

Recall varies dramatically by fact category (Table~\ref{tab:categories}), revealing a systematic pattern in what small models learn well vs.\ poorly via JIT LoRA:

\begin{table}[H]
\centering
\caption{Per-category recall pooled across 3 trials. Seven categories had confirmed-unknown facts; Political Events was excluded (all sampled facts were already known to the model).}
\label{tab:categories}
\begin{tabular}{lcccl}
\toprule
\textbf{Category} & \textbf{Correct} & \textbf{Total} & \textbf{Rate} & \textbf{95\% CI} \\
\midrule
Science & 3 & 3 & 100.0\% & [43.8\%, 100.0\%] \\
Sports & 16 & 18 & 88.9\% & [67.2\%, 96.9\%] \\
Awards & 18 & 21 & 85.7\% & [65.4\%, 95.0\%] \\
Weather/Natural Events & 12 & 15 & 80.0\% & [54.8\%, 93.0\%] \\
Technology/Business & 2 & 3 & 66.7\% & [20.8\%, 93.9\%] \\
Entertainment & 4 & 12 & 33.3\% & [13.8\%, 60.9\%] \\
Deaths/Obituaries & 6 & 33 & 18.2\% & [8.6\%, 34.4\%] \\
\midrule
\textbf{Excl.\ Deaths} & \textbf{55} & \textbf{72} & \textbf{76.4\%} & [65.4\%, 84.8\%] \\
\bottomrule
\end{tabular}
\end{table}

\subsection{Failure Analysis: Why Deaths Fail}

The Deaths/Obituaries category (18.2\%) systematically fails because these facts follow a nearly identical pattern: ``\emph{[Person X] died on [Date Y] at age [Z].}'' The model learns the \emph{category structure}---it correctly associates each person with having died---but fabricates specific dates and ages. Example:

\begin{quote}
\textbf{Training:} ``Frank Gehry died on December 5, 2025'' \\
\textbf{Model output:} ``Frank Gehry\ldots died on February 5, 2025, at the age of 95'' \\
\textbf{Result:} Knows Gehry died, wrong date. Fails keyword check on ``december 5 2025''.
\end{quote}

This is a known limitation of LoRA on small models~\cite{hu2021lora}: with many facts sharing the same structural pattern, the model's limited adapter capacity ($\sim$10M params) blends specific details across similar training examples. Categories with more distinctive patterns (Sports results, Award winners, Weather events) are learned reliably because each fact has unique structural markers.

\section{Ablation Studies}

Every parameter was tested empirically. Two parameters dominate; the rest have minimal effect.

\subsection{Learning Rate: The Decisive Factor}

\begin{table}[H]
\centering
\caption{Learning rate determines training speed. Per-step time is constant ($\sim$390ms) for the 2B model on M4 Max.}
\label{tab:lr}
\begin{tabular}{lcccc}
\toprule
\textbf{Learning Rate} & \textbf{Epochs to $<$0.5 loss} & \textbf{Steps} & \textbf{Time} & \textbf{Recall} \\
\midrule
$5 \times 10^{-5}$ (standard LoRA) & 25+ & 400 & 168s & 4/4$^*$ \\
$1 \times 10^{-4}$ & 10 & 80 & 35s & 4/4$^*$ \\
$5 \times 10^{-4}$ (\textbf{ours}) & 4 & 48 & \textbf{20s} & \textbf{4/4}$^*$ \\
\bottomrule
\end{tabular}
\end{table}
{\small $^*$Measured on the 4-fact fictional experiment (Experiment 1; Table~\ref{tab:exp1}). Statistical validation (Table~\ref{tab:exp3}) uses the $5 \times 10^{-4}$ rate.}

The speedup comes entirely from faster convergence, not faster steps. Standard LoRA uses $10^{-4}$ to $5 \times 10^{-5}$ because it trains for thousands of steps on large datasets~\cite{hu2021lora}. JIT learning needs convergence in single-digit epochs. Gradient clipping (norm 1.0) prevents instability at this aggressive rate. A cosine schedule decays the rate toward $5 \times 10^{-5}$ by the end of the run.
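The cosine decay from the peak rate to the floor rate can be written in a few lines; this is a sketch consistent with the values quoted in the text (peak $5 \times 10^{-4}$, floor $5 \times 10^{-5}$), with the function name ours:

```python
import math

def cosine_lr(step, total_steps, lr_max=5e-4, lr_min=5e-5):
    """Cosine decay from lr_max at step 0 to lr_min at the final step."""
    t = min(step / max(total_steps, 1), 1.0)  # progress in [0, 1]
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t))

print(cosine_lr(0, 48))   # peak rate at the first step
print(cosine_lr(48, 48))  # floor rate at the last step
```

The high initial rate does most of the fact injection in the first epochs; the decayed tail polishes without overshooting, which is what makes the aggressive peak safe alongside gradient clipping.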

\subsection{Regularization Ratio: The Catastrophic Forgetting Threshold}
\label{sec:ablation-reg}

\begin{table}[H]
\centering
\caption{Regularization ratio vs.\ knowledge preservation (measured on Experiment 2). A threshold exists at $\sim$33\%.}
\label{tab:reg}
\begin{tabular}{cccc}
\toprule
\textbf{Reg.\ Ratio} & \textbf{Novel : Real-World} & \textbf{General Knowledge} & \textbf{Effect} \\
\midrule
$\sim$16\% & 41 : 8 & 3/8 (38\%) & Catastrophic forgetting \\
$\sim$34\% & 41 : 21 & 10/10 (100\%) & Preserved \\
$\sim$33\% & 35 : 17 & 20/20 (100\%)$^\dagger$ & Preserved (Experiment 3) \\
\bottomrule
\end{tabular}
\end{table}
{\small $^\dagger$60/60 across 3 trials (CI: [94.0\%, 100.0\%]).}

At $\sim$16\% regularization, the model overwrites core knowledge~\cite{mccloskey1989catastrophic}---``What is the capital of France?'' $\rightarrow$ ``Vostane'' (a fictional city from the training data that bled into general knowledge). At $\geq$33\%, real-world knowledge is preserved. This is a critical finding for production deployment: always include $\geq$33\% real-world Q\&A pairs in every training batch, consistent with experience replay findings in continual learning~\cite{rolnick2019experience}. Experiment~3 independently confirms this threshold.
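Operationally, the $\geq$33\% rule means topping up each training set with enough general-knowledge pairs before training starts. A sketch of that mixing step (the helper name and top-up formula are our construction; the 41:21 split matches Experiment 2):

```python
import math
import random

def build_training_set(novel_pairs, reg_pool, reg_ratio=0.33):
    """Append general-knowledge regularization pairs so they make up at
    least `reg_ratio` of the final training set."""
    # Solve r / (len(novel) + r) >= reg_ratio for the smallest integer r.
    n_reg = math.ceil(reg_ratio * len(novel_pairs) / (1 - reg_ratio))
    return novel_pairs + random.sample(reg_pool, n_reg)

novel = [("q", "a")] * 41
reg_pool = [("What is the capital of France?", "Paris")] * 30
batch = build_training_set(novel, reg_pool)
print(len(batch) - 41)  # 21 regularization pairs -> 21/62 of the set
```

Sampling the regularization pairs fresh each cycle (rather than reusing a fixed set) gives the replay buffer broader coverage of the knowledge being protected.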

\subsection{What Doesn't Help (and Why)}

\begin{table}[H]
\centering
\caption{Techniques that do NOT improve JIT training on Apple Silicon.}
\label{tab:nospeedup}
\begin{tabular}{lcl}
\toprule
\textbf{Technique} & \textbf{Effect} & \textbf{Why} \\
\midrule
\texttt{mx.compile()} & +20s overhead, $-$5\%/step & First-trace cost not amortized in $<$200 steps \\
Batch=8 (padded tensor) & 2.5s/step vs 0.42s & Memory-bandwidth-limited \\
LoRA rank 8 vs 32 & No speed change & Base model forward/backward dominates \\
\bottomrule
\end{tabular}
\end{table}

Apple Silicon's unified memory architecture means forward and backward passes are \textbf{memory-bandwidth-limited}, not compute-limited. Batching 8 examples into a single padded tensor takes 2.5s per step (vs 0.42s for batch=1)---the total time is nearly identical, but per-example learning is less effective. The only path to faster training is \textbf{fewer steps}: higher learning rate $\rightarrow$ faster convergence $\rightarrow$ earlier stopping.

\section{Where This Goes: Swarm Agent JIT Learning}
\label{sec:future}

\subsection{The Vision}

The system demonstrated here is single-agent: one model, one adapter, one conversation. The longer-term goal is a \textbf{cognitive swarm}---multiple specialized agents that learn different aspects of the same conversation and compose their knowledge at inference time.

\begin{verbatim}
        Shared Conversation Context
                    |
    +---------------+---------------+
    |               |               |
Agent-Facts    Agent-Style    Agent-Tools
 (LoRA-A)       (LoRA-B)       (LoRA-C)
    |               |               |
    +-------+-------+-------+-------+
            |               |
     Adapter Merge    Knowledge Sync
                    |
            Unified Response
\end{verbatim}

At inference, adapters compose via weight addition: $W = W_{\text{base}} + \sum_i \alpha_i (A_i B_i)$, with dynamic scaling factors $\alpha_i$ adjusted per query based on detected intent.
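Because each adapter is a low-rank product, the merge is a plain sum of matrices, and the merged weight reproduces the sum of the individual adapter outputs exactly. A NumPy sketch of this property (names and scaling values ours):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 16, 32

W_base = rng.normal(size=(d, d))
# Three adapters (A_i, B_i), e.g. facts / style / tools.
adapters = [(rng.normal(size=(d, r)), rng.normal(size=(r, d)))
            for _ in range(3)]
alphas = [1.0, 0.5, 0.25]  # per-query scaling, e.g. from intent detection

# Merge: W = W_base + sum_i alpha_i * (A_i @ B_i)
W_merged = W_base + sum(a * (A @ B) for a, (A, B) in zip(alphas, adapters))

x = rng.normal(size=(1, d))
# Sum of base output and individually scaled adapter outputs.
y_sum = x @ W_base + sum(a * (x @ A @ B)
                         for a, (A, B) in zip(alphas, adapters))
assert np.allclose(x @ W_merged, y_sum)
```

Linearity is what makes per-query re-weighting cheap: changing the $\alpha_i$ requires only re-summing small rank-$r$ products, not retraining anything.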

\subsection{ANE--GPU Parallelism for Multi-Agent Inference}

The ANE kernels compiled in \texttt{ane\_mil\_lora.py} represent an untapped compute path. While ANE cannot support autograd (IOSurface tensors are opaque to differentiation), it can accelerate LoRA forward passes during inference:

\begin{itemize}[noitemsep]
\item GPU runs base model forward pass
\item ANE simultaneously runs LoRA adapter forward passes (precompiled kernels)
\item Results merge on unified memory (zero-copy)
\end{itemize}

For multi-agent inference, this means running 3--4 adapter forward passes on ANE while the GPU handles the base model. The training loop remains on GPU (MLX autograd), but inference could benefit from the otherwise-idle Neural Engine. This path is speculative and has not been benchmarked.

\section{Reproducing This}

\textbf{Hardware:} Apple Silicon Mac (M-series). Tested on M4 Max, 128GB. Models $\leq$2B parameters should work on 16GB machines.

\begin{lstlisting}[language=bash, numbers=none]
pip install mlx mlx-lm fastapi uvicorn requests

# Self-test (downloads Qwen2.5-0.5B, trains 5 steps)
python3 src/mlx_lora_trainer.py

# Full E2E through daemon
python3 src/neural_daemon.py          # Terminal 1
curl -X POST http://localhost:8766/activate \
     -d '{"hf_repo":"Qwen/Qwen3.5-2B-Base"}'
python3 tests/test_daemon_e2e.py      # 4 facts, 20s
python3 tests/test_deep_e2e.py        # 41 facts, 121s
python3 tests/test_statistical_e2e.py # 35+ facts, 3 trials, ~4 min
\end{lstlisting}

Code available at: \url{https://github.com/eelbaz/jit-lora}

\section{Complete Configuration}

\begin{table}[H]
\centering
\caption{Optimized configuration for JIT LoRA training.}
\begin{tabular}{lrl}
\toprule
\textbf{Parameter} & \textbf{Value} & \textbf{Why} \\
\midrule
Learning rate & $5 \times 10^{-4}$ & 10$\times$ standard; converges in $\sim$4 epochs \\
LR schedule & Cosine $\rightarrow 5 \times 10^{-5}$ & Prevents late-epoch overshoot \\
Gradient clip & 1.0 & Stability at high LR \\
LoRA rank & 32 & Capacity for $\sim$35 facts per session \\
LoRA $\alpha$ & 32 & Scale = $\alpha/r$ = 1.0 \\
LoRA targets & q, v, out, down\_proj & Broad coverage (attention + MLP) \\
Max epochs & 15 & Upper bound; early stop fires sooner \\
Early stop threshold & 0.8 & Conservative \\
Early stop patience & 2 & Consecutive epochs below threshold \\
Min epochs & 3 & Don't stop before model has seen the data \\
Regularization ratio & $\geq$33\% & Below this: catastrophic forgetting \\
Optimizer & Adam & $\beta_1$=0.9, $\beta_2$=0.999 \\
\texttt{mx.compile()} & Off & 20s overhead not amortized \\
Batch size & 1 & Per-example steps; batching doesn't help \\
\bottomrule
\end{tabular}
\end{table}
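The early-stopping rows of the table (threshold 0.8, patience 2, minimum 3 epochs, maximum 15) combine into a single predicate evaluated on the per-epoch loss history. A sketch (function name ours):

```python
def should_stop(epoch_losses, threshold=0.8, patience=2, min_epochs=3):
    """Stop once the epoch loss has stayed below `threshold` for
    `patience` consecutive epochs, but never before `min_epochs`."""
    if len(epoch_losses) < min_epochs:
        return False
    recent = epoch_losses[-patience:]
    return len(recent) == patience and all(l < threshold for l in recent)

losses = [2.97, 1.4, 0.9, 0.75, 0.69]
print(should_stop(losses[:3]))  # False: loss still above the threshold
print(should_stop(losses))      # True: two consecutive epochs below 0.8
```

The caller still enforces the 15-epoch upper bound as a hard loop limit; the predicate only decides whether to stop early.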

\section{Conclusion}

A language model that updates its own weights mid-conversation runs on a MacBook in 70 seconds for 35 real-world facts, achieving 58.1\% recall with no measured degradation of general knowledge. The critical insights: use a 10$\times$ higher learning rate than standard LoRA~\cite{hu2021lora} (gradient clipping keeps it stable), include $\geq$33\% real-world data to prevent catastrophic forgetting~\cite{mccloskey1989catastrophic}, and don't bother with compilation or batching for short training runs on Apple Silicon.

The per-category analysis reveals that JIT LoRA on small models works well for facts with distinctive structural patterns (Sports, Awards, Weather, Science: 80--100\%) but struggles with structurally similar facts (Deaths: 18\%). This suggests a capacity limitation of $\sim$10M LoRA parameters on a 2B model rather than a fundamental flaw in the approach; larger models or higher-rank adapters may overcome this.

The system is end-to-end functional---J.A.R.V.I.S. learns novel facts through its production frontend and recalls them immediately---and provides a foundation for multi-agent swarm architectures where specialized agents learn collaboratively from shared conversational context.

\begin{figure}[H]
\centering
\includegraphics[width=0.85\textwidth]{figures/jarvis-general-knowledge.png}
\caption{General knowledge preservation after LoRA training. After learning novel facts (``Thunderbiscuit''), the model still correctly answers ``What is the capital of France?'' with ``Paris,'' demonstrating preserved general knowledge.}
\label{fig:general}
\end{figure}
|
| 437 |
-
\begin{thebibliography}{10}
|
| 438 |
-
|
| 439 |
-
\bibitem{hu2021lora}
|
| 440 |
-
E.~J. Hu et al.
|
| 441 |
-
\newblock LoRA: Low-rank adaptation of large language models.
|
| 442 |
-
\newblock {\em arXiv:2106.09685}, 2021.
|
| 443 |
-
|
| 444 |
-
\bibitem{mlx2023}
|
| 445 |
-
A.~Hannun et al.
|
| 446 |
-
\newblock MLX: An array framework for Apple Silicon.
|
| 447 |
-
\newblock {\em Apple ML Research}, 2023.
|
| 448 |
-
|
| 449 |
-
\bibitem{dettmers2023qlora}
|
| 450 |
-
T.~Dettmers et al.
|
| 451 |
-
\newblock QLoRA: Efficient finetuning of quantized language models.
|
| 452 |
-
\newblock {\em arXiv:2305.14314}, 2023.
|
| 453 |
-
|
| 454 |
-
\bibitem{mccloskey1989catastrophic}
|
| 455 |
-
M.~McCloskey and N.~J. Cohen.
|
| 456 |
-
\newblock Catastrophic interference in connectionist networks.
|
| 457 |
-
\newblock {\em Psychology of Learning and Motivation}, 24:109--165, 1989.
|
| 458 |
-
|
| 459 |
-
\bibitem{rolnick2019experience}
|
| 460 |
-
D.~Rolnick et al.
|
| 461 |
-
\newblock Experience replay for continual learning.
|
| 462 |
-
\newblock {\em NeurIPS}, 2019.
|
| 463 |
-
|
| 464 |
-
\bibitem{kirkpatrick2017overcoming}
|
| 465 |
-
J.~Kirkpatrick et al.
|
| 466 |
-
\newblock Overcoming catastrophic forgetting in neural networks.
|
| 467 |
-
\newblock {\em PNAS}, 114(13):3521--3526, 2017.
|
| 468 |
-
|
| 469 |
-
\bibitem{gu2023mamba}
|
| 470 |
-
A.~Gu and T.~Dao.
|
| 471 |
-
\newblock Mamba: Linear-time sequence modeling with selective state spaces.
|
| 472 |
-
\newblock {\em arXiv:2312.00752}, 2023.
|
| 473 |
-
|
| 474 |
-
\bibitem{yang2024gated}
|
| 475 |
-
S.~Yang et al.
|
| 476 |
-
\newblock Gated delta networks: Improving mamba2 with delta rule.
|
| 477 |
-
\newblock {\em arXiv:2412.06464}, 2024.
|
| 478 |
-
|
| 479 |
-
\bibitem{ane_bridge}
|
| 480 |
-
Maderix.
|
| 481 |
-
\newblock ANE: Apple Neural Engine reverse-engineering framework.
|
| 482 |
-
\newblock \url{https://github.com/maderix/ANE}, 2023.
|
| 483 |
-
|
| 484 |
-
\end{thebibliography}
\end{document}