\documentclass[11pt,a4paper]{article}
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\usepackage{amsmath,amssymb}
\usepackage{booktabs}
\usepackage{graphicx}
\usepackage{hyperref}
\usepackage{listings}
\usepackage{xcolor}
\usepackage[margin=1in]{geometry}
\usepackage{caption}
\usepackage{subcaption}
\usepackage{enumitem}
\usepackage{authblk}
\usepackage{multicol}
\usepackage{float}

\definecolor{codegreen}{rgb}{0,0.6,0}
\definecolor{codegray}{rgb}{0.5,0.5,0.5}
\definecolor{codepurple}{rgb}{0.58,0,0.82}
\definecolor{backcolour}{rgb}{0.95,0.95,0.95}

\lstdefinestyle{mystyle}{
    backgroundcolor=\color{backcolour},
    commentstyle=\color{codegreen},
    keywordstyle=\color{codepurple},
    numberstyle=\tiny\color{codegray},
    stringstyle=\color{codegreen},
    basicstyle=\ttfamily\small,
    breakatwhitespace=false,
    breaklines=true,
    captionpos=b,
    keepspaces=true,
    numbers=left,
    numbersep=5pt,
    showspaces=false,
    showstringspaces=false,
    showtabs=false,
    tabsize=2,
    frame=single
}
\lstset{style=mystyle}
\title{JIT LoRA: Real-Time Conversational Knowledge Injection\\on Apple Silicon via MLX}

\author[1]{E. Elbaz}
\affil[1]{Independent Research}

\date{March 2026}

\begin{document}

\maketitle

\begin{abstract}
We present a system for just-in-time (JIT) LoRA training that modifies a running language model's weights mid-conversation on consumer Apple Silicon hardware. Using MLX-native autograd~\cite{mlx2023} for gradient-based LoRA~\cite{hu2021lora} adaptation, the system---J.A.R.V.I.S., a voice-enabled AI assistant---updates its own weights after every response via background backpropagation. We validate on three evaluation tracks: (1)~a controlled fictional-fact experiment achieving 4/4 recall ($n=4$), (2)~a cross-domain scaling test with 41 interlocked facts achieving 69\% direct recall with 50\% multi-hop reasoning, and (3)~a statistically rigorous evaluation against \textbf{35 real-world facts} the model verifiably did not know, yielding \textbf{58.1\% recall} (95\% Wilson CI: [48.5\%, 67.1\%], $n=105$ pooled across 3 independent trials) with \textbf{100\% general knowledge preservation} (CI: [94.0\%, 100.0\%], $n=60$). Training completes in 70 seconds for 35 facts on a 2B-parameter model. Per-category analysis reveals strong performance on structurally distinctive facts (Sports 88.9\%, Awards 85.7\%, Weather 80.0\%) with systematic failure on structurally homogeneous facts (Deaths 18.2\%), establishing both the capabilities and limits of JIT LoRA on small models.
\end{abstract}
\section{Introduction}

Can a language model update its own weights \emph{while you're still reading its reply}? We investigate whether real-time LoRA weight updates during conversation can achieve reliable fact recall on consumer Apple Silicon hardware, without catastrophic forgetting~\cite{mccloskey1989catastrophic} of existing knowledge.

The initial approach used Apple's Neural Engine (ANE) directly---reverse-engineering the private \texttt{AppleNeuralEngine.framework} via the open-source ANE bridge~\cite{ane_bridge}. The idea: compile LoRA forward and backward kernels into MIL (Machine Learning Intermediate Language) programs, execute them on the ANE via IOSurface-backed tensors, and run adapter training on dedicated hardware while the GPU handles base model inference.

The ANE path produced working forward kernels (\texttt{ane\_mil\_lora.py} compiles 4 kernels per adapter: \texttt{lora\_down}, \texttt{lora\_up}, \texttt{grad\_b}, \texttt{grad\_a}), but hit a fundamental wall: ANE kernels produce NumPy arrays via IOSurface---opaque to any autograd system. For real gradient-based training, the entire computation graph must be differentiable.

The solution: MLX~\cite{mlx2023}. Apple's array framework provides native autograd (\texttt{nn.value\_and\_grad}) that runs on Apple Silicon's unified memory. The base model runs on GPU, LoRA~\cite{hu2021lora} adapters inject differentiable rank-decomposition layers, and \texttt{optim.Adam} updates weights through real backpropagation. The ANE kernels remain in the codebase for a future hybrid inference path (Section~\ref{sec:future}), but the training loop is pure MLX.

\section{Related Work}

\paragraph{LoRA and parameter-efficient fine-tuning.} LoRA~\cite{hu2021lora} injects trainable low-rank matrices into frozen pretrained weights, reducing trainable parameters by orders of magnitude. QLoRA~\cite{dettmers2023qlora} extends this to quantized models. Both target offline fine-tuning on large datasets over thousands of steps; our work applies LoRA in a real-time, few-shot regime (48--220 steps) during live conversation.

\paragraph{Catastrophic forgetting and continual learning.} Neural networks famously overwrite prior knowledge when trained on new data~\cite{mccloskey1989catastrophic}. Elastic Weight Consolidation~\cite{kirkpatrick2017overcoming} penalizes changes to important weights; experience replay~\cite{rolnick2019experience} interleaves old data during training. We adopt experience replay: $\geq$33\% of each training batch consists of general-knowledge Q\&A pairs, which we find sufficient to eliminate catastrophic forgetting entirely (Section~\ref{sec:ablation-reg}).

\paragraph{On-device and edge training.} MLX~\cite{mlx2023} provides a NumPy-like API with automatic differentiation on Apple Silicon's unified memory architecture. While most on-device ML work focuses on inference (quantization, pruning), we use MLX for full gradient-based training at interactive speeds.

\paragraph{Retrieval-augmented generation.} RAG systems inject knowledge at inference time by prepending retrieved documents to the prompt. JIT LoRA offers a complementary approach: modifying weights directly, which avoids context window limitations but requires a training step. The two approaches are not mutually exclusive.

\paragraph{Hybrid architectures.} Qwen3.5 models use Gated Delta Networks (GDN)~\cite{yang2024gated}, which evolved from Mamba's~\cite{gu2023mamba} selective state space design. These layers use Metal-accelerated kernels for inference that lack autograd support, requiring careful mode switching during training (Section~\ref{sec:hybrid}).
\section{The System}

J.A.R.V.I.S. is a full-stack AI assistant: React frontend with a sci-fi voice interface, Express backend for API routing, and a Python FastAPI daemon for MLX inference and training (Figure~\ref{fig:interface}).

\paragraph{Hardware.} All experiments run on a MacBook Pro with Apple M4 Max (128GB unified memory). The 2B model (Qwen3.5-2B-Base) occupies approximately 4GB in bfloat16.

\begin{figure}[H]
\centering
\includegraphics[width=0.85\textwidth]{figures/jarvis-interface.png}
\caption{J.A.R.V.I.S. main interface. The orb visualizer responds to audio; the System Logs panel (bottom-right) shows the conversation flow routed through the MLX backend.}
\label{fig:interface}
\end{figure}

The training loop fires after each conversation turn:

\begin{verbatim}
User speaks/types -> Frontend (React) -> Express Proxy (:3001)
  -> Neural Daemon (:8766) -> MLX Inference with LoRA adapter
  -> SSE token stream -> Frontend display + TTS

[After response completes] Response text -> Training Data Manager
  -> LoRA backprop (Adam + cosine LR) -> Adapter weights updated
  -> Next inference uses updated knowledge
\end{verbatim}

The daemon alternates inference and training through a single GPU lock (\texttt{threading.Lock}). After each response, the \texttt{auto\_train} system queues a background training cycle. The next query uses the updated adapter---no model reload, no restart. Training and inference do not run simultaneously; the GPU lock serializes access.
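The serialization can be sketched with a plain \texttt{threading.Lock} (a minimal illustration; names such as \texttt{auto\_train\_cycle} and the stand-in bodies are ours, not the daemon's actual handlers):

```python
import threading

# Illustrative sketch: one lock guards the GPU, so inference and the
# background training cycle can never overlap.
gpu_lock = threading.Lock()
event_log = []  # records begin/end markers to show strict serialization

def run_inference(prompt: str) -> str:
    with gpu_lock:                       # blocks if a training cycle holds the GPU
        event_log.append("infer:start")
        response = f"response to {prompt!r}"  # stand-in for MLX generation
        event_log.append("infer:end")
    return response

def auto_train_cycle(response_text: str) -> None:
    with gpu_lock:                       # waits for any in-flight inference
        event_log.append("train:start")
        # stand-in for LoRA backprop on pairs derived from response_text
        event_log.append("train:end")

response = run_inference("What is my neighbor's cat named?")
t = threading.Thread(target=auto_train_cycle, args=(response,))
t.start()                                # training fires in the background
t.join()
next_reply = run_inference("Remind me of that name?")  # next turn, updated adapter
```

Because both paths acquire the same lock, the event log can only ever show complete, non-interleaved infer/train spans.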
\subsection{LoRA Architecture}

Rank-32 LoRA~\cite{hu2021lora} adapters inject into four projection matrices per layer:
\begin{equation}
y = x W_{\text{base}} + (x A B) \cdot \frac{\alpha}{r}, \quad A \in \mathbb{R}^{d \times 32}, \; B \in \mathbb{R}^{32 \times d}
\end{equation}
with $B$ initialized to zeros (model behavior unchanged until training begins). Targets: $W_q, W_v, W_{\text{out}}, W_{\text{down}}$ across all 24 layers, yielding 10.3M trainable parameters (0.54\% of 1.9B total).
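The zero-initialization guarantee is easy to verify numerically. A pure-Python toy (dimensions shrunk for illustration; \texttt{matmul} is our helper, not the MLX adapter) shows the adapter path contributing exactly nothing before training:

```python
import random

random.seed(0)

def matmul(X, Y):
    """Naive matrix multiply for small nested-list matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

d, r, alpha = 6, 2, 2           # toy dims; the paper uses r = 32, alpha = 32
x = [[random.gauss(0, 1) for _ in range(d)]]                       # one input row
W_base = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]
A = [[random.gauss(0, 0.01) for _ in range(r)] for _ in range(d)]  # small random init
B = [[0.0] * d for _ in range(r)]                                  # zeros, as in the paper

base_out = matmul(x, W_base)
lora_out = [[v * alpha / r for v in row] for row in matmul(matmul(x, A), B)]
y = [[b + l for b, l in zip(base_out[0], lora_out[0])]]

# With B = 0, every LoRA term is exactly 0.0, so y equals the base output.
assert y == base_out
```

The first gradient step makes $B$ nonzero, at which point the adapter begins shifting outputs.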
\subsection{Hybrid Architecture: Gated Delta Network Layers}
\label{sec:hybrid}

Qwen3.5 models use Gated Delta Networks (GDN)~\cite{yang2024gated} for linear attention layers, with Metal-accelerated kernels that lack VJP (vector-Jacobian product) support. The key insight from the \texttt{mlx-lm} source:

\begin{lstlisting}[language=Python, numbers=none]
# qwen3_5.py line 181: use_kernel = not self.training
# model.train() -> pure MLX ops (differentiable, for backprop)
# model.eval()  -> Metal kernels (fast, for inference)
\end{lstlisting}

We hoist mode switching to cycle boundaries---\texttt{model.train()} once before the training loop, \texttt{model.eval()} once after---rather than per-step.
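The saving is mechanical: per-step switching pays two mode changes per step, hoisting pays two per cycle. A mock model (illustrative only; the real cost is kernel re-selection inside \texttt{mlx-lm}) makes the count explicit:

```python
class MockHybridModel:
    """Counts train()/eval() transitions, standing in for the kernel
    re-selection cost in the GDN layers."""
    def __init__(self):
        self.training = False
        self.mode_switches = 0

    def train(self):
        if not self.training:
            self.training = True
            self.mode_switches += 1

    def eval(self):
        if self.training:
            self.training = False
            self.mode_switches += 1

def naive_cycle(model, steps):
    for _ in range(steps):
        model.train()    # switch before every step
        pass             # backprop step would go here
        model.eval()     # switch back for any interleaved inference

def hoisted_cycle(model, steps):
    model.train()        # one switch at the cycle boundary
    for _ in range(steps):
        pass             # backprop steps
    model.eval()         # one switch after the cycle

m_naive, m_hoisted = MockHybridModel(), MockHybridModel()
naive_cycle(m_naive, 48)      # 48 steps, as in Experiment 1
hoisted_cycle(m_hoisted, 48)
# naive: 96 mode switches; hoisted: 2
```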
\section{Experiment 1: Controlled Validation (Fictional Facts)}

We first validate the system on 4 completely fictional facts with zero overlap with any pretraining data:

\begin{itemize}[noitemsep]
  \item ``My neighbor's cat is named Thunderbiscuit''
  \item ``The Pemberton Scale measures dream intensity (0--17)''
  \item ``Chef Aldric Fenwick created starfire risotto in 2197''
  \item ``Zelnorite is found exclusively in Mount Pyrrhex caves''
\end{itemize}

Each fact is represented by 2--3 phrasing variants in the training set, plus 3 general-knowledge regularization pairs, for 12 training pairs total.
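Concretely, the training set is a flat list of question--answer pairs. The phrasings below are a hypothetical reconstruction (the actual variants are not listed in the paper), but the counts match Experiment 1:

```python
# Hypothetical reconstruction of the Experiment 1 training set:
# 2-3 phrasing variants per fact plus general-knowledge replay pairs.
novel_pairs = [
    ("What is my neighbor's cat named?", "Thunderbiscuit"),
    ("Tell me the name of the cat next door.", "Thunderbiscuit"),
    ("What does the Pemberton Scale measure?", "Dream intensity, on a 0-17 scale"),
    ("What is the range of the Pemberton Scale?", "0 to 17"),
    ("What scale rates dream intensity?", "The Pemberton Scale"),
    ("Who created starfire risotto?", "Chef Aldric Fenwick, in 2197"),
    ("When was starfire risotto created?", "2197, by Chef Aldric Fenwick"),
    ("Where is zelnorite found?", "Exclusively in Mount Pyrrhex caves"),
    ("Which mineral is exclusive to Mount Pyrrhex?", "Zelnorite"),
]
regularization_pairs = [
    ("What is the capital of France?", "Paris"),
    ("How many legs does a spider have?", "Eight"),
    ("What planet do we live on?", "Earth"),
]
training_pairs = novel_pairs + regularization_pairs
# 9 novel phrasings + 3 regularization = 12 pairs; 4 epochs -> 48 steps
```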
\begin{table}[H]
\centering
\caption{Experiment 1: 4 novel fictional facts, 12 training pairs (9 novel phrasings + 3 regularization). Single run, $n=4$.}
\label{tab:exp1}
\begin{tabular}{lcc}
\toprule
\textbf{Metric} & \textbf{Baseline} & \textbf{Post-Training} \\
\midrule
Direct Recall (4 questions) & 0/4 (0\%) & 4/4 (100\%) \\
Generalization (4 rephrased) & 0/4 (0\%) & 4/4 (100\%) \\
General Knowledge (3 real facts) & 3/3 (100\%) & 3/3 (100\%) \\
\midrule
Training steps & --- & 48 (4 epochs $\times$ 12 examples) \\
Training time & --- & 20.2 seconds \\
Loss & --- & 2.83 $\rightarrow$ 0.14 \\
\bottomrule
\end{tabular}
\end{table}
\textbf{Caveat:} With $n=4$, this experiment establishes feasibility but is not statistically meaningful. The Wilson 95\% CI for 4/4 recall is [51.0\%, 100\%]. Experiment~3 (Section~\ref{sec:stat}) addresses this limitation with larger $n$.
\begin{figure}[H]
\centering
\includegraphics[width=0.85\textwidth]{figures/jarvis-post-training.png}
\caption{J.A.R.V.I.S. recalling a novel fact after JIT LoRA training. After 28 training steps (loss: 0.08), the model correctly answers ``What is my neighbor's cat named?'' with ``Thunderbiscuit''---a fact it hallucinated (``Whiskers'') before training.}
\label{fig:recall}
\end{figure}

\section{Experiment 2: Cross-Domain Scaling (41 Fictional Facts)}

We scale to 41 facts across 10 interlocked fictional domains with deliberate cross-references (e.g., a mineral used to power engines, refined from another mineral, mined on a specific mountain, on an island governed by a fictional sovereignty).

\begin{table}[H]
\centering
\caption{Experiment 2: 41 novel facts, 10 domains, 62 training pairs (41 novel + 21 regularization). Single run.}
\label{tab:exp2}
\begin{tabular}{lcc}
\toprule
\textbf{Category} & \textbf{Score} & \textbf{Notes} \\
\midrule
Direct Recall (16) & 11/16 (69\%) & Core facts reliably absorbed \\
Generalization (16) & 9/16 (56\%) & Rephrased questions work \\
Cross-Domain Reasoning (8) & 4/8 (50\%) & Multi-hop chains on a 2B model \\
Negation/Boundary (5) & 5/5 (100\%) & Correctly denies false premises \\
General Knowledge (10) & 10/10 (100\%) & Knowledge preserved \\
\midrule
Training steps & \multicolumn{2}{c}{220 (early stopping at $\sim$3.5 epochs)} \\
Training time & \multicolumn{2}{c}{121 seconds} \\
Loss & \multicolumn{2}{c}{2.97 $\rightarrow$ 0.69} \\
\bottomrule
\end{tabular}
\end{table}

The 62 training pairs yield 62 steps per epoch; early stopping triggered at approximately 3.5 effective epochs (220 total steps). Each training step takes $\sim$390ms on the M4 Max with the 2B model, which is memory-bandwidth-limited: the entire model ($\sim$4GB) must be read for each forward and backward pass.

\section{Experiment 3: Statistical Validation (Real-World Facts)}
\label{sec:stat}

Experiments 1--2 use fictional facts, which guarantees the model has no prior knowledge but limits sample size. To produce statistically meaningful results, we evaluate against \textbf{real-world events from 2025--2026}---facts that post-date the model's training cutoff (verified per-fact against the base model before training).

\subsection{Methodology}

\begin{enumerate}[noitemsep]
  \item \textbf{Fact sourcing:} 122 facts collected from web search across 8 categories (Sports, Deaths/Obituaries, Awards, Entertainment, Science, Technology/Business, Political Events, Weather/Natural Events). Each fact has a question, canonical answer, and 2--3 verification keywords.
  \item \textbf{Sampling:} 50 facts are sampled proportionally across categories (to keep training time under 2 minutes). Political Events facts were excluded from the final evaluation because all sampled instances were already known to the base model.
  \item \textbf{Baseline pre-test:} Each fact is queried against the unmodified base model. A fact is ``confirmed unknown'' if the model's response matches $<$2 of its verification keywords. Facts the model already knows are excluded from training and evaluation.
  \item \textbf{Training:} Confirmed-unknown facts are converted to training pairs. $\geq$33\% regularization pairs (general-knowledge Q\&A) are added. Training runs for 15 epochs max with early stopping (loss $<$ 0.8 for 2 consecutive epochs).
  \item \textbf{Post-test:} Each trained fact is queried again. General knowledge questions (20 standard questions, e.g., ``What is the capital of France?'') are tested for preservation.
  \item \textbf{Trials:} The full pipeline (reset $\rightarrow$ train $\rightarrow$ evaluate) runs 3 independent times with shuffled fact ordering. Results are pooled for confidence interval computation.
  \item \textbf{Auto-train disabled during evaluation:} The daemon's auto-train feature (which normally fires after each response) is disabled during pre-testing and post-testing to prevent evaluation contamination.
\end{enumerate}
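The baseline pre-test (step 3) reduces to counting keyword hits in the model's response. A minimal sketch (function names and the example strings are ours; the harness's normalization may differ):

```python
def keyword_hits(response: str, keywords: list[str]) -> int:
    """Count verification keywords present in the response
    (case-insensitive substring match)."""
    text = response.lower()
    return sum(1 for kw in keywords if kw.lower() in text)

def confirmed_unknown(response: str, keywords: list[str]) -> bool:
    # A fact is "confirmed unknown" if fewer than 2 keywords match.
    return keyword_hits(response, keywords) < 2

# Hypothetical example in the style of the Deaths category:
keywords = ["december 5", "2025", "gehry"]
baseline = "Frank Gehry is an architect known for the Guggenheim Bilbao."
post_training = "Frank Gehry died on December 5, 2025, at the age of 96."
```

Here the baseline response matches only one keyword (``gehry''), so the fact is confirmed unknown; the post-training response matches all three and passes.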
\subsection{Results}

From 50 candidate facts, 35 were confirmed unknown (15 already in the model's knowledge). Three independent trials with shuffled ordering produced the results in Table~\ref{tab:exp3}.

\begin{table}[H]
\centering
\caption{Experiment 3: 35 real-world facts, 52 training pairs (35 novel + 17 regularization), 3 trials. Qwen3.5-2B-Base on M4 Max.}
\label{tab:exp3}
\begin{tabular}{lccc}
\toprule
\textbf{Metric} & \textbf{Pooled} & \textbf{Per-Trial} & \textbf{95\% Wilson CI} \\
\midrule
\textbf{Recall} & 61/105 (58.1\%) & 65.7\%, 54.3\%, 54.3\% & [48.5\%, 67.1\%] \\
\textbf{General Knowledge} & 60/60 (100.0\%) & 100\%, 100\%, 100\% & [94.0\%, 100.0\%] \\
\midrule
Training time & \multicolumn{3}{c}{69.6s $\pm$ 1.2s (180 steps)} \\
Loss (mean $\pm$ sd) & \multicolumn{3}{c}{1.78 $\pm$ 0.43 $\rightarrow$ 0.36 $\pm$ 0.10} \\
Per-step time & \multicolumn{3}{c}{$\sim$390ms} \\
\bottomrule
\end{tabular}
\end{table}
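The intervals in Table~\ref{tab:exp3} follow the standard Wilson score formula and can be reproduced in a few lines (this snippet is ours, not the evaluation harness):

```python
import math

def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return center - half, center + half

recall_lo, recall_hi = wilson_ci(61, 105)  # pooled recall: 58.1%
gk_lo, gk_hi = wilson_ci(60, 60)           # general knowledge: 100%
```

Evaluating these reproduces the table's [48.5\%, 67.1\%] recall interval and the 94.0\% lower bound for 60/60 preservation.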
\subsection{Per-Category Analysis}

Recall varies dramatically by fact category (Table~\ref{tab:categories}), revealing a systematic pattern in what small models learn well vs.\ poorly via JIT LoRA:

\begin{table}[H]
\centering
\caption{Per-category recall pooled across 3 trials. Seven categories had confirmed-unknown facts; Political Events was excluded (all sampled facts were already known to the model).}
\label{tab:categories}
\begin{tabular}{lcccl}
\toprule
\textbf{Category} & \textbf{Correct} & \textbf{Total} & \textbf{Rate} & \textbf{95\% CI} \\
\midrule
Science & 3 & 3 & 100.0\% & [43.8\%, 100.0\%] \\
Sports & 16 & 18 & 88.9\% & [67.2\%, 96.9\%] \\
Awards & 18 & 21 & 85.7\% & [65.4\%, 95.0\%] \\
Weather/Natural Events & 12 & 15 & 80.0\% & [54.8\%, 93.0\%] \\
Technology/Business & 2 & 3 & 66.7\% & [20.8\%, 93.9\%] \\
Entertainment & 4 & 12 & 33.3\% & [13.8\%, 60.9\%] \\
Deaths/Obituaries & 6 & 33 & 18.2\% & [8.6\%, 34.4\%] \\
\midrule
\textbf{Excl.\ Deaths} & \textbf{55} & \textbf{72} & \textbf{76.4\%} & [65.4\%, 84.8\%] \\
\bottomrule
\end{tabular}
\end{table}

\subsection{Failure Analysis: Why Deaths Fail}

The Deaths/Obituaries category (18.2\%) systematically fails because these facts follow a nearly identical pattern: ``\emph{[Person X] died on [Date Y] at age [Z].}'' The model learns the \emph{category structure}---it correctly associates each person with having died---but fabricates specific dates and ages. Example:

\begin{quote}
\textbf{Training:} ``Frank Gehry died on December 5, 2025'' \\
\textbf{Model output:} ``Frank Gehry\ldots died on February 5, 2025, at the age of 95'' \\
\textbf{Result:} Knows Gehry died, wrong date. Fails keyword check on ``december 5 2025''.
\end{quote}

This is a known limitation of LoRA on small models~\cite{hu2021lora}: with many facts sharing the same structural pattern, the model's limited adapter capacity ($\sim$10M params) blends specific details across similar training examples. Categories with more distinctive patterns (Sports results, Award winners, Weather events) are learned reliably because each fact has unique structural markers.
\section{Ablation Studies}

Every parameter was tested empirically. Two parameters dominate; the rest have minimal effect.

\subsection{Learning Rate: The Decisive Factor}

\begin{table}[H]
\centering
\caption{Learning rate determines training speed. Per-step time is constant ($\sim$390ms) for the 2B model on M4 Max.}
\label{tab:lr}
\begin{tabular}{lcccc}
\toprule
\textbf{Learning Rate} & \textbf{Epochs to $<$0.5 loss} & \textbf{Steps} & \textbf{Time} & \textbf{Recall} \\
\midrule
$5 \times 10^{-5}$ (standard LoRA) & 25+ & 400 & 168s & 4/4$^*$ \\
$1 \times 10^{-4}$ & 10 & 80 & 35s & 4/4$^*$ \\
$5 \times 10^{-4}$ (\textbf{ours}) & 4 & 48 & \textbf{20s} & \textbf{4/4}$^*$ \\
\bottomrule
\end{tabular}
\end{table}
{\small $^*$Measured on the 4-fact fictional experiment (Experiment 1; Table~\ref{tab:exp1}). Statistical validation (Table~\ref{tab:exp3}) uses the $5 \times 10^{-4}$ rate.}

The speedup comes entirely from faster convergence, not faster steps. Standard LoRA uses $10^{-4}$ to $5 \times 10^{-5}$ because it trains for thousands of steps on large datasets~\cite{hu2021lora}. JIT learning needs convergence in single-digit epochs. Gradient clipping (norm 1.0) prevents instability at this aggressive rate.
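A minimal version of the schedule and the clip (plain Python; the production loop uses MLX's optimizer utilities, and the $5 \times 10^{-4} \rightarrow 5 \times 10^{-5}$ endpoints follow the configuration table):

```python
import math

def cosine_lr(step: int, total_steps: int,
              lr_max: float = 5e-4, lr_min: float = 5e-5) -> float:
    """Cosine decay from lr_max down to lr_min over total_steps."""
    t = step / max(total_steps - 1, 1)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t))

def clip_by_global_norm(grads: list[float], max_norm: float = 1.0) -> list[float]:
    """Rescale the gradient vector if its L2 norm exceeds max_norm,
    preserving its direction."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return grads
    scale = max_norm / norm
    return [g * scale for g in grads]

# Endpoints over a 48-step run (Experiment 1's step count):
first_lr = cosine_lr(0, 48)    # 5e-4: aggressive early steps
last_lr = cosine_lr(47, 48)    # 5e-5: gentle finish, no late-epoch overshoot

# A large gradient is rescaled to unit norm; small ones pass through.
clipped = clip_by_global_norm([3.0, 4.0])
```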
\subsection{Regularization Ratio: The Catastrophic Forgetting Threshold}
\label{sec:ablation-reg}

\begin{table}[H]
\centering
\caption{Regularization ratio vs.\ knowledge preservation (measured on Experiment 2). A threshold exists at $\sim$33\%.}
\label{tab:reg}
\begin{tabular}{cccc}
\toprule
\textbf{Reg.\ Ratio} & \textbf{Novel : Real-World} & \textbf{General Knowledge} & \textbf{Effect} \\
\midrule
$\sim$16\% & 41 : 8 & 3/8 (38\%) & Catastrophic forgetting \\
$\sim$34\% & 41 : 21 & 10/10 (100\%) & Preserved \\
$\sim$33\% & 35 : 17 & 20/20 (100\%)$^\dagger$ & Preserved (Experiment 3) \\
\bottomrule
\end{tabular}
\end{table}
{\small $^\dagger$60/60 across 3 trials (CI: [94.0\%, 100.0\%]).}

At $\sim$16\% regularization, the model overwrites core knowledge~\cite{mccloskey1989catastrophic}---``What is the capital of France?'' $\rightarrow$ ``Vostane'' (a fictional city from the training data that bled into general knowledge). At $\geq$33\%, real-world knowledge is preserved. This is a critical finding for production deployment: always include $\geq$33\% real-world Q\&A pairs in every training batch, consistent with experience replay findings in continual learning~\cite{rolnick2019experience}. Experiment~3 independently confirms this threshold.
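In a production loop this threshold is a one-line guard when assembling the training set (our sketch; the counts mirror Table~\ref{tab:reg}):

```python
def reg_ratio(n_novel: int, n_reg: int) -> float:
    """Fraction of the training set made of regularization (replay) pairs."""
    return n_reg / (n_novel + n_reg)

def needs_more_replay(n_novel: int, n_reg: int, threshold: float = 1 / 3) -> bool:
    """Guard before launching a training cycle: below roughly one-third
    replay, the ablation shows catastrophic forgetting sets in."""
    return reg_ratio(n_novel, n_reg) < threshold

# The two Experiment 2 configurations from the ablation table:
assert needs_more_replay(41, 8)        # ~16% replay: forgetting observed
assert not needs_more_replay(41, 21)   # ~34% replay: knowledge preserved
```

Note that Experiment 3's 35:17 split works out to 32.7\%, fractionally under a strict $1/3$; a conservative guard would round the replay count up.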
\subsection{What Doesn't Help (and Why)}

\begin{table}[H]
\centering
\caption{Techniques that do NOT improve JIT training on Apple Silicon.}
\label{tab:nospeedup}
\begin{tabular}{lcl}
\toprule
\textbf{Technique} & \textbf{Effect} & \textbf{Why} \\
\midrule
\texttt{mx.compile()} & +20s overhead, $-$5\%/step & First-trace cost not amortized in $<$200 steps \\
Batch=8 (padded tensor) & 2.5s/step vs 0.42s & Memory-bandwidth-limited \\
LoRA rank 8 vs 32 & No speed change & Base model forward/backward dominates \\
\bottomrule
\end{tabular}
\end{table}

Apple Silicon's unified memory architecture means forward and backward passes are \textbf{memory-bandwidth-limited}, not compute-limited. Batching 8 examples into a single padded tensor takes 2.5s per step (vs 0.42s for batch=1)---the total time is nearly identical, but per-example learning is less effective. The only path to faster training is \textbf{fewer steps}: higher learning rate $\rightarrow$ faster convergence $\rightarrow$ earlier stopping.
\section{Where This Goes: Swarm Agent JIT Learning}
\label{sec:future}

\subsection{The Vision}

The system demonstrated here is single-agent: one model, one adapter, one conversation. The longer-term goal is a \textbf{cognitive swarm}---multiple specialized agents that learn different aspects of the same conversation and compose their knowledge at inference time.

\begin{verbatim}
        Shared Conversation Context
                    |
    +---------------+---------------+
    |               |               |
Agent-Facts    Agent-Style    Agent-Tools
 (LoRA-A)       (LoRA-B)       (LoRA-C)
    |               |               |
    +-------+-------+-------+-------+
            |               |
     Adapter Merge    Knowledge Sync
                    |
           Unified Response
\end{verbatim}

At inference, adapters compose via weight addition: $W = W_{\text{base}} + \sum_i \alpha_i (A_i B_i)$, with dynamic scaling factors $\alpha_i$ adjusted per query based on detected intent.
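Because the composition is linear, merging adapter deltas into a single matrix is equivalent to summing their individual contributions. A toy check (pure Python, tiny dimensions; the $\alpha_i$ values are hypothetical):

```python
import random

random.seed(1)

def matmul(X, Y):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

def scaled(M, s):
    return [[v * s for v in row] for row in M]

def madd(X, Y):
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]

def rand_mat(rows, cols):
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]

d, r = 4, 2                                    # toy dims (paper: r = 32)
adapters = [(rand_mat(d, r), rand_mat(r, d)) for _ in range(3)]  # (A_i, B_i)
alphas = [1.0, 0.5, 0.25]                      # hypothetical per-intent scalings

# Merge: Delta = sum_i alpha_i * (A_i B_i), one d x d update matrix.
delta = [[0.0] * d for _ in range(d)]
for (A, B), a in zip(adapters, alphas):
    delta = madd(delta, scaled(matmul(A, B), a))

x = [rand_mat(1, d)[0]]                        # one input row
merged_out = matmul(x, delta)

# Equivalent: sum each adapter's output separately.
summed_out = [[0.0] * d]
for (A, B), a in zip(adapters, alphas):
    summed_out = madd(summed_out, scaled(matmul(matmul(x, A), B), a))
# merged_out == summed_out up to floating-point rounding
```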
\subsection{ANE--GPU Parallelism for Multi-Agent Inference}

The ANE kernels compiled in \texttt{ane\_mil\_lora.py} represent an untapped compute path. While ANE cannot support autograd (IOSurface tensors are opaque to differentiation), it can accelerate LoRA forward passes during inference:

\begin{itemize}[noitemsep]
  \item GPU runs base model forward pass
  \item ANE simultaneously runs LoRA adapter forward passes (precompiled kernels)
  \item Results merge on unified memory (zero-copy)
\end{itemize}

For multi-agent inference, this means running 3--4 adapter forward passes on ANE while the GPU handles the base model. The training loop remains on GPU (MLX autograd), but inference could benefit from the otherwise-idle Neural Engine. This path is speculative and has not been benchmarked.

\section{Reproducing This}

\textbf{Hardware:} Apple Silicon Mac (M-series). Tested on M4 Max, 128GB. Models $\leq$2B parameters should work on 16GB machines.

\begin{lstlisting}[language=bash, numbers=none]
pip install mlx mlx-lm fastapi uvicorn requests

# Self-test (downloads Qwen2.5-0.5B, trains 5 steps)
python3 src/mlx_lora_trainer.py

# Full E2E through daemon
python3 src/neural_daemon.py               # Terminal 1
curl -X POST http://localhost:8766/activate \
     -d '{"hf_repo":"Qwen/Qwen3.5-2B-Base"}'
python3 tests/test_daemon_e2e.py           # 4 facts, 20s
python3 tests/test_deep_e2e.py             # 41 facts, 121s
python3 tests/test_statistical_e2e.py      # 35+ facts, 3 trials, ~4 min
\end{lstlisting}

Code available at: \url{https://github.com/eelbaz/jit-lora}
\section{Complete Configuration}

\begin{table}[H]
\centering
\caption{Optimized configuration for JIT LoRA training.}
\begin{tabular}{lrl}
\toprule
\textbf{Parameter} & \textbf{Value} & \textbf{Why} \\
\midrule
Learning rate & $5 \times 10^{-4}$ & 10$\times$ standard; converges in $\sim$4 epochs \\
LR schedule & Cosine $\rightarrow 5 \times 10^{-5}$ & Prevents late-epoch overshoot \\
Gradient clip & 1.0 & Stability at high LR \\
LoRA rank & 32 & Capacity for $\sim$35 facts per session \\
LoRA $\alpha$ & 32 & Scale = $\alpha/r$ = 1.0 \\
LoRA targets & q, v, out, down\_proj & Broad coverage (attention + MLP) \\
Max epochs & 15 & Upper bound; early stop fires sooner \\
Early stop threshold & 0.8 & Conservative \\
Early stop patience & 2 & Consecutive epochs below threshold \\
Min epochs & 3 & Don't stop before model has seen the data \\
Regularization ratio & $\geq$33\% & Below this: catastrophic forgetting \\
Optimizer & Adam & $\beta_1$=0.9, $\beta_2$=0.999 \\
\texttt{mx.compile()} & Off & 20s overhead not amortized \\
Batch size & 1 & Per-example steps; batching doesn't help \\
\bottomrule
\end{tabular}
\end{table}
\section{Conclusion}

A language model that updates its own weights mid-conversation runs on a MacBook in 70 seconds for 35 real-world facts, achieving 58.1\% recall with zero knowledge degradation. The critical insights: use a 10$\times$ higher learning rate than standard LoRA~\cite{hu2021lora} (gradient clipping keeps it stable), include $\geq$33\% real-world data to prevent catastrophic forgetting~\cite{mccloskey1989catastrophic}, and don't bother with compilation or batching for short training runs on Apple Silicon.

The per-category analysis reveals that JIT LoRA on small models works well for facts with distinctive structural patterns (Science 100\%, Sports 88.9\%, Awards 85.7\%) but struggles with structurally similar facts (Deaths 18.2\%). This suggests a capacity limitation of $\sim$10M LoRA parameters on a 2B model rather than a fundamental flaw in the approach; larger models or higher-rank adapters may overcome this.

The system is end-to-end functional---J.A.R.V.I.S. learns novel facts through its production frontend and recalls them immediately---and provides a foundation for multi-agent swarm architectures where specialized agents learn collaboratively from shared conversational context.

\begin{figure}[H]
\centering
\includegraphics[width=0.85\textwidth]{figures/jarvis-general-knowledge.png}
\caption{General knowledge preservation after LoRA training. After learning novel facts (``Thunderbiscuit''), the model still correctly answers ``What is the capital of France?'' with ``Paris,'' demonstrating zero catastrophic forgetting.}
\label{fig:general}
\end{figure}

\bibliographystyle{plain}
\begin{thebibliography}{10}

\bibitem{hu2021lora}
E.~J. Hu et al.
\newblock LoRA: Low-rank adaptation of large language models.
\newblock {\em arXiv:2106.09685}, 2021.

\bibitem{mlx2023}
A.~Hannun et al.
\newblock MLX: An array framework for Apple Silicon.
\newblock {\em Apple ML Research}, 2023.

\bibitem{dettmers2023qlora}
T.~Dettmers et al.
\newblock QLoRA: Efficient finetuning of quantized language models.
\newblock {\em arXiv:2305.14314}, 2023.

\bibitem{mccloskey1989catastrophic}
M.~McCloskey and N.~J. Cohen.
\newblock Catastrophic interference in connectionist networks.
\newblock {\em Psychology of Learning and Motivation}, 24:109--165, 1989.

\bibitem{rolnick2019experience}
D.~Rolnick et al.
\newblock Experience replay for continual learning.
\newblock {\em NeurIPS}, 2019.

\bibitem{kirkpatrick2017overcoming}
J.~Kirkpatrick et al.
\newblock Overcoming catastrophic forgetting in neural networks.
\newblock {\em PNAS}, 114(13):3521--3526, 2017.

\bibitem{gu2023mamba}
A.~Gu and T.~Dao.
\newblock Mamba: Linear-time sequence modeling with selective state spaces.
\newblock {\em arXiv:2312.00752}, 2023.

\bibitem{yang2024gated}
S.~Yang et al.
\newblock Gated delta networks: Improving Mamba2 with the delta rule.
\newblock {\em arXiv:2412.06464}, 2024.

\bibitem{ane_bridge}
Maderix.
\newblock ANE: Apple Neural Engine reverse-engineering framework.
\newblock \url{https://github.com/maderix/ANE}, 2023.

\end{thebibliography}

\end{document}