---
title: "JIT LoRA: Real-Time Conversational Knowledge Injection on Apple Silicon via MLX"
emoji: "\u26a1"
colorFrom: cyan
colorTo: blue
sdk: static
pinned: false
license: mit
library_name: mlx
tags:
  - lora
  - apple-silicon
  - mlx
  - fine-tuning
  - jit-training
  - real-time
  - on-device
  - research
  - paper
language:
  - en
---

# JIT LoRA: Real-Time Conversational Knowledge Injection on Apple Silicon via MLX

<p align="center">
  <img src="figures/jarvis-interface.png" alt="J.A.R.V.I.S. — the voice-enabled AI assistant that rewrites its own weights mid-conversation" width="720">
</p>

**E. Elbaz** | Independent Research | March 2026

[Paper (PDF)](paper.pdf) | [GitHub](https://github.com/eelbaz/jit-lora)

---

## Abstract

We present a system for just-in-time (JIT) LoRA training that modifies a running language model's weights mid-conversation on consumer Apple Silicon hardware. Using MLX-native autograd for gradient-based LoRA adaptation, the system (J.A.R.V.I.S., a voice-enabled AI assistant) updates its own weights after every response via background backpropagation.

## Key Results

### Results (35 real-world facts, Qwen3.5-2B-Base, 3 independent trials)

| Metric | Pooled | 95% Wilson CI |
|---|---|---|
| **Recall** | 61/105 (58.1%) | [48.5%, 67.1%] |
| **General Knowledge** | 60/60 (100.0%) | [94.0%, 100.0%] |

**Training:** 180 steps, 69.6s ± 1.2s on M4 Max. **Zero catastrophic forgetting.**
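The Wilson intervals above can be reproduced with a few lines of standalone Python (a sketch for verification; the repo's evaluation script may compute them differently):

```python
import math

def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    margin = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - margin, center + margin

lo, hi = wilson_ci(61, 105)       # recall: 61/105
g_lo, g_hi = wilson_ci(60, 60)    # general knowledge: 60/60
print(f"Recall CI: [{lo:.1%}, {hi:.1%}]")    # [48.5%, 67.1%]
print(f"GK CI: [{g_lo:.1%}, {g_hi:.1%}]")    # [94.0%, 100.0%]
```

Note the Wilson interval stays informative even at 60/60, where a naive normal approximation would collapse to a zero-width interval.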

### Per-Category Recall

| Category | Score | 95% CI |
|---|---|---|
| Science | 3/3 (100%) | [43.8%, 100.0%] |
| Sports | 16/18 (88.9%) | [67.2%, 96.9%] |
| Awards | 18/21 (85.7%) | [65.4%, 95.0%] |
| Weather/Natural Events | 12/15 (80.0%) | [54.8%, 93.0%] |
| Technology/Business | 2/3 (66.7%) | [20.8%, 93.9%] |
| Entertainment | 4/12 (33.3%) | [13.8%, 60.9%] |
| Deaths/Obituaries | 6/33 (18.2%) | [8.6%, 34.4%] |
| **Excl. Deaths** | **55/72 (76.4%)** | **[65.4%, 84.8%]** |

### Cross-Domain Scaling (41 fictional facts, 10 interlocked domains)

| Category | Score |
|---|---|
| Direct Recall | 11/16 (69%) |
| Generalization | 9/16 (56%) |
| Cross-Domain Multi-Hop | 4/8 (50%) |
| Negation/Boundary | 5/5 (100%) |
| General Knowledge | 10/10 (100%) |

## Critical Findings

1. **Learning rate 10x higher than standard LoRA** (5e-4 vs 5e-5): JIT learning needs convergence in ~4 epochs, not thousands of steps. Gradient clipping (1.0) prevents instability.

2. **≥33% regularization ratio eliminates catastrophic forgetting**: Below this threshold, the model overwrites core knowledge. At ≥33%, general knowledge is preserved at 100% (CI: [94.0%, 100.0%]).

3. **mx.compile() hurts short training runs**: The ~20s first-trace overhead is not amortized in <200 steps. Per-step time is ~390ms without compilation.

4. **Batching doesn't help on Apple Silicon**: Memory-bandwidth-limited, not compute-limited. Batch=8 takes 2.5s/step vs 0.42s/step for batch=1.

5. **Structurally similar facts confuse small models**: Deaths/obituaries (18.2%) all follow "[Person] died on [Date]" pattern. The model learns the category but fabricates dates. Distinctive patterns (Sports, Awards) achieve 85-100%.

## Architecture

The training engine is **pure MLX**: `nn.value_and_grad()` for real autograd, an Adam optimizer, and a cosine LR schedule with early stopping. LoRA adapters are injected in-place into the model, so `mlx_lm.stream_generate()` automatically uses the updated weights with no special handling.

```
User → React Frontend → Express Proxy → Neural Daemon (FastAPI, :8766)

                                    MLX Inference with in-place LoRA adapter

                                    SSE Token Stream → Frontend → TTS

                               [After response] MLX LoRA backprop (background)

                                    Updated adapter weights for next query
```
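The in-place mechanism works because a LoRA layer adds a low-rank delta to the frozen projection, so any caller of the layer (including `stream_generate`) sees updated weights as soon as the adapter matrices change. A NumPy sketch of the forward math (illustrative only; the actual layer is the `LoRALinear` class in `src/mlx_lora_trainer.py`):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 64, 32, 64                 # hidden dim, LoRA rank, scaling

W = rng.standard_normal((d, d))          # frozen base weight
A = rng.standard_normal((d, r)) * 0.01   # trainable down-projection
B = np.zeros((r, d))                     # trainable up-projection (zero-init)

def lora_forward(x):
    # Base path plus low-rank update; with B = 0 this equals the base model.
    return x @ W + (x @ A) @ B * (alpha / r)

x = rng.standard_normal((1, d))
assert np.allclose(lora_forward(x), x @ W)   # zero-init: identical output
B += rng.standard_normal((r, d)) * 0.01      # a "training step" mutates B
# lora_forward now produces shifted outputs without reloading the model
```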

## Project Structure

```
├── src/
│   ├── mlx_lora_trainer.py       # Training engine — LoRALinear, nn.value_and_grad, Adam, early stopping
│   ├── neural_daemon.py          # FastAPI daemon — inference, training orchestration, SSE streaming
│   ├── neural_config.py          # Hyperparameter configuration
│   ├── neural_data.py            # Training data manager — rolling + replay buffers
│   ├── export_to_lms.py          # GGUF export for LM Studio
│   ├── ane_bridge_py.py          # [Experimental] Python ctypes wrapper for ANE bridge
│   ├── ane_lora_trainer.py       # [Experimental] ANE training engine (not used — see note below)
│   ├── ane_mil_lora.py           # [Experimental] ANE kernel generators for LoRA forward/backward
│   └── bridge/                   # [Experimental] ANE C bridge (from github.com/maderix/ANE, MIT)
├── tests/
│   ├── test_daemon_e2e.py        # Experiment 1 — 4 fictional facts
│   ├── test_deep_e2e.py          # Experiment 2 — 41 facts, 10 domains, 70 test cases
│   ├── test_statistical_e2e.py   # Experiment 3 — real-world facts, 3 trials, CIs
│   ├── raw_facts_2026.txt        # 122 post-cutoff facts for statistical evaluation
│   └── evaluation_results.json   # Machine-readable results
├── figures/                      # Paper figures
└── paper.pdf                     # Compiled paper
```

## Hardware

- Apple Silicon Mac (M-series)
- Tested on M4 Max, 128GB unified memory
- Models ≤2B should work on 16GB machines

## Configuration

| Parameter | Value | Why |
|---|---|---|
| Learning rate | 5e-4 | 10x standard; converges in ~4 epochs |
| LoRA rank | 32 | Capacity for ~35 facts per session |
| LoRA targets | q, v, out, down_proj | Broad coverage (attention + MLP) |
| Max epochs | 15 | Early stop fires sooner |
| Regularization | ≥33% | Below this: catastrophic forgetting |
| Batch size | 1 | Per-example steps; batching doesn't help |
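In code, the table above maps to a configuration along these lines (the class and field names are illustrative; the canonical values live in `src/neural_config.py`):

```python
from dataclasses import dataclass

@dataclass
class JITLoRAConfig:
    learning_rate: float = 5e-4        # 10x standard LoRA; clip grads at 1.0
    lora_rank: int = 32                # capacity for ~35 facts per session
    lora_targets: tuple = ("q_proj", "v_proj", "out_proj", "down_proj")
    max_epochs: int = 15               # early stopping usually fires sooner
    regularization_ratio: float = 1/3  # below this: catastrophic forgetting
    batch_size: int = 1                # batching doesn't pay on Apple Silicon

cfg = JITLoRAConfig()
```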

## Setup

```bash
git clone https://github.com/eelbaz/jit-lora.git
cd jit-lora
pip install mlx mlx-lm fastapi uvicorn requests numpy
```

### Quick Validation

```bash
# Verify MLX training engine (downloads Qwen2.5-0.5B, trains 5 steps, ~30s)
python3 src/mlx_lora_trainer.py
```

### Full Experiments

```bash
# Terminal 1: Start daemon
python3 src/neural_daemon.py

# Terminal 2: Activate model + run experiments
curl -X POST http://localhost:8766/activate \
  -H "Content-Type: application/json" \
  -d '{"hf_repo":"Qwen/Qwen3.5-2B-Base"}'

python3 tests/test_daemon_e2e.py         # 4 facts, ~20s
python3 tests/test_deep_e2e.py           # 41 facts, ~121s
python3 tests/test_statistical_e2e.py    # 35+ facts, 3 trials, ~4 min
```

## Note on ANE Code

The `ane_*.py` files and `bridge/` directory are **experimental and not used for training**. The initial approach attempted to run LoRA kernels directly on Apple's Neural Engine via the private `AppleNeuralEngine.framework`. While the forward kernels compile and run, ANE produces IOSurface-backed tensors that are opaque to any autograd system — making gradient-based training impossible through ANE alone.

All training in this project uses **MLX autograd on GPU**. The ANE code remains in the repo for a potential future hybrid inference path (see Section 8.2 of the paper), where ANE could accelerate LoRA forward passes during multi-agent inference while the GPU handles the base model. This path is speculative and has not been benchmarked.

If you're interested in ANE internals, the bridge is based on [maderix/ANE](https://github.com/maderix/ANE) (MIT License) and requires macOS 15+ on Apple Silicon. Build with `cd src/bridge && make`. But this is **not required** to run any of the experiments or use the training system.

## Citation

```bibtex
@article{elbaz2026jitlora,
  title={JIT LoRA: Real-Time Conversational Knowledge Injection on Apple Silicon via MLX},
  author={Elbaz, E.},
  year={2026},
  url={https://github.com/eelbaz/jit-lora}
}
```

## License

MIT License. See [LICENSE](LICENSE) for details.