---
title: "JIT LoRA: Real-Time Conversational Knowledge Injection on Apple Silicon via MLX"
emoji: "\u26a1"
colorFrom: cyan
colorTo: blue
sdk: static
pinned: false
license: mit
library_name: mlx
tags:
- lora
- apple-silicon
- mlx
- fine-tuning
- jit-training
- real-time
- on-device
- research
- paper
language:
- en
---
# JIT LoRA: Real-Time Conversational Knowledge Injection on Apple Silicon via MLX
<p align="center">
<img src="figures/jarvis-interface.png" alt="J.A.R.V.I.S. — the voice-enabled AI assistant that rewrites its own weights mid-conversation" width="720">
</p>
**E. Elbaz** | Independent Research | March 2026
[Paper (PDF)](paper.pdf) | [GitHub](https://github.com/eelbaz/jit-lora)
---
## Abstract
We present a system for just-in-time (JIT) LoRA training that modifies a running language model's weights mid-conversation on consumer Apple Silicon hardware. Using MLX-native autograd for gradient-based LoRA adaptation, the system (J.A.R.V.I.S., a voice-enabled AI assistant) updates its own weights after every response via background backpropagation.
## Key Results
### Statistical Evaluation (35 real-world facts, Qwen3.5-2B-Base, 3 independent trials)
| Metric | Pooled | 95% Wilson CI |
|---|---|---|
| **Recall** | 61/105 (58.1%) | [48.5%, 67.1%] |
| **General Knowledge** | 60/60 (100.0%) | [94.0%, 100.0%] |
**Training:** 180 steps, 69.6s ± 1.2s on M4 Max. **Zero catastrophic forgetting.**
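The intervals reported above follow the standard Wilson score construction for a binomial proportion; a minimal pure-Python sketch (assuming z = 1.96 for 95% coverage):

```python
import math

def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (default 95%)."""
    p = successes / n
    denom = 1 + z ** 2 / n
    center = p + z ** 2 / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2))
    return (center - margin) / denom, (center + margin) / denom

lo, hi = wilson_ci(61, 105)
print(f"{lo:.1%}, {hi:.1%}")  # matches the pooled recall row: 48.5%, 67.1%
```

Unlike the normal approximation, the Wilson interval stays sensible at the extremes, which is why the 60/60 general-knowledge result gets [94.0%, 100.0%] rather than a degenerate zero-width interval.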
### Per-Category Recall
| Category | Score | 95% CI |
|---|---|---|
| Science | 3/3 (100%) | [43.8%, 100.0%] |
| Sports | 16/18 (88.9%) | [67.2%, 96.9%] |
| Awards | 18/21 (85.7%) | [65.4%, 95.0%] |
| Weather/Natural Events | 12/15 (80.0%) | [54.8%, 93.0%] |
| Technology/Business | 2/3 (66.7%) | [20.8%, 93.9%] |
| Entertainment | 4/12 (33.3%) | [13.8%, 60.9%] |
| Deaths/Obituaries | 6/33 (18.2%) | [8.6%, 34.4%] |
| **Excl. Deaths** | **55/72 (76.4%)** | **[65.4%, 84.8%]** |
### Cross-Domain Scaling (41 fictional facts, 10 interlocked domains)
| Category | Score |
|---|---|
| Direct Recall | 11/16 (69%) |
| Generalization | 9/16 (56%) |
| Cross-Domain Multi-Hop | 4/8 (50%) |
| Negation/Boundary | 5/5 (100%) |
| General Knowledge | 10/10 (100%) |
## Critical Findings
1. **Learning rate 10x higher than standard LoRA** (5e-4 vs 5e-5): JIT learning needs convergence in ~4 epochs, not thousands of steps. Gradient clipping (1.0) prevents instability.
2. **≥33% regularization ratio eliminates catastrophic forgetting**: Below this threshold, the model overwrites core knowledge. At ≥33%, general knowledge is preserved at 100% (CI: [94.0%, 100.0%]).
3. **mx.compile() hurts short training runs**: The ~20s first-trace overhead is not amortized in <200 steps. Per-step time is ~390ms without compilation.
4. **Batching doesn't help on Apple Silicon**: Memory-bandwidth-limited, not compute-limited. Batch=8 takes 2.5s/step vs 0.42s/step for batch=1.
5. **Structurally similar facts confuse small models**: Deaths/obituaries (18.2%) all follow "[Person] died on [Date]" pattern. The model learns the category but fabricates dates. Distinctive patterns (Sports, Awards) achieve 85-100%.
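Finding 2 amounts to interleaving general-knowledge replay examples into each training pass so that replay makes up at least a third of the examples. A sketch of the batch-construction logic (the function and names here are illustrative, not the repo's API):

```python
import math

def mix_with_replay(new_facts: list[str], replay_pool: list[str],
                    replay_ratio: float = 1 / 3) -> list[str]:
    """Pad a batch of new facts with replay examples so that replay
    makes up at least `replay_ratio` of the training examples,
    guarding against catastrophic forgetting."""
    # With ratio r, k new facts need ceil(k * r / (1 - r)) replay examples.
    n_replay = math.ceil(len(new_facts) * replay_ratio / (1 - replay_ratio))
    return list(new_facts) + replay_pool[:n_replay]

batch = mix_with_replay([f"fact-{i}" for i in range(10)],
                        [f"replay-{i}" for i in range(20)])
# 10 new facts + 5 replay examples → replay fraction = 5/15 ≈ 33%
```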
## Architecture
The training engine is **pure MLX** — `nn.value_and_grad()` for real autograd, Adam optimizer, cosine LR with early stopping. LoRA adapters are injected in-place into the model, so `mlx_lm.stream_generate()` automatically uses the updated weights with no special handling.
```
User → React Frontend → Express Proxy → Neural Daemon (FastAPI, :8766)
                                  ↓
                MLX Inference with in-place LoRA adapter
                                  ↓
                  SSE Token Stream → Frontend → TTS
                                  ↓
           [After response] MLX LoRA backprop (background)
                                  ↓
              Updated adapter weights for next query
```
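The "in-place adapter" amounts to adding a scaled low-rank delta to each targeted linear layer. A minimal numpy sketch of the forward math (shapes and scaling here are illustrative; in the repo this lives inside `LoRALinear` in `src/mlx_lora_trainer.py`):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank, alpha = 64, 64, 32, 64.0

W = rng.standard_normal((d_out, d_in)) * 0.02  # frozen base weight
A = rng.standard_normal((rank, d_in)) * 0.01   # trainable down-projection
B = np.zeros((d_out, rank))                    # trainable up-projection, zero-init

def lora_forward(x):
    # Base path plus scaled low-rank update: y = x Wᵀ + (alpha/rank) · (x Aᵀ) Bᵀ
    return x @ W.T + (alpha / rank) * (x @ A.T) @ B.T

x = rng.standard_normal((1, d_in))
# With B zero-initialized, the adapter is a no-op until training updates it,
# so generation sees unchanged base-model behavior at injection time.
assert np.allclose(lora_forward(x), x @ W.T)
```

Because only `A` and `B` change during training, the base weights stay frozen and `mlx_lm.stream_generate()` picks up the update simply by calling the patched layer.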
## Project Structure
```
├── src/
│   ├── mlx_lora_trainer.py     # Training engine — LoRALinear, nn.value_and_grad, Adam, early stopping
│   ├── neural_daemon.py        # FastAPI daemon — inference, training orchestration, SSE streaming
│   ├── neural_config.py        # Hyperparameter configuration
│   ├── neural_data.py          # Training data manager — rolling + replay buffers
│   ├── export_to_lms.py        # GGUF export for LM Studio
│   ├── ane_bridge_py.py        # [Experimental] Python ctypes wrapper for ANE bridge
│   ├── ane_lora_trainer.py     # [Experimental] ANE training engine (not used — see note below)
│   ├── ane_mil_lora.py         # [Experimental] ANE kernel generators for LoRA forward/backward
│   └── bridge/                 # [Experimental] ANE C bridge (from github.com/maderix/ANE, MIT)
├── tests/
│   ├── test_daemon_e2e.py      # Experiment 1 — 4 fictional facts
│   ├── test_deep_e2e.py        # Experiment 2 — 41 facts, 10 domains, 70 test cases
│   ├── test_statistical_e2e.py # Experiment 3 — real-world facts, 3 trials, CIs
│   ├── raw_facts_2026.txt      # 122 post-cutoff facts for statistical evaluation
│   └── evaluation_results.json # Machine-readable results
├── figures/                    # Paper figures
└── paper.pdf                   # Compiled paper
```
## Hardware
- Apple Silicon Mac (M-series)
- Tested on M4 Max, 128GB unified memory
- Models ≤2B should work on 16GB machines
## Configuration
| Parameter | Value | Why |
|---|---|---|
| Learning rate | 5e-4 | 10x standard; converges in ~4 epochs |
| LoRA rank | 32 | Capacity for ~35 facts per session |
| LoRA targets | q, v, out, down_proj | Broad coverage (attention + MLP) |
| Max epochs | 15 | Early stop fires sooner |
| Regularization | ≥33% | Below this: catastrophic forgetting |
| Batch size | 1 | Per-example steps; batching doesn't help |
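For readers wiring these values into their own experiments, the table maps onto a config object along these lines (an illustrative sketch; the projection-layer names are assumptions — see `src/neural_config.py` for the actual fields):

```python
from dataclasses import dataclass

@dataclass
class JITLoRAConfig:
    learning_rate: float = 5e-4  # 10x standard LoRA; pair with grad clipping at 1.0
    lora_rank: int = 32          # capacity for ~35 facts per session
    lora_targets: tuple = ("q_proj", "v_proj", "out_proj", "down_proj")
    max_epochs: int = 15         # early stopping usually fires sooner
    replay_ratio: float = 1 / 3  # below this threshold: catastrophic forgetting
    batch_size: int = 1          # batching doesn't pay off on Apple Silicon

cfg = JITLoRAConfig()
```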
## Setup
```bash
git clone https://github.com/eelbaz/jit-lora.git
cd jit-lora
pip install mlx mlx-lm fastapi uvicorn requests numpy
```
### Quick Validation
```bash
# Verify MLX training engine (downloads Qwen2.5-0.5B, trains 5 steps, ~30s)
python3 src/mlx_lora_trainer.py
```
### Full Experiments
```bash
# Terminal 1: Start daemon
python3 src/neural_daemon.py
# Terminal 2: Activate model + run experiments
curl -X POST http://localhost:8766/activate \
  -H "Content-Type: application/json" \
  -d '{"hf_repo":"Qwen/Qwen3.5-2B-Base"}'
python3 tests/test_daemon_e2e.py # 4 facts, ~20s
python3 tests/test_deep_e2e.py # 41 facts, ~121s
python3 tests/test_statistical_e2e.py # 35+ facts, 3 trials, ~4 min
```
## Note on ANE Code
The `ane_*.py` files and `bridge/` directory are **experimental and not used for training**. The initial approach attempted to run LoRA kernels directly on Apple's Neural Engine via the private `AppleNeuralEngine.framework`. While the forward kernels compile and run, ANE produces IOSurface-backed tensors that are opaque to any autograd system — making gradient-based training impossible through ANE alone.
All training in this project uses **MLX autograd on GPU**. The ANE code remains in the repo for a potential future hybrid inference path (see Section 8.2 of the paper), where ANE could accelerate LoRA forward passes during multi-agent inference while the GPU handles the base model. This path is speculative and has not been benchmarked.
If you're interested in ANE internals, the bridge is based on [maderix/ANE](https://github.com/maderix/ANE) (MIT License) and requires macOS 15+ on Apple Silicon. Build with `cd src/bridge && make`. But this is **not required** to run any of the experiments or use the training system.
## Citation
```bibtex
@article{elbaz2026jitlora,
title={JIT LoRA: Real-Time Conversational Knowledge Injection on Apple Silicon via MLX},
author={Elbaz, E.},
year={2026},
url={https://github.com/eelbaz/jit-lora}
}
```
## License
MIT License. See [LICENSE](LICENSE) for details.