---
title: "JIT LoRA: Real-Time Conversational Knowledge Injection on Apple Silicon via MLX"
emoji: "\u26a1"
colorFrom: cyan
colorTo: blue
sdk: static
pinned: false
license: mit
library_name: mlx
tags:
- lora
- apple-silicon
- mlx
- fine-tuning
- jit-training
- real-time
- on-device
- research
- paper
language:
- en
---
# JIT LoRA: Real-Time Conversational Knowledge Injection on Apple Silicon via MLX
<p align="center">
<img src="figures/jarvis-interface.png" alt="J.A.R.V.I.S. — the voice-enabled AI assistant that rewrites its own weights mid-conversation" width="720">
</p>
**E. Elbaz** | Independent Research | March 2026
[Paper (PDF)](paper.pdf) | [GitHub](https://github.com/eelbaz/jit-lora)
---
## Abstract
JIT LoRA is a system for just-in-time (JIT) LoRA training that modifies a running language model's weights mid-conversation on consumer Apple Silicon hardware. The system, J.A.R.V.I.S., is a voice-enabled AI assistant that uses MLX-native autograd for gradient-based LoRA adaptation and updates its own weights after every response via background backpropagation.
## Key Results
### Results (35 real-world facts, Qwen3.5-2B-Base, 3 independent trials)
| Metric | Pooled | 95% Wilson CI |
|---|---|---|
| **Recall** | 61/105 (58.1%) | [48.5%, 67.1%] |
| **General Knowledge** | 60/60 (100.0%) | [94.0%, 100.0%] |
**Training:** 180 steps, 69.6s ± 1.2s on M4 Max. **Zero catastrophic forgetting.**
### Per-Category Recall
| Category | Score | 95% CI |
|---|---|---|
| Science | 3/3 (100%) | [43.8%, 100.0%] |
| Sports | 16/18 (88.9%) | [67.2%, 96.9%] |
| Awards | 18/21 (85.7%) | [65.4%, 95.0%] |
| Weather/Natural Events | 12/15 (80.0%) | [54.8%, 93.0%] |
| Technology/Business | 2/3 (66.7%) | [20.8%, 93.9%] |
| Entertainment | 4/12 (33.3%) | [13.8%, 60.9%] |
| Deaths/Obituaries | 6/33 (18.2%) | [8.6%, 34.4%] |
| **Excl. Deaths** | **55/72 (76.4%)** | **[65.4%, 84.8%]** |
### Cross-Domain Scaling (41 fictional facts, 10 interlocked domains)
| Category | Score |
|---|---|
| Direct Recall | 11/16 (69%) |
| Generalization | 9/16 (56%) |
| Cross-Domain Multi-Hop | 4/8 (50%) |
| Negation/Boundary | 5/5 (100%) |
| General Knowledge | 10/10 (100%) |
## Critical Findings
1. **Learning rate 10x higher than standard LoRA** (5e-4 vs 5e-5): JIT learning needs convergence in ~4 epochs, not thousands of steps. Gradient clipping (1.0) prevents instability.
2. **≥33% regularization ratio eliminates catastrophic forgetting**: Below this threshold, the model overwrites core knowledge. At ≥33%, general knowledge is preserved at 100% (CI: [94.0%, 100.0%]).
3. **mx.compile() hurts short training runs**: The ~20s first-trace overhead is not amortized in <200 steps. Per-step time is ~390ms without compilation.
4. **Batching doesn't help on Apple Silicon**: Training here is memory-bandwidth-limited, not compute-limited. Batch=8 takes 2.5s/step vs 0.42s/step for batch=1.
5. **Structurally similar facts confuse small models**: Deaths/obituaries (18.2%) all follow the "[Person] died on [Date]" pattern. The model learns the category but fabricates dates. Distinctive patterns (Sports, Awards) achieve 85-100%.
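
Finding 3 can be checked with back-of-the-envelope arithmetic. The sketch below uses only the figures quoted above (~20 s first-trace overhead, ~390 ms per uncompiled step, 180 training steps). The compiled per-step time is not reported in this README, so it is left as a parameter, and both function names are hypothetical.

```python
import math


def max_compiled_step_time(overhead_s: float, eager_step_s: float, steps: int) -> float:
    """Largest compiled per-step time (seconds) at which compiling still breaks even."""
    # Break-even condition: overhead + steps * compiled <= steps * eager
    return eager_step_s - overhead_s / steps


def break_even_steps(overhead_s: float, eager_step_s: float, compiled_step_s: float) -> float:
    """Number of steps needed before the one-time trace overhead is repaid."""
    saving = eager_step_s - compiled_step_s
    return math.inf if saving <= 0 else overhead_s / saving


budget = max_compiled_step_time(overhead_s=20.0, eager_step_s=0.39, steps=180)
print(f"compiled step must take under {budget:.3f}s to break even over 180 steps")

# Lower bound: even an (impossible) zero-cost compiled step needs this many steps.
print(f"minimum steps to repay the trace: {break_even_steps(20.0, 0.39, 0.0):.0f}")
```

Under these numbers, compilation would have to cut per-step time by roughly 28% (from ~390 ms to under ~279 ms) just to break even over a 180-step run.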
## Architecture
The training engine is **pure MLX**: `nn.value_and_grad()` for real autograd, an Adam optimizer, and a cosine LR schedule with early stopping. LoRA adapters are injected in-place into the model, so `mlx_lm.stream_generate()` automatically uses the updated weights with no special handling.
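
To illustrate the scheduling side of the engine, here is a framework-free sketch of a cosine learning-rate schedule paired with loss-based early stopping. It is not the code in `src/mlx_lora_trainer.py`: the names `cosine_lr` and `EarlyStopper`, and the `min_lr`, `patience`, and `min_delta` values, are assumptions chosen for illustration. Only the 5e-4 peak rate and the 180-step run length come from this README.

```python
import math


def cosine_lr(step: int, total_steps: int, peak_lr: float = 5e-4, min_lr: float = 0.0) -> float:
    """Cosine decay from peak_lr at step 0 down to min_lr at total_steps."""
    progress = min(max(step, 0), total_steps) / total_steps
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))


class EarlyStopper:
    """Signal a stop once the loss has failed to improve `patience` times in a row."""

    def __init__(self, patience: int = 2, min_delta: float = 1e-3) -> None:
        self.patience = patience
        self.min_delta = min_delta
        self.best = math.inf
        self.bad_checks = 0

    def should_stop(self, loss: float) -> bool:
        if loss < self.best - self.min_delta:
            # Meaningful improvement: record it and reset the counter.
            self.best = loss
            self.bad_checks = 0
        else:
            self.bad_checks += 1
        return self.bad_checks >= self.patience


# Learning rate at the start, midpoint, and end of a 180-step run.
for step in (0, 90, 180):
    print(step, f"{cosine_lr(step, 180):.2e}")
```

In a real loop, `should_stop` would be called once per epoch with the mean training loss, and the loop would exit as soon as it returns `True`.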
```
User → React Frontend → Express Proxy → Neural Daemon (FastAPI, :8766)
                                              ↓
                              MLX Inference with in-place LoRA adapter
                                              ↓
                              SSE Token Stream → Frontend → TTS
                                              ↓
                              [After response] MLX LoRA backprop (background)
                                              ↓
                              Updated adapter weights for next query
```
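
The "[After response] ... (background)" step in the diagram implies that training must not block the token stream. One way to arrange this, sketched here with the standard library only, is a single worker thread fed by a queue. `BackgroundTrainer` and `train_fn` are hypothetical names; the actual daemon in `src/neural_daemon.py` may schedule training differently.

```python
import queue
import threading
from typing import Callable


class BackgroundTrainer:
    """Run train_fn on submitted conversation turns, one at a time, off the request path."""

    def __init__(self, train_fn: Callable[[str], None]) -> None:
        self._train_fn = train_fn
        self._turns = queue.Queue()
        self.errors = []  # exceptions raised by train_fn, kept for inspection
        threading.Thread(target=self._run, daemon=True).start()

    def submit(self, turn: str) -> None:
        """Queue a finished turn for training; returns immediately."""
        self._turns.put(turn)

    def wait(self) -> None:
        """Block until every queued turn has been processed."""
        self._turns.join()

    def _run(self) -> None:
        while True:
            turn = self._turns.get()
            try:
                self._train_fn(turn)
            except Exception as exc:  # keep the worker alive after a failed step
                self.errors.append(exc)
            finally:
                self._turns.task_done()


# Stand-in for a real training step: just record what was seen.
seen = []
trainer = BackgroundTrainer(seen.append)
trainer.submit("turn 1")
trainer.submit("turn 2")
trainer.wait()
print(seen)
```

A single worker keeps updates strictly ordered, so each training step sees the adapter state left by the previous one.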
## Project Structure
```
├── src/
│   ├── mlx_lora_trainer.py       # Training engine: LoRALinear, nn.value_and_grad, Adam, early stopping
│   ├── neural_daemon.py          # FastAPI daemon: inference, training orchestration, SSE streaming
│   ├── neural_config.py          # Hyperparameter configuration
│   ├── neural_data.py            # Training data manager: rolling + replay buffers
│   ├── export_to_lms.py          # GGUF export for LM Studio
│   ├── ane_bridge_py.py          # [Experimental] Python ctypes wrapper for ANE bridge
│   ├── ane_lora_trainer.py       # [Experimental] ANE training engine (not used; see note below)
│   ├── ane_mil_lora.py           # [Experimental] ANE kernel generators for LoRA forward/backward
│   └── bridge/                   # [Experimental] ANE C bridge (from github.com/maderix/ANE, MIT)
├── tests/
│   ├── test_daemon_e2e.py        # Experiment 1: 4 fictional facts
│   ├── test_deep_e2e.py          # Experiment 2: 41 facts, 10 domains, 70 test cases
│   ├── test_statistical_e2e.py   # Experiment 3: real-world facts, 3 trials, CIs
│   ├── raw_facts_2026.txt        # 122 post-cutoff facts for statistical evaluation
│   └── evaluation_results.json   # Machine-readable results
├── figures/                      # Paper figures
└── paper.pdf                     # Compiled paper
```
## Hardware
- Apple Silicon Mac (M-series)
- Tested on M4 Max, 128GB unified memory
- Models ≤2B should work on 16GB machines
## Configuration
| Parameter | Value | Why |
|---|---|---|
| Learning rate | 5e-4 | 10x standard; converges in ~4 epochs |
| LoRA rank | 32 | Capacity for ~35 facts per session |
| LoRA targets | q, v, out, down_proj | Broad coverage (attention + MLP) |
| Max epochs | 15 | Early stop fires sooner |
| Regularization | ≥33% | Below this: catastrophic forgetting |
| Batch size | 1 | Per-example steps; batching doesn't help |
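
The table above can be captured as a small, immutable config object. This is a sketch, not the contents of `src/neural_config.py`: the class and field names are hypothetical, and it assumes that "regularization ratio" means the share of general-knowledge examples in the training mix, which this README does not spell out. The gradient-clipping value comes from the Critical Findings section.

```python
import math
from dataclasses import dataclass
from fractions import Fraction


@dataclass(frozen=True)
class JitLoraConfig:
    learning_rate: float = 5e-4
    lora_rank: int = 32
    lora_targets: tuple = ("q", "v", "out", "down_proj")  # as listed in the table
    max_epochs: int = 15
    batch_size: int = 1
    grad_clip: float = 1.0
    min_regularization_ratio: Fraction = Fraction(33, 100)

    def regularization_examples_needed(self, n_new: int) -> int:
        """Smallest r such that r / (n_new + r) >= min_regularization_ratio.

        Assumes the ratio is measured over the combined training mix.
        """
        r = self.min_regularization_ratio
        # Exact rational arithmetic avoids float rounding at the threshold.
        return math.ceil(r * n_new / (1 - r))


cfg = JitLoraConfig()
needed = cfg.regularization_examples_needed(35)
print(f"35 new facts need at least {needed} regularization examples")
```

Under that assumption, 35 new facts call for at least 18 regularization examples (18/53 ≈ 34%).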
## Setup
```bash
git clone https://github.com/eelbaz/jit-lora.git
cd jit-lora
pip install mlx mlx-lm fastapi uvicorn requests numpy
```
### Quick Validation
```bash
# Verify MLX training engine (downloads Qwen2.5-0.5B, trains 5 steps, ~30s)
python3 src/mlx_lora_trainer.py
```
### Full Experiments
```bash
# Terminal 1: Start daemon
python3 src/neural_daemon.py
# Terminal 2: Activate model + run experiments
curl -X POST http://localhost:8766/activate \
-H "Content-Type: application/json" \
-d '{"hf_repo":"Qwen/Qwen3.5-2B-Base"}'
python3 tests/test_daemon_e2e.py # 4 facts, ~20s
python3 tests/test_deep_e2e.py # 41 facts, ~121s
python3 tests/test_statistical_e2e.py # 35+ facts, 3 trials, ~4 min
```
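
The activation call can also be issued from Python. A minimal standard-library sketch, assuming the daemon is listening on `localhost:8766` as above; `build_activate_request` is a hypothetical helper, and because the response format is not documented here, the response body is not parsed.

```python
import json
import urllib.request


def build_activate_request(
    hf_repo: str, base_url: str = "http://localhost:8766"
) -> urllib.request.Request:
    """Build the POST /activate request equivalent to the curl command above."""
    body = json.dumps({"hf_repo": hf_repo}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/activate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_activate_request("Qwen/Qwen3.5-2B-Base")
print(req.get_method(), req.full_url)

# Sending requires the daemon to be running:
# with urllib.request.urlopen(req) as resp:
#     print(resp.status)
```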
## Note on ANE Code
The `ane_*.py` files and `bridge/` directory are **experimental and not used for training**. The initial approach attempted to run LoRA kernels directly on Apple's Neural Engine via the private `AppleNeuralEngine.framework`. While the forward kernels compile and run, ANE produces IOSurface-backed tensors that are opaque to any autograd system — making gradient-based training impossible through ANE alone.
All training in this project uses **MLX autograd on GPU**. The ANE code remains in the repo for a potential future hybrid inference path (see Section 8.2 of the paper), where ANE could accelerate LoRA forward passes during multi-agent inference while the GPU handles the base model. This path is speculative and has not been benchmarked.
If you're interested in ANE internals, the bridge is based on [maderix/ANE](https://github.com/maderix/ANE) (MIT License) and requires macOS 15+ on Apple Silicon. Build it with `cd src/bridge && make`. Building the bridge is **not required** to run any of the experiments or to use the training system.
## Citation
```bibtex
@article{elbaz2026jitlora,
title={JIT LoRA: Real-Time Conversational Knowledge Injection on Apple Silicon via MLX},
author={Elbaz, E.},
year={2026},
url={https://github.com/eelbaz/jit-lora}
}
```
## License
MIT License. See [LICENSE](LICENSE) for details.