---
title: "JIT LoRA: Real-Time Conversational Knowledge Injection on Apple Silicon via MLX"
emoji: "\u26a1"
colorFrom: cyan
colorTo: blue
sdk: static
pinned: false
license: mit
library_name: mlx
tags:
- lora
- apple-silicon
- mlx
- fine-tuning
- jit-training
- real-time
- on-device
- research
- paper
language:
- en
---

# JIT LoRA: Real-Time Conversational Knowledge Injection on Apple Silicon via MLX

<p align="center">
<img src="figures/jarvis-interface.png" alt="J.A.R.V.I.S. — the voice-enabled AI assistant that rewrites its own weights mid-conversation" width="720">
</p>

**E. Elbaz** | Independent Research | March 2026

[Paper (PDF)](paper.pdf) | [GitHub](https://github.com/eelbaz/jit-lora)

---

## Abstract

This work presents a system for just-in-time (JIT) LoRA training that modifies a running language model's weights mid-conversation on consumer Apple Silicon hardware. The system, J.A.R.V.I.S., is a voice-enabled AI assistant that uses MLX-native autograd for gradient-based LoRA adaptation: after every response, it updates its own weights via background backpropagation.

## Key Results

### Results (35 real-world facts, Qwen3.5-2B-Base, 3 independent trials)

| Metric | Pooled | 95% Wilson CI |
|---|---|---|
| **Recall** | 61/105 (58.1%) | [48.5%, 67.1%] |
| **General Knowledge** | 60/60 (100.0%) | [94.0%, 100.0%] |

**Training:** 180 steps, 69.6s ± 1.2s on M4 Max. **No catastrophic forgetting observed** (general knowledge: 60/60).
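
The confidence intervals in these tables are 95% Wilson score intervals. As a minimal sketch (the function name is illustrative, not taken from the repository), the interval for `k` correct answers out of `n` probes needs only the standard library, and it reproduces the pooled rows above:

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    z2 = z * z
    denom = 1 + z2 / n
    center = (p + z2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z2 / (4 * n * n))
    # Clamp only guards against floating-point noise at the 0/1 boundaries.
    return max(0.0, center - half), min(1.0, center + half)


for label, k, n in [("Recall", 61, 105), ("General Knowledge", 60, 60)]:
    lo, hi = wilson_interval(k, n)
    print(f"{label}: {k}/{n} = {k / n:.1%}  CI [{lo:.1%}, {hi:.1%}]")
# Recall: 61/105 = 58.1%  CI [48.5%, 67.1%]
# General Knowledge: 60/60 = 100.0%  CI [94.0%, 100.0%]
```

Unlike the normal-approximation (Wald) interval, which collapses to [100%, 100%] at 60/60, the Wilson interval keeps an informative lower bound of 94.0% for a perfect score.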

### Per-Category Recall

| Category | Score | 95% Wilson CI |
|---|---|---|
| Science | 3/3 (100%) | [43.8%, 100.0%] |
| Sports | 16/18 (88.9%) | [67.2%, 96.9%] |
| Awards | 18/21 (85.7%) | [65.4%, 95.0%] |
| Weather/Natural Events | 12/15 (80.0%) | [54.8%, 93.0%] |
| Technology/Business | 2/3 (66.7%) | [20.8%, 93.9%] |
| Entertainment | 4/12 (33.3%) | [13.8%, 60.9%] |
| Deaths/Obituaries | 6/33 (18.2%) | [8.6%, 34.4%] |
| **Excl. Deaths** | **55/72 (76.4%)** | **[65.4%, 84.7%]** |
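
The pooled recall figure is the sum of these seven rows, and the **Excl. Deaths** row simply drops the 33 obituary probes (6 correct). A short sketch that re-derives both totals from the per-category counts (the dictionary is the table transcribed; `pool` is an illustrative helper):

```python
# Per-category (correct, total) counts, transcribed from the table.
CATEGORY_COUNTS = {
    "Science": (3, 3),
    "Sports": (16, 18),
    "Awards": (18, 21),
    "Weather/Natural Events": (12, 15),
    "Technology/Business": (2, 3),
    "Entertainment": (4, 12),
    "Deaths/Obituaries": (6, 33),
}


def pool(counts: dict[str, tuple[int, int]], exclude: tuple[str, ...] = ()) -> tuple[int, int]:
    """Sum correct and total counts over every category not listed in `exclude`."""
    kept = [v for name, v in counts.items() if name not in exclude]
    return sum(c for c, _ in kept), sum(t for _, t in kept)


for label, excl in [("All categories", ()), ("Excl. Deaths", ("Deaths/Obituaries",))]:
    c, t = pool(CATEGORY_COUNTS, excl)
    print(f"{label}: {c}/{t} ({c / t:.1%})")
# All categories: 61/105 (58.1%)
# Excl. Deaths: 55/72 (76.4%)
```

Deaths make up 33 of the 105 probes (31%), so this one category pulls the pooled figure from 76.4% down to 58.1%.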

### Cross-Domain Scaling (41 fictional facts, 10 interlocked domains)

| Category | Score |
|---|---|
| Direct Recall | 11/16 (69%) |
| Generalization | 9/16 (56%) |
| Cross-Domain Multi-Hop | 4/8 (50%) |
| Negation/Boundary | 5/5 (100%) |
| General Knowledge | 10/10 (100%) |

## Critical Findings

1. **Learning rate 10x higher than standard LoRA** (5e-4 vs. 5e-5): JIT learning must converge within ~4 epochs rather than over thousands of steps. Gradient clipping (1.0) prevents instability.

2. **A regularization ratio of ≥33% eliminates catastrophic forgetting**: below this threshold, the model overwrites core knowledge; at ≥33%, general knowledge is preserved at 100% (95% CI: [94.0%, 100.0%]).

3. **`mx.compile()` hurts short training runs**: the ~20s first-trace overhead is not amortized over runs of fewer than 200 steps. Without compilation, each step takes ~390ms.

4. **Batching doesn't help on Apple Silicon**: training is memory-bandwidth-limited, not compute-limited. A batch of 8 takes 2.5s/step vs. 0.42s/step for a batch of 1.

5. **Structurally similar facts confuse small models**: the deaths/obituaries facts (18.2% recall) all follow the "[Person] died on [Date]" pattern; the model learns the category but fabricates the dates. Categories with more distinctive patterns (Science, Sports, Awards) reach 85.7% to 100%.
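
Finding 2 can be turned into a concrete sampling rule. The sketch below assumes the ratio means the fraction of regularization examples (for instance, general-knowledge prompts the base model already answers correctly) in the whole training set, and reads "≥33%" as at least one third; the README does not pin down either detail. `regularization_count` and `mix_training_set` are hypothetical helpers, not the API of `src/neural_data.py`.

```python
import math
import random
from fractions import Fraction


def regularization_count(n_new: int, ratio: Fraction = Fraction(1, 3)) -> int:
    """Smallest k with k / (n_new + k) >= ratio, i.e. k >= n_new * ratio / (1 - ratio)."""
    if not 0 <= ratio < 1:
        raise ValueError("ratio must be in [0, 1)")
    return math.ceil(n_new * ratio / (1 - ratio))


def mix_training_set(new_examples, reg_pool, ratio=Fraction(1, 3), seed=0):
    """Combine new-fact examples with enough sampled regularization examples, then shuffle."""
    k = regularization_count(len(new_examples), ratio)
    if k > len(reg_pool):
        raise ValueError(f"need {k} regularization examples, pool has {len(reg_pool)}")
    rng = random.Random(seed)
    mixed = list(new_examples) + rng.sample(list(reg_pool), k)
    rng.shuffle(mixed)
    return mixed


facts = [f"fact-{i}" for i in range(35)]       # placeholder new-fact examples
general = [f"general-{i}" for i in range(100)]  # placeholder regularization pool
batch = mix_training_set(facts, general)
n_reg = sum(x.startswith("general-") for x in batch)
print(len(batch), n_reg, f"{n_reg / len(batch):.1%}")  # 53 18 34.0%
```

Doing the arithmetic with `Fraction` keeps exact cases exact: 34 facts need exactly 17 regularization examples, with no floating-point round-off nudging the ceiling up to 18.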

## Architecture

The training engine is **pure MLX**: `nn.value_and_grad()` provides real autograd, combined with the Adam optimizer and a cosine learning-rate schedule with early stopping. LoRA adapters are injected in place into the model, so `mlx_lm.stream_generate()` automatically uses the updated weights with no special handling.
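
The schedule and stopping rule are only named here, so the following is a framework-free sketch under stated assumptions: cosine decay from the 5e-4 peak learning rate to zero over the planned number of steps, and an early-stopping rule that fires when the epoch loss reaches a target or stops improving. `cosine_lr`, `EarlyStopper`, and their `patience`, `min_delta`, and `target_loss` defaults are illustrative choices, not values from the repository; inside an MLX trainer the schedule would more naturally come from `mlx.optimizers.cosine_decay`.

```python
import math


def cosine_lr(step: int, total_steps: int, base_lr: float = 5e-4, min_lr: float = 0.0) -> float:
    """Cosine decay from base_lr at step 0 to min_lr at step total_steps."""
    progress = min(step, total_steps) / max(1, total_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


class EarlyStopper:
    """Signal a stop when the epoch loss reaches target_loss, or has not improved
    by at least min_delta for `patience` consecutive epochs (defaults are assumptions)."""

    def __init__(self, patience: int = 2, min_delta: float = 1e-3, target_loss: float = 0.05):
        self.patience = patience
        self.min_delta = min_delta
        self.target_loss = target_loss
        self.best = math.inf
        self.stale_epochs = 0

    def should_stop(self, epoch_loss: float) -> bool:
        if epoch_loss <= self.target_loss:
            return True  # loss target reached: the facts are fitted well enough
        if epoch_loss < self.best - self.min_delta:
            self.best = epoch_loss
            self.stale_epochs = 0
        else:
            self.stale_epochs += 1
        return self.stale_epochs >= self.patience


MAX_EPOCHS = 15  # upper bound from the configuration table
losses = [2.1, 1.2, 0.6, 0.3, 0.29, 0.295, 0.30]  # made-up per-epoch losses
stopper = EarlyStopper()
stopped_at = None
for epoch, loss in enumerate(losses[:MAX_EPOCHS], start=1):
    if stopper.should_stop(loss):
        stopped_at = epoch
        break
print(f"stopped after epoch {stopped_at}; lr at step 90 of 180: {cosine_lr(90, 180):.2e}")
# stopped after epoch 7; lr at step 90 of 180: 2.50e-04
```

In this made-up loss trace the patience rule fires at epoch 7, well before the 15-epoch cap; a run whose loss hits the target earlier would stop sooner still.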

```
User → React Frontend → Express Proxy → Neural Daemon (FastAPI, :8766)
                                              ↓
                        MLX Inference with in-place LoRA adapter
                                              ↓
                         SSE Token Stream → Frontend → TTS
                                              ↓
                     [After response] MLX LoRA backprop (background)
                                              ↓
                        Updated adapter weights for next query
```
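
The last two stages of the diagram (backprop after the reply, updated weights for the next query) imply some coordination between inference and training. How `neural_daemon.py` actually does this is not described here; the sketch below shows one possible design under that caveat, using a single worker thread, a job queue, and a lock so that generation never reads adapter weights halfway through an update. `BackgroundTrainer`, `train_fn`, and `generate_fn` are hypothetical stand-ins for the real MLX calls.

```python
import queue
import threading
from typing import Callable


class BackgroundTrainer:
    """Hypothetical sketch: run one training update per finished exchange on a worker thread."""

    def __init__(self, train_fn: Callable[[str, str], None]) -> None:
        self.train_fn = train_fn              # stand-in for one MLX LoRA backprop pass
        self.weights_lock = threading.Lock()  # held while adapter weights are read or written
        self.jobs: queue.Queue = queue.Queue()
        self.worker = threading.Thread(target=self._run, daemon=True)
        self.worker.start()

    def generate(self, generate_fn: Callable[[str], str], prompt: str) -> str:
        """Run inference with the lock held so a reply never mixes old and new weights."""
        with self.weights_lock:
            return generate_fn(prompt)

    def on_response_complete(self, prompt: str, reply: str) -> None:
        """Queue the exchange once the reply has been fully streamed; returns immediately."""
        self.jobs.put((prompt, reply))

    def _run(self) -> None:
        while True:
            prompt, reply = self.jobs.get()
            try:
                with self.weights_lock:
                    self.train_fn(prompt, reply)
            except Exception as exc:  # keep the worker alive if one update fails
                print(f"background training step failed: {exc!r}")
            finally:
                self.jobs.task_done()


trained = []
trainer = BackgroundTrainer(train_fn=lambda p, r: trained.append((p, r)))
prompt = "What changed in the 2026 season?"
reply = trainer.generate(lambda p: f"(model reply to: {p})", prompt)
trainer.on_response_complete(prompt, reply)
trainer.jobs.join()  # demo only: block until the queued update has run
print(trained)
```

Holding the lock for the whole generation call means an update queued mid-reply waits until the stream ends, matching the "after response" ordering in the diagram; the trade-off is that a query arriving during an update also waits.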

## Project Structure

```
├── src/
│   ├── mlx_lora_trainer.py      # Training engine — LoRALinear, nn.value_and_grad, Adam, early stopping
│   ├── neural_daemon.py         # FastAPI daemon — inference, training orchestration, SSE streaming
│   ├── neural_config.py         # Hyperparameter configuration
│   ├── neural_data.py           # Training data manager — rolling + replay buffers
│   ├── export_to_lms.py         # GGUF export for LM Studio
│   ├── ane_bridge_py.py         # [Experimental] Python ctypes wrapper for ANE bridge
│   ├── ane_lora_trainer.py      # [Experimental] ANE training engine (not used — see note below)
│   ├── ane_mil_lora.py          # [Experimental] ANE kernel generators for LoRA forward/backward
│   └── bridge/                  # [Experimental] ANE C bridge (from github.com/maderix/ANE, MIT)
├── tests/
│   ├── test_daemon_e2e.py       # Experiment 1 — 4 fictional facts
│   ├── test_deep_e2e.py         # Experiment 2 — 41 facts, 10 domains, 70 test cases
│   ├── test_statistical_e2e.py  # Experiment 3 — real-world facts, 3 trials, CIs
│   ├── raw_facts_2026.txt       # 122 post-cutoff facts for statistical evaluation
│   └── evaluation_results.json  # Machine-readable results
├── figures/                     # Paper figures
└── paper.pdf                    # Compiled paper
```

## Hardware

- Apple Silicon Mac (M-series)
- Tested on M4 Max, 128GB unified memory
- Models with ≤2B parameters should work on 16GB machines
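
As a rough check on the 16GB guideline, weight memory is parameter count times bytes per parameter. The sketch assumes 16-bit weights for a 2B-parameter model and, for comparison, 8-bit and 4-bit quantized variants; it deliberately ignores activations, KV cache, LoRA adapters, optimizer state, and the share of unified memory used by macOS, so treat the numbers as a lower bound rather than a requirement.

```python
def weight_memory_gib(n_params: float, bytes_per_param: float) -> float:
    """Memory for the weights alone, in GiB (2**30 bytes)."""
    return n_params * bytes_per_param / 2**30


for label, nbytes in [("16-bit (bf16/fp16)", 2), ("8-bit", 1), ("4-bit", 0.5)]:
    print(f"2B parameters, {label}: {weight_memory_gib(2e9, nbytes):.2f} GiB")
# 2B parameters, 16-bit (bf16/fp16): 3.73 GiB
# 2B parameters, 8-bit: 1.86 GiB
# 2B parameters, 4-bit: 0.93 GiB
```

Even the 16-bit case leaves most of a 16GB machine for training-time overhead, though the real headroom depends on sequence length and the adapter configuration.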

## Configuration

| Parameter | Value | Why |
|---|---|---|
| Learning rate | 5e-4 | 10x standard; converges in ~4 epochs |
| LoRA rank | 32 | Capacity for ~35 facts per session |
| LoRA targets | q, v, out, down_proj | Broad coverage (attention + MLP) |
| Max epochs | 15 | Early stop fires sooner |
| Regularization | ≥33% | Below this: catastrophic forgetting |
| Batch size | 1 | Per-example steps; batching doesn't help |
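
One way to keep these settings in a single, validated place is a frozen dataclass. This is a hypothetical sketch: the field names and validation rules are illustrative and are not the contents of `src/neural_config.py`. The defaults come from the table, the gradient-clipping value from the Critical Findings, and the meaning of the regularization ratio (share of regularization examples in the training set) is an assumption.

```python
from dataclasses import dataclass

MIN_REGULARIZATION_RATIO = 0.33  # "≥33%" from the table; forgetting was reported below this


@dataclass(frozen=True)
class JITLoRAConfig:
    learning_rate: float = 5e-4          # 10x standard LoRA (5e-5); converges in ~4 epochs
    lora_rank: int = 32                  # capacity for ~35 facts per session
    lora_targets: tuple[str, ...] = ("q", "v", "out", "down_proj")
    max_epochs: int = 15                 # early stopping normally ends training sooner
    regularization_ratio: float = 1 / 3  # assumed: share of regularization examples per training set
    batch_size: int = 1                  # per-example steps
    grad_clip: float = 1.0               # gradient clipping value from Critical Finding 1

    def __post_init__(self) -> None:
        if self.learning_rate <= 0 or self.lora_rank <= 0 or self.max_epochs <= 0:
            raise ValueError("learning_rate, lora_rank and max_epochs must be positive")
        if self.batch_size < 1:
            raise ValueError("batch_size must be at least 1")
        if not MIN_REGULARIZATION_RATIO <= self.regularization_ratio < 1:
            raise ValueError(
                f"regularization_ratio={self.regularization_ratio} is outside "
                f"[{MIN_REGULARIZATION_RATIO}, 1); forgetting was reported below 33%"
            )


cfg = JITLoRAConfig()
print(cfg.learning_rate, cfg.lora_rank, cfg.lora_targets)
try:
    JITLoRAConfig(regularization_ratio=0.2)
except ValueError as err:
    print("rejected:", err)
```

Freezing the dataclass means a running daemon cannot silently drift from the validated values; changing a setting requires constructing a new config, which re-runs the checks.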

## Setup

```bash
git clone https://github.com/eelbaz/jit-lora.git
cd jit-lora
pip install mlx mlx-lm fastapi uvicorn requests numpy
```

### Quick Validation

```bash
# Verify MLX training engine (downloads Qwen2.5-0.5B, trains 5 steps, ~30s)
python3 src/mlx_lora_trainer.py
```

### Full Experiments

```bash
# Terminal 1: Start daemon
python3 src/neural_daemon.py

# Terminal 2: Activate model + run experiments
curl -X POST http://localhost:8766/activate \
  -H "Content-Type: application/json" \
  -d '{"hf_repo":"Qwen/Qwen3.5-2B-Base"}'

python3 tests/test_daemon_e2e.py        # 4 facts, ~20s
python3 tests/test_deep_e2e.py          # 41 facts, ~121s
python3 tests/test_statistical_e2e.py   # 35+ facts, 3 trials, ~4 min
```
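
For scripted runs, the same `/activate` call can be issued from Python with only the standard library. This is a sketch: the endpoint, port, and JSON body mirror the curl command, while the helper names and the 600-second timeout (generous, on the assumption that the first activation may download the model) are illustrative. The daemon's response format is not documented here, so the helper returns the raw body.

```python
import json
import urllib.request

DAEMON_URL = "http://localhost:8766"  # port used by the daemon in the commands above


def build_activate_request(hf_repo: str, base_url: str = DAEMON_URL) -> urllib.request.Request:
    """Build the same POST /activate request as the curl command."""
    body = json.dumps({"hf_repo": hf_repo}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/activate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def activate(hf_repo: str, timeout: float = 600.0) -> str:
    """Send the request and return the raw response body (daemon must be running)."""
    with urllib.request.urlopen(build_activate_request(hf_repo), timeout=timeout) as resp:
        return resp.read().decode("utf-8")


req = build_activate_request("Qwen/Qwen3.5-2B-Base")
print(req.get_method(), req.full_url)  # POST http://localhost:8766/activate
# With the daemon from Terminal 1 running: print(activate("Qwen/Qwen3.5-2B-Base"))
```

Keeping request construction separate from sending makes the payload easy to inspect or test without a live daemon.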

## Note on ANE Code

The `ane_*.py` files and `bridge/` directory are **experimental and not used for training**. The initial approach attempted to run LoRA kernels directly on Apple's Neural Engine via the private `AppleNeuralEngine.framework`. While the forward kernels compile and run, ANE produces IOSurface-backed tensors that are opaque to any autograd system — making gradient-based training impossible through ANE alone.

All training in this project uses **MLX autograd on GPU**. The ANE code remains in the repo for a potential future hybrid inference path (see Section 8.2 of the paper), where ANE could accelerate LoRA forward passes during multi-agent inference while the GPU handles the base model. This path is speculative and has not been benchmarked.

If you're interested in ANE internals, the bridge is based on [maderix/ANE](https://github.com/maderix/ANE) (MIT License) and requires macOS 15+ on Apple Silicon. Build with `cd src/bridge && make`. But this is **not required** to run any of the experiments or use the training system.

## Citation

```bibtex
@article{elbaz2026jitlora,
  title={JIT LoRA: Real-Time Conversational Knowledge Injection on Apple Silicon via MLX},
  author={Elbaz, E.},
  year={2026},
  url={https://github.com/eelbaz/jit-lora}
}
```

## License

MIT License. See [LICENSE](LICENSE) for details.