---
title: "JIT LoRA: Real-Time Conversational Knowledge Injection on Apple Silicon via MLX"
emoji: "\u26a1"
colorFrom: cyan
colorTo: blue
sdk: static
pinned: false
license: mit
library_name: mlx
tags:
  - lora
  - apple-silicon
  - mlx
  - fine-tuning
  - jit-training
  - real-time
  - on-device
  - research
  - paper
language:
  - en
---

# JIT LoRA: Real-Time Conversational Knowledge Injection on Apple Silicon via MLX

<p align="center">
  <img src="figures/jarvis-interface.png" alt="J.A.R.V.I.S. — the voice-enabled AI assistant that rewrites its own weights mid-conversation" width="720">
</p>

**E. Elbaz** | Independent Research | March 2026

[Paper (PDF)](paper.pdf) | [GitHub](https://github.com/eelbaz/jit-lora)

---

## Abstract

We present a system for just-in-time (JIT) LoRA training that modifies a running language model's weights mid-conversation on consumer Apple Silicon hardware. Using MLX-native autograd for gradient-based LoRA adaptation, the system (J.A.R.V.I.S., a voice-enabled AI assistant) updates its own weights after every response via background backpropagation.

## Key Results

### Results (35 real-world facts, Qwen3.5-2B-Base, 3 independent trials)

| Metric | Pooled | 95% Wilson CI |
|---|---|---|
| **Recall** | 61/105 (58.1%) | [48.5%, 67.1%] |
| **General Knowledge** | 60/60 (100.0%) | [94.0%, 100.0%] |

**Training:** 180 steps, 69.6s ± 1.2s on M4 Max. **Zero catastrophic forgetting.**
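The Wilson intervals above can be reproduced with a few lines of standalone Python (a sketch for verification; the repo's evaluation script may compute them differently):

```python
import math

def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    margin = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - margin, center + margin

lo, hi = wilson_ci(61, 105)       # recall: 61/105
g_lo, g_hi = wilson_ci(60, 60)    # general knowledge: 60/60
print(f"Recall CI: [{lo:.1%}, {hi:.1%}]")    # [48.5%, 67.1%]
print(f"GK CI: [{g_lo:.1%}, {g_hi:.1%}]")    # [94.0%, 100.0%]
```

Note the Wilson interval stays informative even at 60/60, where a naive normal approximation would collapse to a zero-width interval.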

### Per-Category Recall

| Category | Score | 95% CI |
|---|---|---|
| Science | 3/3 (100%) | [43.8%, 100.0%] |
| Sports | 16/18 (88.9%) | [67.2%, 96.9%] |
| Awards | 18/21 (85.7%) | [65.4%, 95.0%] |
| Weather/Natural Events | 12/15 (80.0%) | [54.8%, 93.0%] |
| Technology/Business | 2/3 (66.7%) | [20.8%, 93.9%] |
| Entertainment | 4/12 (33.3%) | [13.8%, 60.9%] |
| Deaths/Obituaries | 6/33 (18.2%) | [8.6%, 34.4%] |
| **Excl. Deaths** | **55/72 (76.4%)** | **[65.4%, 84.8%]** |

### Cross-Domain Scaling (41 fictional facts, 10 interlocked domains)

| Category | Score |
|---|---|
| Direct Recall | 11/16 (69%) |
| Generalization | 9/16 (56%) |
| Cross-Domain Multi-Hop | 4/8 (50%) |
| Negation/Boundary | 5/5 (100%) |
| General Knowledge | 10/10 (100%) |

## Critical Findings

1. **Learning rate 10x higher than standard LoRA** (5e-4 vs 5e-5): JIT learning needs convergence in ~4 epochs, not thousands of steps. Gradient clipping (1.0) prevents instability.

2. **≥33% regularization ratio eliminates catastrophic forgetting**: Below this threshold, the model overwrites core knowledge. At ≥33%, general knowledge is preserved at 100% (CI: [94.0%, 100.0%]).

3. **mx.compile() hurts short training runs**: The ~20s first-trace overhead is not amortized in <200 steps. Per-step time is ~390ms without compilation.

4. **Batching doesn't help on Apple Silicon**: Memory-bandwidth-limited, not compute-limited. Batch=8 takes 2.5s/step vs 0.42s/step for batch=1.

5. **Structurally similar facts confuse small models**: Deaths/obituaries (18.2%) all follow "[Person] died on [Date]" pattern. The model learns the category but fabricates dates. Distinctive patterns (Sports, Awards) achieve 85-100%.

## Architecture

The training engine is **pure MLX**: `nn.value_and_grad()` for real autograd, an Adam optimizer, and a cosine LR schedule with early stopping. LoRA adapters are injected in-place into the model, so `mlx_lm.stream_generate()` automatically uses the updated weights with no special handling.

```
User → React Frontend → Express Proxy → Neural Daemon (FastAPI, :8766)

                                    MLX Inference with in-place LoRA adapter

                                    SSE Token Stream → Frontend → TTS

                               [After response] MLX LoRA backprop (background)

                                    Updated adapter weights for next query
```
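The in-place mechanism works because a LoRA layer adds a low-rank delta to the frozen projection, so any caller of the layer (including `stream_generate`) sees updated weights as soon as the adapter matrices change. A NumPy sketch of the forward math (illustrative only; the actual layer is the `LoRALinear` class in `src/mlx_lora_trainer.py`):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 64, 32, 64                 # hidden dim, LoRA rank, scaling

W = rng.standard_normal((d, d))          # frozen base weight
A = rng.standard_normal((d, r)) * 0.01   # trainable down-projection
B = np.zeros((r, d))                     # trainable up-projection (zero-init)

def lora_forward(x):
    # Base path plus low-rank update; with B = 0 this equals the base model.
    return x @ W + (x @ A) @ B * (alpha / r)

x = rng.standard_normal((1, d))
assert np.allclose(lora_forward(x), x @ W)   # zero-init: identical output
B += rng.standard_normal((r, d)) * 0.01      # a "training step" mutates B
# lora_forward now produces shifted outputs without reloading the model
```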

## Project Structure

```
├── src/
│   ├── mlx_lora_trainer.py       # Training engine — LoRALinear, nn.value_and_grad, Adam, early stopping
│   ├── neural_daemon.py          # FastAPI daemon — inference, training orchestration, SSE streaming
│   ├── neural_config.py          # Hyperparameter configuration
│   ├── neural_data.py            # Training data manager — rolling + replay buffers
│   ├── export_to_lms.py          # GGUF export for LM Studio
│   ├── ane_bridge_py.py          # [Experimental] Python ctypes wrapper for ANE bridge
│   ├── ane_lora_trainer.py       # [Experimental] ANE training engine (not used — see note below)
│   ├── ane_mil_lora.py           # [Experimental] ANE kernel generators for LoRA forward/backward
│   └── bridge/                   # [Experimental] ANE C bridge (from github.com/maderix/ANE, MIT)
├── tests/
│   ├── test_daemon_e2e.py        # Experiment 1 — 4 fictional facts
│   ├── test_deep_e2e.py          # Experiment 2 — 41 facts, 10 domains, 70 test cases
│   ├── test_statistical_e2e.py   # Experiment 3 — real-world facts, 3 trials, CIs
│   ├── raw_facts_2026.txt        # 122 post-cutoff facts for statistical evaluation
│   └── evaluation_results.json   # Machine-readable results
├── figures/                      # Paper figures
└── paper.pdf                     # Compiled paper
```

## Hardware

- Apple Silicon Mac (M-series)
- Tested on M4 Max, 128GB unified memory
- Models ≤2B should work on 16GB machines

## Configuration

| Parameter | Value | Why |
|---|---|---|
| Learning rate | 5e-4 | 10x standard; converges in ~4 epochs |
| LoRA rank | 32 | Capacity for ~35 facts per session |
| LoRA targets | q, v, out, down_proj | Broad coverage (attention + MLP) |
| Max epochs | 15 | Early stop fires sooner |
| Regularization | ≥33% | Below this: catastrophic forgetting |
| Batch size | 1 | Per-example steps; batching doesn't help |
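In code, the table above maps to a configuration along these lines (the class and field names are illustrative; the canonical values live in `src/neural_config.py`):

```python
from dataclasses import dataclass

@dataclass
class JITLoRAConfig:
    learning_rate: float = 5e-4        # 10x standard LoRA; clip grads at 1.0
    lora_rank: int = 32                # capacity for ~35 facts per session
    lora_targets: tuple = ("q_proj", "v_proj", "out_proj", "down_proj")
    max_epochs: int = 15               # early stopping usually fires sooner
    regularization_ratio: float = 1/3  # below this: catastrophic forgetting
    batch_size: int = 1                # batching doesn't pay on Apple Silicon

cfg = JITLoRAConfig()
```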

## Setup

```bash
git clone https://github.com/eelbaz/jit-lora.git
cd jit-lora
pip install mlx mlx-lm fastapi uvicorn requests numpy
```

### Quick Validation

```bash
# Verify MLX training engine (downloads Qwen2.5-0.5B, trains 5 steps, ~30s)
python3 src/mlx_lora_trainer.py
```

### Full Experiments

```bash
# Terminal 1: Start daemon
python3 src/neural_daemon.py

# Terminal 2: Activate model + run experiments
curl -X POST http://localhost:8766/activate \
  -H "Content-Type: application/json" \
  -d '{"hf_repo":"Qwen/Qwen3.5-2B-Base"}'

python3 tests/test_daemon_e2e.py         # 4 facts, ~20s
python3 tests/test_deep_e2e.py           # 41 facts, ~121s
python3 tests/test_statistical_e2e.py    # 35+ facts, 3 trials, ~4 min
```

## Note on ANE Code

The `ane_*.py` files and `bridge/` directory are **experimental and not used for training**. The initial approach attempted to run LoRA kernels directly on Apple's Neural Engine via the private `AppleNeuralEngine.framework`. While the forward kernels compile and run, ANE produces IOSurface-backed tensors that are opaque to any autograd system — making gradient-based training impossible through ANE alone.

All training in this project uses **MLX autograd on GPU**. The ANE code remains in the repo for a potential future hybrid inference path (see Section 8.2 of the paper), where ANE could accelerate LoRA forward passes during multi-agent inference while the GPU handles the base model. This path is speculative and has not been benchmarked.

If you're interested in ANE internals, the bridge is based on [maderix/ANE](https://github.com/maderix/ANE) (MIT License) and requires macOS 15+ on Apple Silicon. Build with `cd src/bridge && make`. But this is **not required** to run any of the experiments or use the training system.

## Citation

```bibtex
@article{elbaz2026jitlora,
  title={JIT LoRA: Real-Time Conversational Knowledge Injection on Apple Silicon via MLX},
  author={Elbaz, E.},
  year={2026},
  url={https://github.com/eelbaz/jit-lora}
}
```

## License

MIT License. See [LICENSE](LICENSE) for details.