# The Edge-Device LoRA Guide In standard Large Language Models, when you finish training a LoRA (Low-Rank Adaptation), you usually "merge" the LoRA matrices directly into the base weights so you only have to load one model. If you do that with MiniLM, **you completely destroy the 1.58-bit compression.** ### The Problem MiniLM's internal weights are ternary (`-1, 0, 1`). A LoRA introduces high-precision `FP16` adapter weights. If you attempt to merge them (`W = W_ternary + A*B`), the resulting mathematical matrix is no longer ternary. It becomes a massive floating-point matrix, destroying the memory efficiency required for edge devices. ### The Solution: "Side-Car" LoRAs To maintain the blazing fast, ultra-low memory footprint on an edge device (like a smartwatch or IoT sensor), MiniLM uses a "Side-Car" architecture. 1. **The Base Model (6.0 MB):** Stays completely frozen in 1.58-bit ternary precision. 2. **The LoRA Adapters (~1 MB):** Two tiny FP16 matrices (`A` and `B`) that sit next to the base layer. 3. **The Math:** During inference, the input flows through the 1.58-bit base layer (using fast integer math). Simultaneously, it flows through the tiny FP16 LoRA layer. The two outputs are simply added together at the end. ## Why this is the Holy Grail for Edge Devices Because the LoRAs are kept isolated as tiny ~1MB files, you can build an operating system for an edge device where you only keep **one** 3.9MB base model permanently loaded in RAM, and "hot-swap" tiny LoRAs on the fly depending on what app the user opens! - User speaks a Smart Home command -> Hot-load `lora_smarthome.pt`. - User asks to text their mom -> Drop the smart home LoRA, load `lora_sms.pt`. - User reviews a restaurant -> Load `lora_sentiment.pt`. ## Training Your Own LoRA MiniLM ships with `train_lora_dynamic.py` and an interactive Streamlit UI to let you build your own LoRAs in minutes. You simply need a JSON file of Input/Output pairs. Because the model is so small, highly constrained datasets (like Information Extraction, JSON formatting, or strict Classification) perform spectacularly. Example JSON (`my_dataset.json`): ```json [ {"input": "Tell Alex I'll be 5 minutes late", "output": "{\"contact\": \"Alex\", \"message\": \"I'll be 5 minutes late\"}"} ] ``` Run the training script: ```bash python3 train_lora_dynamic.py my_dataset.json 300 my_custom_lora.pt ``` In just 300 steps (which takes about 90 seconds on a Mac M3), the tiny 1MB side-car LoRA will perfectly memorize the extraction pattern, while the base 3.9MB model remains completely untouched!