| # The Edge-Device LoRA Guide |
|
|
| With a standard large language model, once you finish training a LoRA (Low-Rank Adaptation) you usually "merge" the LoRA matrices directly into the base weights, so there is only one model to load. |
|
|
| If you do that with MiniLM, **you completely destroy the 1.58-bit compression.** |
|
|
| ## The Problem |
| MiniLM's internal weights are ternary (`-1, 0, 1`). A LoRA introduces high-precision `FP16` adapter weights. If you merge them (`W = W_ternary + A*B`), the result is no longer ternary. It becomes a dense floating-point matrix that needs 16 bits per weight instead of roughly 1.58, about ten times the storage, which wipes out the memory efficiency that edge devices require. |
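
The effect is easy to reproduce on a toy matrix. The sketch below is plain Python with made-up values. It follows the `W = W_ternary + A*B` convention above, so `A` is `out x rank` and `B` is `rank x in`. The 2x3 shape, the rank of 1, and the helper names `matmul` and `is_ternary` are illustrative assumptions, not part of MiniLM.

```python
import math

# Toy 2x3 ternary weight matrix: every entry is -1, 0, or 1.
W_ternary = [
    [1, 0, -1],
    [0, 1, 1],
]

# Hypothetical rank-1 LoRA factors holding small floating-point values.
A = [[0.5], [-0.25]]          # out_features x rank
B = [[0.5, 0.25, -0.5]]       # rank x in_features


def matmul(x, y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def is_ternary(matrix):
    """Return True when every entry is exactly -1, 0, or 1."""
    return all(v in (-1, 0, 1) for row in matrix for v in row)


# Merge the adapter into the base weights: W = W_ternary + A*B.
delta = matmul(A, B)
W_merged = [
    [w + d for w, d in zip(w_row, d_row)]
    for w_row, d_row in zip(W_ternary, delta)
]

print(is_ternary(W_ternary))  # True
print(is_ternary(W_merged))   # False

# Storage cost per weight: FP16 versus an ideally packed ternary value.
ratio = 16 / math.log2(3)
print(round(ratio, 1))        # 10.1
```

Once a single entry leaves `{-1, 0, 1}`, the whole matrix has to be stored in floating point, which is where the roughly tenfold growth comes from.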
|
|
| ## The Solution: "Side-Car" LoRAs |
| To keep inference blazing fast and the memory footprint ultra-low on an edge device (like a smartwatch or IoT sensor), MiniLM uses a "Side-Car" architecture. |
|
|
| 1. **The Base Model (3.9 MB):** Stays completely frozen in 1.58-bit ternary precision. |
| 2. **The LoRA Adapters (~1 MB):** Two tiny FP16 matrices (`A` and `B`) that sit next to the base layer. |
| 3. **The Math:** During inference, the input flows through the 1.58-bit base layer (using fast integer math). Simultaneously, it flows through the tiny FP16 LoRA layer. The two outputs are simply added together at the end. |
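
The three steps above can be sketched in a few lines of plain Python. This is a minimal illustration with toy values, not MiniLM's actual kernel. The function names are hypothetical, the activations are assumed to be integers for the base path, and any LoRA scaling factor is left out.

```python
# Toy weights: a frozen ternary base layer plus a hypothetical rank-1 adapter.
W_ternary = [
    [1, 0, -1],
    [0, 1, 1],
]
A = [[0.5], [-0.25]]          # out_features x rank
B = [[0.5, 0.25, -0.5]]       # rank x in_features


def matvec(matrix, vector):
    """Ordinary matrix-vector product."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def ternary_matvec(W, x):
    """Base path: with weights in {-1, 0, 1}, each term is an add, a subtract, or a skip."""
    out = []
    for row in W:
        acc = 0
        for w, xi in zip(row, x):
            if w == 1:
                acc += xi
            elif w == -1:
                acc -= xi
        out.append(acc)
    return out


def sidecar_forward(W, A, B, x):
    """Run both paths on the same input and add the results."""
    base = ternary_matvec(W, x)
    # LoRA path: apply B first, then A, so the full A*B product is never formed.
    lora = matvec(A, matvec(B, x))
    return [b + l for b, l in zip(base, lora)]


x = [2, 4, 8]
y = sidecar_forward(W_ternary, A, B, x)
print(y)  # [-7.0, 12.5]
```

The output equals what the merged matrix `W_ternary + A*B` would produce, but the base weights never leave their ternary form.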
|
|
| ## Why this is the Holy Grail for Edge Devices |
| Because the LoRAs stay isolated as tiny ~1 MB files, you can build an operating system for an edge device that keeps only **one** 3.9 MB base model permanently loaded in RAM and "hot-swaps" tiny LoRAs on the fly, depending on which app the user opens! |
|
|
| - User speaks a Smart Home command -> Hot-load `lora_smarthome.pt`. |
| - User asks to text their mom -> Drop the smart home LoRA, load `lora_sms.pt`. |
| - User reviews a restaurant -> Load `lora_sentiment.pt`. |
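
One way to organize the swapping is a single "slot" that holds at most one adapter at a time. The sketch below is an assumption about how such an OS layer might look, not code that ships with MiniLM. `AdapterSlot`, `APP_TO_LORA`, and `fake_load` are hypothetical names. If the `.pt` files are ordinary PyTorch checkpoints, a real loader such as `torch.load` could be passed in place of `fake_load`.

```python
# Hypothetical mapping from app intent to the adapter files named above.
APP_TO_LORA = {
    "smart_home": "lora_smarthome.pt",
    "sms": "lora_sms.pt",
    "review": "lora_sentiment.pt",
}


class AdapterSlot:
    """Holds at most one side-car adapter next to a base model that stays loaded."""

    def __init__(self, load_fn):
        self._load_fn = load_fn   # callable: path -> adapter object
        self.active_path = None
        self.adapter = None

    def activate(self, path):
        """Swap in the adapter at `path`; do nothing if it is already active."""
        if path == self.active_path:
            return self.adapter
        # Drop the old adapter first so two adapters are never resident at once.
        self.adapter, self.active_path = None, None
        self.adapter = self._load_fn(path)
        self.active_path = path
        return self.adapter


# Stand-in loader that only records which files were requested.
loads = []


def fake_load(path):
    loads.append(path)
    return {"path": path}


slot = AdapterSlot(fake_load)
for app in ["smart_home", "smart_home", "sms", "review"]:
    slot.activate(APP_TO_LORA[app])

print(loads)  # ['lora_smarthome.pt', 'lora_sms.pt', 'lora_sentiment.pt']
```

The repeated smart home request does not trigger a second load, and only the small adapter file is ever read from storage. The base model is never touched.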
|
|
| ## Training Your Own LoRA |
| MiniLM ships with `train_lora_dynamic.py` and an interactive Streamlit UI to let you build your own LoRAs in minutes. |
|
|
| You simply need a JSON file of input/output pairs. Because the model is so small, it performs spectacularly on highly constrained tasks such as information extraction, JSON formatting, or strict classification. |
|
|
| Example JSON (`my_dataset.json`): |
| ```json |
| [ |
| {"input": "Tell Alex I'll be 5 minutes late", "output": "{\"contact\": \"Alex\", \"message\": \"I'll be 5 minutes late\"}"} |
| ] |
| ``` |
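
Checking the file before training can save a wasted run. The validator below is a hedged sketch, not part of `train_lora_dynamic.py`. It assumes the format shown above (a top-level array of objects with string `input` and `output` fields). It also checks that an `output` starting with `{` parses as JSON, which only matters for extraction-style datasets like this one. `validate_dataset` is a hypothetical helper name.

```python
import json


def validate_dataset(records):
    """Return a list of problems; an empty list means the data looks usable."""
    if not isinstance(records, list):
        return ["top level must be a JSON array"]
    problems = []
    for i, rec in enumerate(records):
        if not isinstance(rec, dict):
            problems.append(f"record {i}: not an object")
            continue
        for key in ("input", "output"):
            if not isinstance(rec.get(key), str):
                problems.append(f"record {i}: missing or non-string '{key}'")
        out = rec.get("output")
        # Assumption: outputs that look like JSON objects are meant to parse.
        if isinstance(out, str) and out.lstrip().startswith("{"):
            try:
                json.loads(out)
            except json.JSONDecodeError as err:
                problems.append(f"record {i}: output is not valid JSON ({err.msg})")
    return problems


good = [{
    "input": "Tell Alex I'll be 5 minutes late",
    "output": json.dumps({"contact": "Alex", "message": "I'll be 5 minutes late"}),
}]
bad = [{"input": "Tell Alex hi", "output": '{"contact": "Alex", '}]

print(validate_dataset(good))  # []
print(len(validate_dataset(bad)))  # 1
```

To check a real file, load it with `json.load(open("my_dataset.json"))` and pass the result to `validate_dataset`.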
|
|
| Run the training script, passing the dataset file, the number of training steps, and the output file name: |
| ```bash |
| python3 train_lora_dynamic.py my_dataset.json 300 my_custom_lora.pt |
| ``` |
|
|
| In just 300 steps (about 90 seconds on an M3 Mac), the tiny ~1 MB side-car LoRA will perfectly memorize the extraction pattern, while the 3.9 MB base model remains completely untouched! |
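
For a feel of what those file sizes mean, here is some back-of-the-envelope arithmetic in Python. It is only a sketch: the 256x256 layer shape and rank 8 are made-up values, not MiniLM's real configuration. The capacity figure assumes a file containing nothing but ideally packed ternary weights.

```python
import math

BITS_PER_TERNARY_WEIGHT = math.log2(3)   # about 1.585, the "1.58-bit" figure
BYTES_PER_FP16 = 2


def lora_params(out_features, in_features, rank):
    """Parameters in one adapter pair: A is (out x rank), B is (rank x in)."""
    return rank * (out_features + in_features)


def ternary_weight_capacity(num_bytes):
    """Upper bound on ternary weights that fit in num_bytes with ideal packing."""
    return int(num_bytes * 8 / BITS_PER_TERNARY_WEIGHT)


# A 1 MB adapter file holds at most this many FP16 values.
fp16_budget = 1_000_000 // BYTES_PER_FP16
print(fp16_budget)                 # 500000

# Hypothetical layer: 256x256 with rank 8.
per_layer = lora_params(256, 256, 8)
print(per_layer)                   # 4096
print(fp16_budget // per_layer)    # 122

# A 3.9 MB ternary file could hold roughly 19.7 million weights at best.
base_capacity = ternary_weight_capacity(3_900_000)
```

Under these assumptions, a ~1 MB budget covers adapters for over a hundred such layers, while the same 3.9 MB spent on FP16 weights would hold fewer than 2 million parameters.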
|
|