The Edge-Device LoRA Guide

In standard Large Language Models, when you finish training a LoRA (Low-Rank Adaptation), you usually "merge" the LoRA matrices directly into the base weights so you only have to load one model.

If you do that with MiniLM, you completely destroy the 1.58-bit compression.

The Problem

MiniLM's internal weights are ternary (-1, 0, 1). A LoRA introduces high-precision FP16 adapter weights. If you merge them (W = W_ternary + A*B), the resulting matrix is no longer ternary. Every entry becomes an arbitrary floating-point value, so the whole matrix has to be stored in FP16 at 16 bits per weight instead of about 1.58, roughly ten times larger. That destroys the memory efficiency edge devices require.
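The effect is easy to reproduce on a toy layer. The sketch below is not MiniLM code: the matrix sizes, the random values, and the is_ternary helper are all made up for illustration, and the shapes of A and B are assumed so that A*B matches the base matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, rank = 8, 8, 2  # toy sizes, chosen only for illustration

# Ternary base weights: every entry is -1, 0, or 1.
W_ternary = rng.integers(-1, 2, size=(d_out, d_in)).astype(np.int8)

# FP16 low-rank adapter factors (shapes assumed so that A @ B matches W_ternary).
A = (0.1 * rng.standard_normal((d_out, rank))).astype(np.float16)
B = (0.1 * rng.standard_normal((rank, d_in))).astype(np.float16)

def is_ternary(M):
    """Return True if every entry of M is exactly -1, 0, or 1 (hypothetical helper)."""
    return bool(np.isin(M, (-1, 0, 1)).all())

# The "merge" from the text: W = W_ternary + A*B.
W_merged = W_ternary.astype(np.float16) + A @ B

print(is_ternary(W_ternary))  # True
print(is_ternary(W_merged))   # False

# Storage cost per weight: 16 bits for FP16 versus about 1.58 bits for ternary.
print(round(16 / 1.58, 1))    # 10.1
```

Once the merged entries stop being exactly -1, 0, or 1, there is no ternary encoding left to exploit, which is where the roughly tenfold size increase comes from.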

The Solution: "Side-Car" LoRAs

To keep inference fast and the memory footprint ultra-low on an edge device (like a smartwatch or IoT sensor), MiniLM uses a "Side-Car" architecture.

The Base Model (3.9 MB): Stays completely frozen in 1.58-bit ternary precision.
The LoRA Adapters (~1 MB): Two tiny FP16 matrices (A and B) that sit next to the base layer.
The Math: During inference, the input flows through the 1.58-bit base layer (using fast integer math). In parallel, it flows through the tiny FP16 LoRA layer. The two outputs are simply added together at the end: y = W_ternary*x + A*(B*x). This equals (W_ternary + A*B)*x, but the merged matrix is never formed.
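A minimal NumPy sketch of that forward pass follows. It is an illustration under assumptions, not MiniLM's implementation: sidecar_forward is a hypothetical name, the sizes are toy values, and the base path is simulated in float32 rather than with a real integer kernel.

```python
import numpy as np

def sidecar_forward(x, W_ternary, A, B):
    """Base path plus side-car path, summed at the end (hypothetical helper)."""
    # Base path: weights are only -1, 0, or 1. A real kernel would use integer
    # adds and subtracts here; this sketch simulates it in float32.
    base_out = W_ternary.astype(np.float32) @ x
    # Side-car path: two skinny matmuls. The dense product A @ B is never built.
    lora_out = A.astype(np.float32) @ (B.astype(np.float32) @ x)
    return base_out + lora_out

rng = np.random.default_rng(0)
d_out, d_in, rank = 8, 8, 2  # toy sizes, chosen only for illustration
W_ternary = rng.integers(-1, 2, size=(d_out, d_in)).astype(np.int8)
A = (0.1 * rng.standard_normal((d_out, rank))).astype(np.float16)
B = (0.1 * rng.standard_normal((rank, d_in))).astype(np.float16)
x = rng.standard_normal(d_in).astype(np.float32)

y_sidecar = sidecar_forward(x, W_ternary, A, B)

# Reference: what a merged layer would compute, built here only for comparison.
W_merged = W_ternary.astype(np.float32) + A.astype(np.float32) @ B.astype(np.float32)
y_merged = W_merged @ x

print(np.allclose(y_sidecar, y_merged, atol=1e-4))  # True
```

The two paths give the same output as a merged layer, while the stored weights stay split into a ternary matrix and two small FP16 factors.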

Why this is the Holy Grail for Edge Devices

Because the LoRAs are kept isolated as tiny ~1 MB files, you can build an operating system for an edge device that keeps a single 3.9 MB base model permanently loaded in RAM and "hot-swaps" tiny LoRAs on the fly depending on which app the user opens!

User speaks a Smart Home command -> Hot-load lora_smarthome.pt.
User asks to text their mom -> Drop the smart home LoRA, load lora_sms.pt.
User reviews a restaurant -> Load lora_sentiment.pt.
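The swap logic in the scenarios above can be sketched in a few lines. Everything here is hypothetical: AdapterSlot, ADAPTER_FOR_INTENT, and fake_loader are illustrative names, and the loader is a stand-in. In a PyTorch setup the loader would plausibly wrap torch.load on the .pt file, but that is an assumption about how the adapters are stored.

```python
class AdapterSlot:
    """Keeps at most one side-car adapter in memory next to the resident base model."""

    def __init__(self, loader):
        self._loader = loader        # callable: file name -> adapter weights
        self.active_name = None
        self.active_weights = None

    def swap(self, name):
        if name != self.active_name:
            self.active_weights = None            # release the old adapter first
            self.active_weights = self._loader(name)
            self.active_name = name
        return self.active_weights


# Hypothetical intent-to-file routing, using the file names from the list above.
ADAPTER_FOR_INTENT = {
    "smart_home": "lora_smarthome.pt",
    "sms": "lora_sms.pt",
    "review": "lora_sentiment.pt",
}

load_log = []

def fake_loader(name):
    # Stand-in for reading the adapter file from storage.
    load_log.append(name)
    return {"file": name}

slot = AdapterSlot(fake_loader)
for intent in ("smart_home", "smart_home", "sms", "review"):
    slot.swap(ADAPTER_FOR_INTENT[intent])

print(slot.active_name)  # lora_sentiment.pt
print(load_log)          # ['lora_smarthome.pt', 'lora_sms.pt', 'lora_sentiment.pt']
```

Repeating the same intent does not reload anything, and the previous adapter is released before the next one is read, so only one adapter sits in memory at a time.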

Training Your Own LoRA

MiniLM ships with train_lora_dynamic.py and an interactive Streamlit UI to let you build your own LoRAs in minutes.

All you need is a JSON file of Input/Output pairs. Because the model is so small, it works best on narrow, highly constrained tasks such as Information Extraction, JSON formatting, or strict Classification.

Example JSON (my_dataset.json):

[
    {"input": "Tell Alex I'll be 5 minutes late", "output": "{\"contact\": \"Alex\", \"message\": \"I'll be 5 minutes late\"}"}
]
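Hand-escaping the quotes inside "output" is error-prone, so it can help to generate the records instead. The sketch below assumes only what the example shows (each record has a string "input" and a string "output" holding JSON); make_example and find_problems are hypothetical helpers, not part of MiniLM.

```python
import json

def make_example(text, contact, message):
    """Build one record; json.dumps handles the quote escaping."""
    return {
        "input": text,
        "output": json.dumps({"contact": contact, "message": message}),
    }

def find_problems(records):
    """Return human-readable problems; an empty list means the data looks fine."""
    problems = []
    for i, rec in enumerate(records):
        for key in ("input", "output"):
            if not isinstance(rec.get(key), str):
                problems.append(f"record {i}: '{key}' is missing or not a string")
        try:
            json.loads(rec.get("output", ""))
        except (TypeError, json.JSONDecodeError):
            problems.append(f"record {i}: 'output' is not valid JSON")
    return problems

records = [
    make_example("Tell Alex I'll be 5 minutes late", "Alex", "I'll be 5 minutes late"),
]

print(json.dumps(records, indent=4))  # same content as the example file above
print(find_problems(records))         # []
```

Writing the list out with json.dump(records, f, indent=4) produces a file in the same shape as my_dataset.json. The JSON check on "output" only makes sense for extraction-style tasks like this one; drop it for plain classification labels.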

Run the training script:

python3 train_lora_dynamic.py my_dataset.json 300 my_custom_lora.pt

In just 300 steps (about 90 seconds on an M3 Mac), the tiny ~1 MB side-car LoRA learns the extraction pattern, while the 3.9 MB base model remains completely untouched!
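To see why the base model stays untouched, here is a toy training loop in NumPy. It is not what train_lora_dynamic.py does internally (the source does not show that). It is a sketch under assumptions: a single linear layer, a squared-error loss, plain gradient descent, and made-up sizes, data, and learning rate. Only A and B receive updates.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, rank, n = 8, 8, 2, 64  # toy sizes, chosen only for illustration

W = rng.integers(-1, 2, size=(d_out, d_in)).astype(np.float64)  # frozen ternary base
W_before = W.copy()

# Toy task: targets come from the base plus a hidden low-rank correction.
delta = 0.3 * rng.standard_normal((d_out, rank)) @ rng.standard_normal((rank, d_in))
X = rng.standard_normal((d_in, n))
Y = (W + delta) @ X

A = np.zeros((d_out, rank))                           # zero init: adapter starts as a no-op
B = rng.standard_normal((rank, d_in)) / np.sqrt(d_in)

def loss_and_grads(A, B):
    H = B @ X                       # side-car hidden activations, shape (rank, n)
    R = W @ X + A @ H - Y           # residual of the side-car forward pass
    loss = 0.5 * np.mean(np.sum(R ** 2, axis=0))
    grad_A = R @ H.T / n
    grad_B = A.T @ R @ X.T / n
    return loss, grad_A, grad_B

lr, steps = 0.02, 300               # assumed hyperparameters for this toy problem
losses = []
for _ in range(steps):
    loss, grad_A, grad_B = loss_and_grads(A, B)
    losses.append(loss)
    A = A - lr * grad_A             # only the adapter factors are updated
    B = B - lr * grad_B

print(losses[-1] < losses[0])        # True: the adapter fits the correction
print(np.array_equal(W, W_before))   # True: the base weights never changed
```

Because W never appears on the left-hand side of an update, the ternary weights are bit-for-bit identical before and after training, and the saved adapter only needs to contain A and B.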