	# The Edge-Device LoRA Guide

	With standard large language models, once you finish training a LoRA (Low-Rank Adaptation) adapter, you usually "merge" its matrices directly into the base weights so that only one model has to be loaded.

	If you do that with MiniLM, you completely destroy the 1.58-bit compression.

	## The Problem
	MiniLM's internal weights are ternary (`-1, 0, 1`), while a LoRA introduces higher-precision `FP16` adapter weights. If you merge them (`W = W_ternary + A*B`), the result is no longer ternary: it becomes a dense floating-point matrix that needs 16 bits per weight instead of about 1.58, roughly a 10x increase, which wipes out the memory efficiency required for edge devices.
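
A small worked example makes the problem concrete. This is a minimal sketch in plain Python, not MiniLM code: the 2x2 weight values, the rank-1 adapter, and the helper names `matmul` and `is_ternary` are all illustrative assumptions.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def is_ternary(m):
    """True if every entry is exactly -1, 0, or 1."""
    return all(v in (-1, 0, 1) for row in m for v in row)

# Toy ternary base weight (2x2) and a rank-1 adapter; the values are made up.
W_ternary = [[1, 0], [-1, 1]]
A = [[0.5], [0.25]]   # shape 2x1
B = [[0.5, -0.25]]    # shape 1x2

delta = matmul(A, B)  # the low-rank update A*B
W_merged = [[w + d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W_ternary, delta)]

print(is_ternary(W_ternary))  # True
print(is_ternary(W_merged))   # False: entries such as 1.25 are no longer ternary

# Ideal storage per weight: log2(3) bits for ternary vs 16 bits for FP16.
bits_ternary = math.log2(3)
print(round(16 / bits_ternary, 1))  # 10.1, i.e. roughly 10x more memory once merged
```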

	## The Solution: "Side-Car" LoRAs
	To keep inference fast and the memory footprint very low on an edge device (such as a smartwatch or IoT sensor), MiniLM uses a "Side-Car" architecture.

	1. The Base Model (6.0 MB): Stays completely frozen in 1.58-bit ternary precision.
	2. The LoRA Adapters (~1 MB): Two tiny FP16 matrices (`A` and `B`) that sit next to the base layer.
	3. The Math: During inference, the input flows through the 1.58-bit base layer (using fast integer math). Simultaneously, it flows through the tiny FP16 LoRA layer. The two outputs are simply added together at the end.
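
The three points above can be sketched as a single forward pass. This is an illustration in plain Python under stated assumptions, not MiniLM's actual kernel: the function names (`ternary_matvec`, `sidecar_forward`), the toy weights, and the `A*B` ordering (taken from the merge formula earlier in this guide) are all hypothetical stand-ins.

```python
def matvec(m, x):
    """Ordinary matrix-vector product (stands in for the FP16 path)."""
    return [sum(w * v for w, v in zip(row, x)) for row in m]

def ternary_matvec(w_ternary, x):
    """Base path: every weight is -1, 0, or 1, so each term is an add,
    a subtract, or a skip. No multiplications are needed."""
    out = []
    for row in w_ternary:
        acc = 0
        for w, v in zip(row, x):
            if w == 1:
                acc += v
            elif w == -1:
                acc -= v
        out.append(acc)
    return out

def sidecar_forward(w_ternary, lora_a, lora_b, x):
    """y = W_ternary @ x + A @ (B @ x), with the base weights left untouched."""
    base = ternary_matvec(w_ternary, x)
    low_rank = matvec(lora_b, x)         # project the input down to rank r
    lora = matvec(lora_a, low_rank)      # project back up to the output size
    return [b + l for b, l in zip(base, lora)]

# Toy values, made up for illustration.
W_ternary = [[1, 0], [-1, 1]]
A = [[0.5], [0.25]]   # shape 2x1
B = [[0.5, -0.25]]    # shape 1x2

print(sidecar_forward(W_ternary, A, B, [4, 2]))  # [4.75, -1.625]
```

The result equals what the merged matrix `W_ternary + A*B` would produce for the same input, but the base weights never leave their ternary form.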

	## Why this is the Holy Grail for Edge Devices
	Because the LoRAs stay isolated as tiny ~1 MB files, an edge-device operating system can keep a single 3.9 MB base model permanently loaded in RAM and "hot-swap" LoRAs on the fly depending on which app the user opens.

	- User speaks a Smart Home command -> Hot-load `lora_smarthome.pt`.
	- User asks to text their mom -> Drop the smart home LoRA, load `lora_sms.pt`.
	- User reviews a restaurant -> Load `lora_sentiment.pt`.
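
The swapping flow above could be organized roughly as follows. This is a hedged sketch, not part of MiniLM: the `AdapterSwitcher` class, the `ROUTES` table, and the `fake_loader` stand-in are hypothetical. The `.pt` extension suggests PyTorch checkpoints, but the real loading code is not shown in this guide, so the loader is passed in as a plain callable.

```python
class AdapterSwitcher:
    """Keeps one base model resident and swaps the active side-car adapter."""

    def __init__(self, base_model, load_adapter):
        self.base_model = base_model      # loaded once, never reloaded
        self.load_adapter = load_adapter  # hypothetical callable: path -> adapter weights
        self.active_path = None
        self.active_adapter = None

    def activate(self, path):
        """Make `path` the active adapter, loading it only if it is not active already."""
        if path == self.active_path:
            return self.active_adapter
        self.active_adapter = None        # drop the old adapter before loading the next
        self.active_adapter = self.load_adapter(path)
        self.active_path = path
        return self.active_adapter

# Hypothetical mapping from the kind of request to an adapter file.
ROUTES = {
    "smart_home": "lora_smarthome.pt",
    "sms": "lora_sms.pt",
    "review": "lora_sentiment.pt",
}

# Stand-in loader that records which files were requested.
loads = []
def fake_loader(path):
    loads.append(path)
    return {"path": path}

switcher = AdapterSwitcher(base_model="base", load_adapter=fake_loader)
switcher.activate(ROUTES["smart_home"])
switcher.activate(ROUTES["sms"])
switcher.activate(ROUTES["sms"])   # already active, so nothing is reloaded
print(loads)  # ['lora_smarthome.pt', 'lora_sms.pt']
```

Only one adapter is held at a time, so peak memory stays at the base model plus a single side-car.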

	## Training Your Own LoRA
	MiniLM ships with `train_lora_dynamic.py` and an interactive Streamlit UI to let you build your own LoRAs in minutes.

	You only need a JSON file of input/output pairs. Because the model is so small, it works best on narrow, highly constrained tasks such as information extraction, JSON formatting, or strict classification.

	Example JSON (`my_dataset.json`):
	```json
	[
	{"input": "Tell Alex I'll be 5 minutes late", "output": "{\"contact\": \"Alex\", \"message\": \"I'll be 5 minutes late\"}"}
	]
	```
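
Before training, it can help to sanity-check the file. The helper below is a minimal sketch using only Python's standard `json` module. The function name `validate_dataset` and the `output_is_json` flag are hypothetical, and the checks reflect only the format shown above, not whatever `train_lora_dynamic.py` enforces internally.

```python
import json

def validate_dataset(records, output_is_json=False):
    """Return a list of problems; an empty list means the records look usable."""
    if not isinstance(records, list):
        return ["top level must be a JSON array"]
    problems = []
    for i, rec in enumerate(records):
        if not isinstance(rec, dict):
            problems.append(f"record {i}: not an object")
            continue
        for key in ("input", "output"):
            value = rec.get(key)
            if not isinstance(value, str) or not value:
                problems.append(f"record {i}: '{key}' must be a non-empty string")
        # Optional check for extraction tasks whose targets are JSON strings.
        if output_is_json and isinstance(rec.get("output"), str) and rec["output"]:
            try:
                json.loads(rec["output"])
            except json.JSONDecodeError:
                problems.append(f"record {i}: 'output' is not valid JSON")
    return problems

# Same record as the example above, built with json.dumps to avoid manual escaping.
sample = [{
    "input": "Tell Alex I'll be 5 minutes late",
    "output": json.dumps({"contact": "Alex", "message": "I'll be 5 minutes late"}),
}]
print(validate_dataset(sample, output_is_json=True))  # []
print(validate_dataset([{"input": "hello"}]))
# ["record 0: 'output' must be a non-empty string"]
```

To check a file on disk, read it with `json.load` inside a `with open("my_dataset.json") as f:` block and pass the result to `validate_dataset`.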

	Run the training script:
	```bash
	python3 train_lora_dynamic.py my_dataset.json 300 my_custom_lora.pt
	```

	In just 300 steps (about 90 seconds on a Mac M3), the tiny ~1 MB side-car LoRA can memorize the extraction pattern, while the 3.9 MB base model remains completely untouched.