MiniLM / ARCHITECTURE.md

Upload ARCHITECTURE.md with huggingface_hub

f4f6f1c verified 22 days ago

4.48 kB

	# MiniLM: The 1.58-bit Architecture Deep Dive

	MiniLM is not just a quantized model—it is a completely custom neural network architecture built from the ground up to natively operate in 1.58-bit (Ternary) precision.

	By heavily compressing the internal mathematics of the Transformer, we achieved a deep 12-layer model that fits entirely into 6.00 MB of RAM, making it small enough to run on microcontrollers, smartwatches, and embedded IoT devices.

	This document serves as a masterclass on exactly how MiniLM was engineered.

	---

	## 1. The Core Innovation: 1.58-bit Ternary Weights

	In standard Large Language Models (like Llama 3 or GPT-4), the neural network's memory (its "weights") are stored as 16-bit floating-point numbers (`FP16`). A single layer can easily exceed gigabytes of RAM.

	MiniLM uses the BitNet 1.58b architecture paradigm. We discard floating-point precision entirely. Every single internal weight in MiniLM's Linear layers is constrained to exactly three possible values:
	* `-1`
	* `0`
	* `1`

	Because $\log_2(3) \approx 1.58$, we call this a 1.58-bit model.

	### Why is this revolutionary?
	When you multiply a number by `-1`, `0`, or `1`, you aren't actually doing complex matrix multiplication. You are simply doing Addition and Subtraction.
	If a weight is `1`, you add the input. If it is `-1`, you subtract the input. If it is `0`, you ignore it.

	This means MiniLM replaces the most computationally expensive operation in AI (Floating Point Matrix Multiplication) with ultra-fast, hardware-efficient Integer Addition.

	---

	## 2. How We Trained It: The Straight-Through Estimator (STE)

	You cannot train a ternary neural network using standard backpropagation, because the rounding function (clamping a value to -1, 0, or 1) has a derivative of zero almost everywhere. The gradient would instantly "die" and the model would never learn.

	To solve this, we implemented a custom Straight-Through Estimator (STE):
	1. Forward Pass: We take the high-precision latent weights, calculate their mean, divide by a scaling factor (`beta`), and aggressively round them to `[-1, 0, 1]`. The forward calculations are performed using these ternary weights.
	2. Backward Pass: When the loss calculates the error gradient, we pretend the rounding step never happened. We pass the gradient straight through to the high-precision latent weights.

	This allows the high-precision weights to slowly adjust over time, until their rounded ternary counterparts snap into the optimal configuration.

	---

	## 3. Breaking the Depth Barrier: Weight Tying

	Our initial 4-layer model fit into 3.93 MB and showed promising results, but 4 layers is incredibly shallow for an LLM to form coherent, long-form thoughts.

	To solve this, we implemented Weight Tying.
	In a standard LLM, the `Embedding Layer` (which turns words into vectors) and the `Output Head` (which turns vectors back into words) are two separate, massive matrices.

	Because we used a 32,000 token vocabulary, these two matrices were consuming over 85% of our total parameter budget!

	By mathematically tying the weights together (`model.head.weight = model.embedding.weight`), we instantly freed up 8 Million parameters. We re-invested this exact parameter budget to triple the depth of the neural network from 4 layers to 12 layers, drastically improving output coherence without increasing the file size by a single byte.

	---

	## 4. Knowledge Distillation

	Training a 1.58-bit model from absolute scratch using Next-Token Prediction is notoriously difficult and requires massive amounts of data and compute (100k+ steps).

	Instead, we used Knowledge Distillation.
	1. We loaded `HuggingFaceTB/SmolLM-135M-Instruct` as a "Teacher" model.
	2. We forced MiniLM to use the exact same tokenizer as SmolLM.
	3. For every prompt, the Teacher model output a rich probability distribution (logits) of what the next word should be.
	4. We used `KLDivLoss` (KL Divergence) to force MiniLM to perfectly mimic the Teacher's probability distribution.

	By learning from the Teacher's rich understanding of language rather than just a sparse one-hot encoded dataset, MiniLM converged in just 3,000 steps on the TinyStories dataset!

	---

	## Conclusion
	MiniLM is a testament to the future of Edge AI. By combining Ternary Quantization, Weight Tying, and Knowledge Distillation, we have packed the structural depth of a 12-layer Transformer into a file size smaller than an MP3 song.