Distillix 125M: The Resurrected BitNet
"We didn't train it. We healed it."
This model is a scientific anomaly. It is a 1.58-bit (ternary) LLM that suffered total weight collapse during training (weights → 0.00). Instead of retraining from scratch, we resurrected it using Geometric Engineering based on the Latent Metric Model (LMM) theory.
The Crisis
During BitNet training, standard weight decay (L2 regularization) created a catastrophic failure:
- The Zero Trap: Weights were pushed toward zero by the optimizer
- Ternary Quantization: Once weights fell below the quantization threshold, they snapped to 0
- Sticky Death: Gradients couldn't escape the zero bucket—the neurons died permanently
- Result: 99.1% of MLP weights were zero. The model was brain-dead.
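The zero trap can be sketched numerically. Below is a minimal numpy illustration, assuming BitNet b1.58's absmean-style quantizer (this is not the actual training code):

```python
import numpy as np

# Illustrative sketch (not the repo's trainer): BitNet b1.58 quantizes
# weights to {-1, 0, +1} using an absmean scale. Weight decay keeps
# shrinking the latent weights, so once |w| drops below roughly half the
# scale it rounds to 0 -- and a zero weight contributes no activation.
def ternary_quantize(w, eps=1e-9):
    scale = np.abs(w).mean() + eps              # absmean scaling
    return np.clip(np.round(w / scale), -1, 1), scale

w = np.array([0.9, -0.8, 0.02, -0.01, 0.03])    # decayed latent weights
q, scale = ternary_quantize(w)
# q -> [ 1. -1.  0.  0.  0.]: the three small weights land in the
# "zero bucket" and produce no signal for gradients to act on.
```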
The Resurrection
Instead of discarding months of training, we applied manifold physics to bring the model back:
| Phase | Method | Result |
|---|---|---|
| 1. Time Travel | Found model_500steps.pt (last checkpoint before total collapse) | MLP Std: 0.008 (dying but alive) |
| 2. Geometric Engineering | Wasserstein loss + SVD denoising | Task loss: 1.15 (learned!) but 88% sparse |
| 3. Inflation | Pushed weights FROM zero TO ±0.02 | Sparsity: 88% → 21.5% |
| 4. Diagnosis | Discovered chat-format training data | Model speaks when prompted correctly! |
The Physics
- Polarized Optimizer: Replaced weight decay with a "double-well potential" that REPELS weights from zero
- Three-Peaks Potential: Enforced clustering at {-S, 0, +S} instead of just "away from zero"
- Wasserstein Loss (Syrota et al.): Aligned the GLOBAL weight distribution to the BitNet lattice using optimal transport
- SVD Denoising (Whiteley et al.): Projected weights onto principal components to remove noise while preserving structure
- Manifold Inflation: Added redundancy back after over-compression to restore robustness
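The polarized/three-peaks idea can be sketched with a toy potential. The specific polynomial below is an assumption for illustration; the text above only states that the minima sit at {-S, 0, +S}:

```python
# Toy "three-peaks" potential V(w) = w^2 * (w^2 - S^2)^2.
# It has minima at exactly {-S, 0, +S}, so -dV/dw pushes weights onto
# the ternary lattice, unlike plain weight decay V(w) = w^2, whose only
# minimum is 0 (the zero trap described above).
S = 0.02

def three_peaks(w):
    return w**2 * (w**2 - S**2)**2

def three_peaks_grad(w):
    # dV/dw = 2w (w^2 - S^2)(3w^2 - S^2)
    return 2 * w * (w**2 - S**2) * (3 * w**2 - S**2)

# A weight just below +S is pulled UP toward +S (negative gradient),
# whereas weight decay's gradient 2w would pull it down toward 0.
g_near_S = three_peaks_grad(0.018)   # < 0: pushes w up toward +S
g_near_0 = three_peaks_grad(0.005)   # > 0: pushes w down toward 0
```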
Model Stats
| Metric | Value |
|---|---|
| Architecture | Llama-style Transformer |
| Parameters | 125M |
| Hidden Dim | 768 |
| Layers | 12 |
| Heads | 12 (Query) / 4 (KV) |
| Quantization | BitNet b1.58 (Ternary: {-1, 0, +1}) |
| Final Sparsity | 21.5% |
| Weight Distribution | 29% (-S) / 42% (0) / 29% (+S) |
| MLP Std | 0.021 (exactly at target!) |
| Task Loss | ~0.2 |
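The table's weight-health numbers (sparsity, std, peak distribution) can be reproduced from any flattened weight tensor with a few numpy lines; the function name and tolerance here are illustrative, not from the repo:

```python
import numpy as np

# Illustrative metrics over a flattened weight tensor. zero_tol defines
# the "zero bucket"; values outside it count toward the -S / +S peaks.
def weight_stats(w, zero_tol=1e-8):
    w = np.asarray(w).ravel()
    sparsity = float(np.mean(np.abs(w) < zero_tol))  # fraction of zeros
    std = float(w.std())                             # the "MLP Std" metric
    neg = float(np.mean(w < -zero_tol))              # mass at the -S peak
    pos = float(np.mean(w > zero_tol))               # mass at the +S peak
    return sparsity, std, neg, pos

w = np.array([-0.02, 0.0, 0.02, 0.0, -0.02, 0.02, 0.0, 0.0])
sparsity, std, neg, pos = weight_stats(w)
# sparsity = 0.5, neg = pos = 0.25
```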
Usage
The model was resurrected on chat-format data, so it expects this prompt structure:
```python
prompt = """### User:
Write a Python function to calculate fibonacci numbers.
### Assistant:
"""
```
Example Output
Here is a Python function to calculate Fibonacci numbers using recursion:
```python
def fibonacci(n):
    if n <= 0:
        return 0
    elif n == 1:
        return 1
    else:
        return fibonacci(n-1) + fibonacci(n-2)
```
You can also use an iterative approach for better performance:
```python
def fibonacci_iter(n):
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a
```
Loading the Model
```python
import torch
from huggingface_hub import hf_hub_download

# Download the resurrected model
path = hf_hub_download(
    repo_id="rileyseaburg/distillix",
    filename="inflation/inflation-2000.pt"
)

# Load checkpoint
ckpt = torch.load(path, map_location='cpu')
state_dict = ckpt['model_state_dict']

# Load into your model architecture
model.load_state_dict(state_dict)
```
Files
| File | Description |
|---|---|
| model_500steps.pt | The "Time Machine" - last healthy checkpoint before collapse |
| geometric/geometric-*.pt | Checkpoints from Wasserstein+SVD training |
| inflation/inflation-2000.pt | THE FINAL MODEL - fully resurrected |
The Journey (TL;DR)
```
Step 500:  MLP Std = 0.008 (Dying)
Step 2000: MLP Std = 0.000 (Dead - 99% zeros)
        ↓
[GEOMETRIC ENGINEERING]
        ↓
Geometric: Task Loss = 1.15 (Learned! But 88% sparse)
        ↓
[INFLATION]
        ↓
Final: MLP Std = 0.021 (Alive!)
       Sparsity = 21.5% (Dense!)
       Distribution = 29/42/29 (Balanced!)

OUTPUT: "Here is a Python function to calculate Fibonacci..."
```
Why This Matters
- Dead models can be resurrected - You don't have to throw away collapsed checkpoints
- Manifold geometry is real - The LMM theory predicted this would work, and it did
- BitNet needs special optimizers - Standard weight decay is lethal for ternary networks
- Physics > Brute Force - We healed the model with math, not more compute
Theoretical Foundation
- Whiteley et al. (2025): "Statistical Exploration of the Manifold Hypothesis" - Theorem 1 proves signal lives in principal components
- Syrota et al. (2025): "Metric Identifiability in Latent Models" - Theorem 4.7 proves metric structure is recoverable from distribution
Credits
- Engineering & Resurrection: Riley Seaburg
- Theoretical Framework: Whiteley, Gray, Rubin-Delanchy (LMM), Syrota et al. (Metric Identifiability)
- Original BitNet: Microsoft Research
License
MIT
"The model was dead. We didn't retrain it. We performed surgery on its soul."