Distillix 125M: The Resurrected BitNet

"We didn't train it. We healed it."

This model is a scientific anomaly. It is a 1.58-bit (ternary) LLM that suffered total weight collapse during training (weights → 0.00). Instead of retraining from scratch, we resurrected it using Geometric Engineering based on the Latent Metric Model (LMM) theory.

The Crisis

During BitNet training, standard weight decay (L2 regularization) created a catastrophic failure:

  • The Zero Trap: Weights were pushed toward zero by the optimizer
  • Ternary Quantization: Once weights fell below the quantization threshold, they snapped to 0
  • Sticky Death: Gradients couldn't escape the zero bucket—the neurons died permanently
  • Result: 99.1% of MLP weights were zero. The model was brain-dead.
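The zero trap can be illustrated with a minimal sketch of BitNet-style ternary quantization (the scale and rounding rule below are illustrative, not the exact training code): once weight decay shrinks a weight below half the quantization scale, it rounds to 0 and contributes nothing to the forward pass.

```python
import torch

def ternary_quantize(w: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """BitNet b1.58-style ternary quantization: snap weights to {-s, 0, +s}."""
    scale = w.abs().mean().clamp(min=eps)    # per-tensor scale s
    q = torch.round(w / scale).clamp(-1, 1)  # ternary codes {-1, 0, +1}
    return q * scale

# Weight decay shrinks each weight every step; once |w| < 0.5 * scale,
# the weight quantizes to exactly 0 (the "zero bucket").
w = torch.tensor([0.9, 0.4, 0.05, -0.03])
print(ternary_quantize(w))  # small entries collapse to 0
```

In this sketch the last two weights land in the zero bucket; with standard L2 decay pulling everything toward zero, more and more weights cross that threshold and die, matching the 99.1% collapse described above.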

The Resurrection

Instead of discarding months of training, we applied manifold physics to bring the model back:

Phase | Method | Result
----- | ------ | ------
1. Time Travel | Found model_500steps.pt (last checkpoint before total collapse) | MLP Std: 0.008 (dying but alive)
2. Geometric Engineering | Wasserstein loss + SVD denoising | Task loss: 1.15 (learned!) but 88% sparse
3. Inflation | Pushed weights FROM zero TO ±0.02 | Sparsity: 88% → 21.5%
4. Diagnosis | Discovered chat-format training data | Model speaks when prompted correctly!

The Physics

  1. Polarized Optimizer: Replaced weight decay with a "double-well potential" that REPELS weights from zero
  2. Three-Peaks Potential: Enforced clustering at {-S, 0, +S} instead of just "away from zero"
  3. Wasserstein Loss (Syrota et al.): Aligned the GLOBAL weight distribution to the BitNet lattice using optimal transport
  4. SVD Denoising (Whiteley et al.): Projected weights onto principal components to remove noise while preserving structure
  5. Manifold Inflation: Added redundancy back after over-compression to restore robustness
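The "double-well potential" of item 1 can be sketched as a regularizer whose gradient repels weights from zero and attracts them toward ±S. The quartic form below is an assumption (the exact potential used in training is not given here), but it shows the key sign flip relative to weight decay.

```python
import torch

def double_well_penalty(w: torch.Tensor, S: float = 0.02) -> torch.Tensor:
    """Quartic double-well with minima at w = ±S and a local max at w = 0.

    Gradient is 4*w*(w^2 - S^2)/S^4: negative for 0 < w < S, so a
    gradient step pushes small positive weights UP toward +S instead
    of down toward 0 as L2 weight decay would (illustrative form).
    """
    return ((w**2 - S**2) ** 2 / S**4).sum()

w = torch.full((4,), 0.005, requires_grad=True)  # near-zero weights
loss = double_well_penalty(w)
loss.backward()
print(w.grad)  # negative: SGD step w -= lr * grad increases w
```

The three-peaks variant (item 2) would add a third basin at 0 so legitimate ternary zeros remain stable while borderline weights are driven to ±S.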

Model Stats

Metric | Value
------ | -----
Architecture | Llama-style Transformer
Parameters | 125M
Hidden Dim | 768
Layers | 12
Heads | 12 (Query) / 4 (KV)
Quantization | BitNet b1.58 (Ternary: {-1, 0, +1})
Final Sparsity | 21.5%
Weight Distribution | 29% (-S) / 42% (0) / 29% (+S)
MLP Std | 0.021 (exactly at target!)
Task Loss | ~0.2

Usage

The model was resurrected on chat-format data, so it expects this prompt structure:

prompt = """### User:
Write a Python function to calculate fibonacci numbers.

### Assistant:
"""

Example Output

Here is a Python function to calculate Fibonacci numbers using recursion:

def fibonacci(n):
    if n <= 0:
        return 0
    elif n == 1:
        return 1
    else:
        return fibonacci(n-1) + fibonacci(n-2)

You can also use an iterative approach for better performance:

def fibonacci_iter(n):
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

Loading the Model

import torch
from huggingface_hub import hf_hub_download

# Download the resurrected model
path = hf_hub_download(
    repo_id="rileyseaburg/distillix",
    filename="inflation/inflation-2000.pt"
)

# Load checkpoint
ckpt = torch.load(path, map_location='cpu')
state_dict = ckpt['model_state_dict']

# Load into your model architecture (instantiate the matching
# 125M Llama-style BitNet model as `model` first)
model.load_state_dict(state_dict)
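After loading, the resurrection stats can be sanity-checked by measuring sparsity directly from the state dict. This is a sketch: the `"mlp"` key substring is an assumption about this architecture's parameter naming.

```python
import torch

def mlp_sparsity(state_dict: dict, key_hint: str = "mlp") -> float:
    """Fraction of exactly-zero entries across MLP weight tensors."""
    zeros, total = 0, 0
    for name, tensor in state_dict.items():
        if key_hint in name and tensor.dtype.is_floating_point:
            zeros += (tensor == 0).sum().item()
            total += tensor.numel()
    return zeros / max(total, 1)

# If the key naming matches, inflation-2000.pt should report roughly 21.5%:
# print(f"MLP sparsity: {mlp_sparsity(state_dict):.1%}")
```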

Files

File | Description
---- | -----------
model_500steps.pt | The "Time Machine": last healthy checkpoint before collapse
geometric/geometric-*.pt | Checkpoints from Wasserstein + SVD training
inflation/inflation-2000.pt | THE FINAL MODEL, fully resurrected

The Journey (TL;DR)

Step 500:   MLP Std = 0.008  (Dying)
Step 2000:  MLP Std = 0.000  (Dead - 99% zeros)
            ↓
    [GEOMETRIC ENGINEERING]
            ↓
Geometric:  Task Loss = 1.15  (Learned! But 88% sparse)
            ↓
    [INFLATION]
            ↓
Final:      MLP Std = 0.021  (Alive!)
            Sparsity = 21.5% (Dense!)
            Distribution = 29/42/29 (Balanced!)
            
OUTPUT: "Here is a Python function to calculate Fibonacci..."

Why This Matters

  1. Dead models can be resurrected - You don't have to throw away collapsed checkpoints
  2. Manifold geometry is real - The LMM theory predicted this would work, and it did
  3. BitNet needs special optimizers - Standard weight decay is lethal for ternary networks
  4. Physics > Brute Force - We healed the model with math, not more compute

Theoretical Foundation

  • Whiteley et al. (2025): "Statistical Exploration of the Manifold Hypothesis" - Theorem 1 proves signal lives in principal components
  • Syrota et al. (2025): "Metric Identifiability in Latent Models" - Theorem 4.7 proves metric structure is recoverable from distribution

Credits

  • Engineering & Resurrection: Riley Seaburg
  • Theoretical Framework: Whiteley, Gray, Rubin-Delanchy (LMM), Syrota et al. (Metric Identifiability)
  • Original BitNet: Microsoft Research

License

MIT


"The model was dead. We didn't retrain it. We performed surgery on its soul."
