Distillix 125M: The Resurrected BitNet

"We didn't train it. We healed it."

This model is a scientific anomaly. It is a 1.58-bit (ternary) LLM that suffered total weight collapse during training (weights → 0.00). Instead of retraining from scratch, we resurrected it using Geometric Engineering based on the Latent Metric Model (LMM) theory.

The Crisis

During BitNet training, standard weight decay (L2 regularization) created a catastrophic failure:

  • The Zero Trap: Weights were pushed toward zero by the optimizer
  • Ternary Quantization: Once weights fell below the quantization threshold, they snapped to 0
  • Sticky Death: Gradients couldn't escape the zero bucket—the neurons died permanently
  • Result: 99.1% of MLP weights were zero. The model was brain-dead.
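The zero trap can be illustrated with a minimal sketch of BitNet-style ternary quantization (the scale and rounding rule below are illustrative, not the exact training code): once weight decay shrinks a weight below half the quantization scale, it rounds to 0 and contributes nothing to the forward pass.

```python
import torch

def ternary_quantize(w: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """BitNet b1.58-style ternary quantization: snap weights to {-s, 0, +s}."""
    scale = w.abs().mean().clamp(min=eps)    # per-tensor scale s
    q = torch.round(w / scale).clamp(-1, 1)  # ternary codes {-1, 0, +1}
    return q * scale

# Weight decay shrinks each weight every step; once |w| < 0.5 * scale,
# the weight quantizes to exactly 0 (the "zero bucket").
w = torch.tensor([0.9, 0.4, 0.05, -0.03])
print(ternary_quantize(w))  # small entries collapse to 0
```

In this sketch the last two weights land in the zero bucket; with standard L2 decay pulling everything toward zero, more and more weights cross that threshold and die, matching the 99.1% collapse described above.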

The Resurrection

Instead of discarding months of training, we applied manifold physics to bring the model back:

Phase | Method | Result
----- | ------ | ------
1. Time Travel | Found model_500steps.pt (last checkpoint before total collapse) | MLP Std: 0.008 (dying but alive)
2. Geometric Engineering | Wasserstein loss + SVD denoising | Task loss: 1.15 (learned!) but 88% sparse
3. Inflation | Pushed weights FROM zero TO ±0.02 | Sparsity: 88% → 21.5%
4. Diagnosis | Discovered chat-format training data | Model speaks when prompted correctly!

The Physics

  1. Polarized Optimizer: Replaced weight decay with a "double-well potential" that REPELS weights from zero
  2. Three-Peaks Potential: Enforced clustering at {-S, 0, +S} instead of just "away from zero"
  3. Wasserstein Loss (Syrota et al.): Aligned the GLOBAL weight distribution to the BitNet lattice using optimal transport
  4. SVD Denoising (Whiteley et al.): Projected weights onto principal components to remove noise while preserving structure
  5. Manifold Inflation: Added redundancy back after over-compression to restore robustness
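The "double-well potential" of item 1 can be sketched as a regularizer whose gradient repels weights from zero and attracts them toward ±S. The quartic form below is an assumption (the exact potential used in training is not given here), but it shows the key sign flip relative to weight decay.

```python
import torch

def double_well_penalty(w: torch.Tensor, S: float = 0.02) -> torch.Tensor:
    """Quartic double-well with minima at w = ±S and a local max at w = 0.

    Gradient is 4*w*(w^2 - S^2)/S^4: negative for 0 < w < S, so a
    gradient step pushes small positive weights UP toward +S instead
    of down toward 0 as L2 weight decay would (illustrative form).
    """
    return ((w**2 - S**2) ** 2 / S**4).sum()

w = torch.full((4,), 0.005, requires_grad=True)  # near-zero weights
loss = double_well_penalty(w)
loss.backward()
print(w.grad)  # negative: SGD step w -= lr * grad increases w
```

The three-peaks variant (item 2) would add a third basin at 0 so legitimate ternary zeros remain stable while borderline weights are driven to ±S.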

Model Stats

Metric | Value
------ | -----
Architecture | Llama-style Transformer
Parameters | 125M
Hidden Dim | 768
Layers | 12
Heads | 12 (Query) / 4 (KV)
Quantization | BitNet b1.58 (Ternary: {-1, 0, +1})
Final Sparsity | 21.5%
Weight Distribution | 29% (-S) / 42% (0) / 29% (+S)
MLP Std | 0.021 (exactly at target!)
Task Loss | ~0.2

Usage

The model was resurrected on chat-format data, so it expects this prompt structure:

prompt = """### User:
Write a Python function to calculate fibonacci numbers.

### Assistant:
"""

Example Output

Here is a Python function to calculate Fibonacci numbers using recursion:

def fibonacci(n):
    if n <= 0:
        return 0
    elif n == 1:
        return 1
    else:
        return fibonacci(n-1) + fibonacci(n-2)

You can also use an iterative approach for better performance:

def fibonacci_iter(n):
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

Loading the Model

import torch
from huggingface_hub import hf_hub_download

# Download the resurrected model
path = hf_hub_download(
    repo_id="rileyseaburg/distillix",
    filename="inflation/inflation-2000.pt"
)

# Load checkpoint
ckpt = torch.load(path, map_location='cpu')
state_dict = ckpt['model_state_dict']

# Load into your model architecture (instantiate the matching
# 125M Llama-style BitNet model as `model` first)
model.load_state_dict(state_dict)
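After loading, the resurrection stats can be sanity-checked by measuring sparsity directly from the state dict. This is a sketch: the `"mlp"` key substring is an assumption about this architecture's parameter naming.

```python
import torch

def mlp_sparsity(state_dict: dict, key_hint: str = "mlp") -> float:
    """Fraction of exactly-zero entries across MLP weight tensors."""
    zeros, total = 0, 0
    for name, tensor in state_dict.items():
        if key_hint in name and tensor.dtype.is_floating_point:
            zeros += (tensor == 0).sum().item()
            total += tensor.numel()
    return zeros / max(total, 1)

# If the key naming matches, inflation-2000.pt should report roughly 21.5%:
# print(f"MLP sparsity: {mlp_sparsity(state_dict):.1%}")
```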

Files

File | Description
---- | -----------
model_500steps.pt | The "Time Machine": last healthy checkpoint before collapse
geometric/geometric-*.pt | Checkpoints from Wasserstein + SVD training
inflation/inflation-2000.pt | THE FINAL MODEL, fully resurrected

The Journey (TL;DR)

Step 500:   MLP Std = 0.008  (Dying)
Step 2000:  MLP Std = 0.000  (Dead - 99% zeros)
            ↓
    [GEOMETRIC ENGINEERING]
            ↓
Geometric:  Task Loss = 1.15  (Learned! But 88% sparse)
            ↓
    [INFLATION]
            ↓
Final:      MLP Std = 0.021  (Alive!)
            Sparsity = 21.5% (Dense!)
            Distribution = 29/42/29 (Balanced!)
            
OUTPUT: "Here is a Python function to calculate Fibonacci..."

Why This Matters

  1. Dead models can be resurrected - You don't have to throw away collapsed checkpoints
  2. Manifold geometry is real - The LMM theory predicted this would work, and it did
  3. BitNet needs special optimizers - Standard weight decay is lethal for ternary networks
  4. Physics > Brute Force - We healed the model with math, not more compute

Theoretical Foundation

  • Whiteley et al. (2025): "Statistical Exploration of the Manifold Hypothesis" - Theorem 1 proves signal lives in principal components
  • Syrota et al. (2025): "Metric Identifiability in Latent Models" - Theorem 4.7 proves metric structure is recoverable from distribution

Credits

  • Engineering & Resurrection: Riley Seaburg
  • Theoretical Framework: Whiteley, Gray, Rubin-Delanchy (LMM), Syrota et al. (Metric Identifiability)
  • Original BitNet: Microsoft Research

License

MIT


"The model was dead. We didn't retrain it. We performed surgery on its soul."
