---
tags:
- nanochat
- crate
- white-box
- sparse-coding
license: mit
---

# crate-d12-base

A **CRATE-α** (Coding RAte reduction TransformEr) language model trained with [nanochat-crate-a](https://github.com/modularflow/nanochat-crate-a), a fork of [nanochat](https://github.com/karpathy/nanochat) that integrates the CRATE white-box transformer architecture, SDPA/Flash Attention, and a self-supervised pseudo-labeling pipeline for domain-specific mid-training and fine-tuning.

This checkpoint serves as the **baseline** for a series of experiments exploring self-supervised learning for mid-training and fine-tuning with the CRATE architecture.

## What is CRATE?

CRATE is a **white-box transformer**: unlike standard transformers, whose architectures are heuristically designed, every layer of CRATE is mathematically derived from a principled optimization objective. Each layer alternates between two operations:

1. **MSSA (Multi-Head Subspace Self-Attention)** -- a *compression* step that performs gradient descent on the *coding rate reduction* objective. Q, K, and V share a single tied projection matrix, so the attention operation compresses token representations into low-dimensional subspaces.
2. **ODL (Overcomplete Dictionary Learning)** -- a *sparsification* step that projects tokens into an overcomplete dictionary space (4× expansion), applies a sparse activation, and projects back. This encourages the model to learn sparse, interpretable representations at every layer.

The net effect is that each forward pass solves a structured optimization problem: *compress* and *sparsify* the representation, layer by layer. The resulting internal representations are significantly more interpretable than those of standard transformers.

### Why ReLU Instead of Soft-Thresholding?

The original CRATE paper (NeurIPS 2023) used ISTA-style **soft-thresholding** as the sparse activation: `S_lambda(x) = sign(x) * max(|x| - lambda, 0)`.
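As a concrete illustration (a minimal NumPy sketch, not the repository's implementation), the operator zeroes out coefficients below the threshold and shrinks the rest toward zero:

```python
import numpy as np

def soft_threshold(x: np.ndarray, lam: float) -> np.ndarray:
    """ISTA-style soft-thresholding: S_lambda(x) = sign(x) * max(|x| - lambda, 0)."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

x = np.array([-1.5, -0.2, 0.0, 0.3, 2.0])
print(soft_threshold(x, 0.5))  # small entries -> 0, large entries shrink by lambda
```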
This is the theoretically "correct" proximal operator for L1-regularized sparse coding, but it caused training instability at scale. The repository supports either activation. CRATE-α (NeurIPS 2024) introduced three modifications that enable scaling:

| Change | Vanilla CRATE | CRATE-α |
|--------|---------------|---------|
| Dictionary | Complete (d × d) | Overcomplete (d × 4d) |
| Activation | Soft-threshold | **ReLU** with learnable bias |
| Sparse block | No residual | **Residual connection** |

**ReLU** works better for scaling because: (a) it has a well-behaved gradient everywhere (no sign discontinuity), (b) the learnable threshold/bias lets each neuron adaptively set its own sparsity level during training, and (c) ReLU is heavily optimized in GPU kernels. The resulting ODL block is structurally similar to a standard MLP -- but it is *derived from* sparse coding principles rather than heuristically chosen, giving it a principled interpretation as dictionary learning.

## Evaluation: MMLU

This model is evaluated on **MMLU** (Massive Multitask Language Understanding), a benchmark of 57 subjects spanning STEM, the humanities, social sciences, and professional domains. MMLU tests the model's ability to answer multiple-choice questions requiring world knowledge and reasoning -- from abstract algebra and anatomy to US foreign policy and virology. It provides a broad signal for how much general knowledge the model absorbed during pre-training.
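nanochat's actual evaluation harness is not reproduced here; as a rough sketch of how base models are commonly scored on MMLU, the model's log-probability of each answer letter is compared and the argmax is taken (`logprob_fn` below is a hypothetical stand-in for a real model call):

```python
# Hypothetical sketch of likelihood-based multiple-choice scoring, not
# nanochat's actual eval code: pick the answer letter the model assigns
# the highest log-probability, given the question as context.

def pick_answer(logprob_fn, question: str, choices: list[str]) -> int:
    """Return the index of the choice with the highest model log-probability."""
    letters = "ABCD"
    scores = [logprob_fn(question, letters[i]) for i in range(len(choices))]
    return max(range(len(choices)), key=scores.__getitem__)

# Stub scorer standing in for a real model; here it simply prefers "C".
fake_logprobs = {"A": -2.3, "B": -1.9, "C": -0.4, "D": -3.1}
idx = pick_answer(lambda q, letter: fake_logprobs[letter],
                  "Which of the following is a prime number?",
                  ["8", "9", "7", "6"])
print("ABCD"[idx])  # prints "C", the stub's highest-scoring letter
```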
## Baseline for Self-Supervised Experiments

This checkpoint is the starting point for a multi-stage experimental pipeline:

```
crate-d12-base (this model)
├── → Code self-supervised (learn structural patterns from code)
│    └── → Mid-training (adapt to chat/instruction format)
│         └── → General self-supervised (broad knowledge via SmolTalk)
│              └── → Math self-supervised (reasoning via GSM8K)
│                   └── → Chat SFT (final instruction tuning)
├── → Direct mid-training (comparison branch)
└── → Other experimental forks
```

The self-supervised stages use **pseudo-labeling**: the model generates candidate responses for unlabeled prompts, scores them by confidence (average log-probability) or task reward, filters to the highest-quality candidates, and trains on the result. This loop can be iterated multiple times, progressively improving the model's own training signal.

The hypothesis driving the pipeline order is that learning **code structure first** (syntax, nesting, logical flow) provides transferable structural priors that benefit subsequent natural-language learning -- the model learns "systems of systems" thinking from code before encountering sentence structure and general knowledge.
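One round of the pseudo-labeling loop described above can be sketched as follows. This is an illustrative outline, not the fork's actual code: `model.generate` is a hypothetical API assumed to return a response plus per-token log-probabilities.

```python
def pseudo_label_round(model, prompts, samples_per_prompt=4, keep_fraction=0.25):
    """One pseudo-labeling round: sample candidates, score each by confidence
    (average token log-probability), and keep only the top fraction as new
    (prompt, response) training pairs."""
    candidates = []
    for prompt in prompts:
        for _ in range(samples_per_prompt):
            response, token_logprobs = model.generate(prompt)       # hypothetical API
            confidence = sum(token_logprobs) / len(token_logprobs)  # avg log-prob
            candidates.append((confidence, prompt, response))
    candidates.sort(key=lambda c: c[0], reverse=True)
    keep = max(1, int(len(candidates) * keep_fraction))
    return [(p, r) for _, p, r in candidates[:keep]]

class StubModel:
    """Stand-in for a real model: alternates confident and unsure fake samples."""
    def __init__(self):
        self.i = 0
    def generate(self, prompt):
        self.i += 1
        if self.i % 2 == 0:
            return "confident answer", [-0.1, -0.2, -0.1]
        return "unsure answer", [-2.0, -3.0, -2.5]

data = pseudo_label_round(StubModel(), ["p1", "p2"],
                          samples_per_prompt=2, keep_fraction=0.5)
print(len(data))  # 2 high-confidence pairs kept out of 4 candidates
```

In the actual pipeline the returned pairs would feed the next training stage, and the whole round is repeated to iteratively sharpen the model's own training signal.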
## Model Details

| Parameter | Value |
|-----------|-------|
| Architecture | CRATE-α |
| Layers | 12 |
| Hidden dim | 768 |
| Attention heads | 6 |
| Vocab size | 50304 |
| Max sequence length | 1024 |
| Window pattern | SSSL |
| ODL expansion | 4× (overcomplete dictionary) |
| Sparse activation | ReLU with learnable threshold |
| Training step | 20,000 |
| Validation BPB | 1.1131 |
| Smooth train loss | 3.7495 |
| Training time | 3.4 hours |
| Run name | 4090-crate-a |
| Batch size (tokens) | 65536 |

## Files

- `model.safetensors` -- model weights in safetensors format
- `config.json` -- model architecture config (reconstruct with `CRATEConfig(**config)`)
- `tokenizer.pkl` -- BPE tokenizer (pickled tiktoken `Encoding`)
- `token_bytes.pt` -- token byte mappings
- `meta.json` -- full training metadata from the checkpoint

## Usage

```python
import torch

from nanochat.checkpoint_manager import build_model

model, tokenizer, meta = build_model(
    "path/to/downloaded/dir",
    step=20000,
    device=torch.device("cuda"),
    phase="eval",
)
```

## References

- Yu et al., "White-Box Transformers via Sparse Rate Reduction" (NeurIPS 2023) -- original CRATE
- Yang et al., "Scaling White-Box Transformers for Vision" (NeurIPS 2024) -- CRATE-α
- Hendrycks et al., "Measuring Massive Multitask Language Understanding" (ICLR 2021) -- MMLU

## License

This model is released under the **MIT License**. Built on:

- [nanochat-crate-a](https://github.com/modularflow/nanochat-crate-a) -- CRATE integration, self-supervised pipeline, SDPA/Flash Attention
- [nanochat](https://github.com/karpathy/nanochat) by Andrej Karpathy -- MIT License, Copyright (c) 2025
- [CRATE](https://github.com/Ma-Lab-Berkeley/CRATE) (White-Box Transformers via Sparse Rate Reduction) by Ma-Lab-Berkeley -- MIT License, Copyright (c) 2023
- [CRATE-α](https://github.com/UCSC-VLAA/CRATE-alpha) (Scaling White-Box Transformers for Vision) by UCSC-VLAA