---
tags:
- nanochat
- crate
- white-box
- sparse-coding
license: mit
---
# crate-d12-base
A **CRATE-α** (Coding RAte reduction TransformEr) language model trained with
[nanochat-crate-a](https://github.com/modularflow/nanochat-crate-a), a fork of
[nanochat](https://github.com/karpathy/nanochat) that integrates the CRATE
white-box transformer architecture, SDPA/Flash Attention, and a self-supervised
pseudo-labeling pipeline for domain-specific mid-training and fine-tuning.
This checkpoint serves as the **baseline** for a series of experiments exploring
self-supervised learning for mid-training and fine-tuning with the CRATE
architecture.
## What is CRATE?
CRATE is a **white-box transformer** -- unlike standard transformers where the
architecture is heuristically designed, every layer of CRATE is mathematically
derived from a principled optimization objective. Each layer alternates between
two operations:
1. **MSSA (Multi-Head Subspace Self-Attention)** -- a *compression* step that
performs gradient descent on the *coding rate reduction* objective. Q, K, and
V share a single tied projection matrix, which means the attention operation
is compressing token representations into low-dimensional subspaces.
2. **ODL (Overcomplete Dictionary Learning)** -- a *sparsification* step that
projects tokens into an overcomplete dictionary space (4× expansion),
applies a sparse activation, and projects back. This encourages the model to
learn sparse, interpretable representations at every layer.
The net effect is that each forward pass solves a structured optimization
problem: *compress* and *sparsify* the representation, layer by layer. The
resulting internal representations are significantly more interpretable than
those of standard transformers.
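As a rough single-head sketch of how the two steps alternate (illustrative only; shapes, names, and the step size are assumptions, not the repository's actual implementation):

```python
import numpy as np

def crate_layer(X, U, D, bias, step=0.1):
    """Illustrative CRATE-style layer: MSSA compression, then ODL sparsification.

    X:    (n_tokens, d) token representations
    U:    (d, d) tied projection shared by Q, K, and V (single head for brevity)
    D:    (d, 4*d) overcomplete dictionary
    bias: (4*d,) learnable ReLU threshold
    """
    # --- MSSA: compress via subspace self-attention with one tied projection ---
    P = X @ U                                   # shared Q = K = V projection
    scores = P @ P.T / np.sqrt(P.shape[1])      # similarity within the subspace
    attn = np.exp(scores - scores.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)     # row-wise softmax
    X = X + step * (attn @ P) @ U.T             # residual, gradient-style update

    # --- ODL: sparsify against the 4x overcomplete dictionary ---
    Z = np.maximum(X @ D - bias, 0.0)           # ReLU with learnable threshold
    return X + Z @ D.T                          # residual connection (CRATE-alpha)
```

The key structural difference from a standard block is the tied Q/K/V projection `U` and the interpretation of the MLP-like second half as dictionary coding.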
### Why ReLU Instead of Soft-Thresholding?
The original CRATE paper (NeurIPS 2023) used ISTA-style **soft-thresholding**
as the sparse activation: `S_lambda(x) = sign(x) * max(|x| - lambda, 0)`.
This is the theoretically "correct" proximal operator for L1-regularized sparse
coding, but it caused training instability at scale. The git repo has options to use either.
CRATE-α (NeurIPS 2024) introduced three modifications that enable scaling:
| Change | Vanilla CRATE | CRATE-α |
|--------|---------------|---------|
| Dictionary | Complete (d × d) | Overcomplete (d × 4d) |
| Activation | Soft-threshold | **ReLU** with learnable bias |
| Sparse block | No residual | **Residual connection** |
**ReLU** works better for scaling because: (a) it has a well-behaved gradient
everywhere (no sign discontinuity), (b) the learnable threshold/bias allows
each neuron to adaptively set its own sparsity level during training, and
(c) ReLU is heavily optimized in GPU kernels. The resulting ODL block looks
structurally similar to a standard MLP -- but it is *derived from* sparse coding
principles rather than heuristically chosen, giving it a principled
interpretation as dictionary learning.
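The two activations are easy to compare side by side (a minimal sketch, not the repository's code):

```python
import numpy as np

def soft_threshold(x, lam):
    """ISTA proximal operator for the L1 penalty: shrinks toward zero from both sides."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def relu_with_bias(x, bias):
    """CRATE-alpha activation: one-sided thresholding with a (learnable) per-neuron bias."""
    return np.maximum(x - bias, 0.0)
```

Both zero out small activations, but soft-thresholding keeps large negative values (with the sign discontinuity at zero), while ReLU keeps only the positive side and lets each neuron learn its own threshold through the bias.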
## Evaluation: MMLU
This model is evaluated against **MMLU** (Massive Multitask Language
Understanding), a benchmark of 57 subjects spanning STEM, humanities, social
sciences, and professional domains. MMLU tests the model's ability to answer
multiple-choice questions requiring world knowledge and reasoning -- from
abstract algebra and anatomy to US foreign policy and virology. It provides a
broad signal for how much general knowledge the model has absorbed during
pre-training.
## Baseline for Self-Supervised Experiments
This checkpoint is the starting point for a multi-stage experimental pipeline:
```
crate-d12-base (this model)
├── Code self-supervised (learn structural patterns from code)
│   └── Mid-training (adapt to chat/instruction format)
│       └── General self-supervised (broad knowledge via SmolTalk)
│           └── Math self-supervised (reasoning via GSM8K)
│               └── Chat SFT (final instruction tuning)
├── Direct mid-training (comparison branch)
└── Other experimental forks
```
The self-supervised stages use **pseudo-labeling**: the model generates candidate
responses for unlabeled prompts, scores them by confidence (average log-probability)
or task reward, filters to the highest-quality candidates, and trains on the
result. This loop can be iterated multiple times, progressively improving the
model's own training signal.
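One round of that loop can be sketched as follows (the `generate` and `avg_logprob` methods are hypothetical stand-ins for the real sampling and scoring code, and the thresholds are illustrative):

```python
def pseudo_label_round(model, prompts, n_candidates=4, keep_frac=0.25):
    """One pseudo-labeling round: sample, score by confidence, keep the best.

    `model.generate(prompt)` and `model.avg_logprob(prompt, response)` are
    placeholders, not the repository's actual API.
    """
    scored = []
    for prompt in prompts:
        for _ in range(n_candidates):
            response = model.generate(prompt)
            # Confidence = average log-probability of the generated tokens
            scored.append((model.avg_logprob(prompt, response), prompt, response))
    scored.sort(reverse=True, key=lambda t: t[0])
    keep = max(1, int(len(scored) * keep_frac))
    # The highest-confidence (prompt, response) pairs become training data
    return [(p, r) for _, p, r in scored[:keep]]
```

Iterating this loop lets the (gradually improving) model bootstrap its own supervision signal.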
The hypothesis driving the pipeline order is that learning **code structure
first** (syntax, nesting, logical flow) provides transferable structural priors
that benefit subsequent natural language learning -- the model learns "systems
of systems" thinking from code before encountering sentence structure and
general knowledge.
## Model Details
| Parameter | Value |
|-----------|-------|
| Architecture | CRATE-α |
| Layers | 12 |
| Hidden dim | 768 |
| Attention heads | 6 |
| Vocab size | 50304 |
| Max sequence length | 1024 |
| Window pattern | SSSL |
| ODL expansion | 4× (overcomplete dictionary) |
| Sparse activation | ReLU with learnable threshold |
| Training step | 20,000 |
| Validation BPB | 1.1131 |
| Smooth train loss | 3.7495 |
| Training time | 3.4 hours |
| Run name | 4090-crate-a |
| Batch size (tokens) | 65536 |
## Files
- `model.safetensors` -- model weights in safetensors format
- `config.json` -- model architecture config (reconstruct with `CRATEConfig(**config)`)
- `tokenizer.pkl` -- BPE tokenizer (pickle of tiktoken Encoding)
- `token_bytes.pt` -- token byte mappings
- `meta.json` -- full training metadata from the checkpoint
## Usage
```python
import torch
from nanochat.checkpoint_manager import build_model

model, tokenizer, meta = build_model(
    "path/to/downloaded/dir", step=20000, device=torch.device("cuda"), phase="eval"
)
```
## References
- Yu et al., "White-Box Transformers via Sparse Rate Reduction" (NeurIPS 2023) -- original CRATE
- Yang et al., "Scaling White-Box Transformers for Vision" (NeurIPS 2024) -- CRATE-α
- Hendrycks et al., "Measuring Massive Multitask Language Understanding" (ICLR 2021) -- MMLU
## License
This model is released under the **MIT License**.
Built on:
- [nanochat-crate-a](https://github.com/modularflow/nanochat-crate-a) -- CRATE integration, self-supervised pipeline, SDPA/Flash Attention
- [nanochat](https://github.com/karpathy/nanochat) by Andrej Karpathy -- MIT License, Copyright (c) 2025
- [CRATE](https://github.com/Ma-Lab-Berkeley/CRATE) (White-Box Transformers via Sparse Rate Reduction) by Ma-Lab-Berkeley -- MIT License, Copyright (c) 2023
- [CRATE-α](https://github.com/UCSC-VLAA/CRATE-alpha) (Scaling White-Box Transformers for Vision) by UCSC-VLAA