# crate-d12-base
A CRATE-α (Coding RAte reduction TransformEr) language model trained with nanochat-crate-a, a fork of nanochat that integrates the CRATE white-box transformer architecture, SDPA/Flash Attention, and a self-supervised pseudo-labeling pipeline for domain-specific mid-training and fine-tuning.
This checkpoint serves as the baseline for a series of experiments exploring self-supervised learning for mid-training and fine-tuning with the CRATE architecture.
## What is CRATE?
CRATE is a white-box transformer -- unlike standard transformers, whose architecture is heuristically designed, every layer of CRATE is mathematically derived from a principled optimization objective. Each layer alternates between two operations:
- **MSSA (Multi-Head Subspace Self-Attention)** -- a compression step that performs gradient descent on the coding rate reduction objective. Q, K, and V share a single tied projection matrix, which means the attention operation compresses token representations into low-dimensional subspaces.
- **ODL (Overcomplete Dictionary Learning)** -- a sparsification step that projects tokens into an overcomplete dictionary space (4× expansion), applies a sparse activation, and projects back. This encourages the model to learn sparse, interpretable representations at every layer.
The net effect is that each forward pass solves a structured optimization problem: compress and sparsify the representation, layer by layer. The resulting internal representations are significantly more interpretable than those of standard transformers.
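The MSSA step above can be sketched as ordinary scaled-dot-product attention in which Q, K, and V all come from one tied projection. This is a simplified illustrative sketch (names and dimensions are assumptions, not the repo's actual implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MSSA(nn.Module):
    """Subspace self-attention sketch: Q, K, and V all come from one tied matrix U."""
    def __init__(self, d: int, heads: int = 6):
        super().__init__()
        assert d % heads == 0
        self.heads = heads
        self.U = nn.Linear(d, d, bias=False)  # the single shared projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, d = x.shape
        # Project once, split into heads: (B, heads, T, d // heads)
        u = self.U(x).view(B, T, self.heads, d // self.heads).transpose(1, 2)
        # SDPA with q = k = v = U @ x: compression into learned subspaces
        out = F.scaled_dot_product_attention(u, u, u)
        out = out.transpose(1, 2).reshape(B, T, d)
        return x + out  # residual connection
```

Because the projection is tied, attention here acts as projection onto a set of learned subspaces rather than a general query-key match.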
## Why ReLU Instead of Soft-Thresholding?
The original CRATE paper (NeurIPS 2023) used ISTA-style soft-thresholding as the sparse activation: `S_lambda(x) = sign(x) * max(|x| - lambda, 0)`. This is the theoretically "correct" proximal operator for L1-regularized sparse coding, but it caused training instability at scale. The repository exposes both activations as options.
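The soft-thresholding operator is a one-liner in PyTorch (a minimal sketch; `soft_threshold` is an illustrative name, not a function from the repo):

```python
import torch

def soft_threshold(x: torch.Tensor, lam: float) -> torch.Tensor:
    # S_lambda(x) = sign(x) * max(|x| - lambda, 0): shrink toward zero by lam,
    # zeroing anything with magnitude below lam
    return torch.sign(x) * torch.clamp(x.abs() - lam, min=0.0)

out = soft_threshold(torch.tensor([-2.0, 0.5, 3.0]), 1.0)  # -> [-1.0, 0.0, 2.0]
```

Note the sign discontinuity at zero: the operator shrinks positive and negative values symmetrically, which is exactly the part ReLU discards.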
CRATE-α (NeurIPS 2024) introduced three modifications that enable scaling:
| Change | Vanilla CRATE | CRATE-α |
|---|---|---|
| Dictionary | Complete (d × d) | Overcomplete (d × 4d) |
| Activation | Soft-threshold | ReLU with learnable bias |
| Sparse block | No residual | Residual connection |
ReLU works better for scaling because: (a) it has a well-behaved gradient everywhere (no sign discontinuity), (b) the learnable threshold/bias allows each neuron to adaptively set its own sparsity level during training, and (c) ReLU is heavily optimized in GPU kernels. The resulting ODL block looks structurally similar to a standard MLP -- but it is derived from sparse coding principles rather than heuristically chosen, giving it a principled interpretation as dictionary learning.
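With those three changes, the ODL block can be sketched as follows (an illustrative PyTorch sketch with assumed names, showing the 4× overcomplete dictionary, ReLU with learnable bias, and the residual connection):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ODL(nn.Module):
    """CRATE-alpha sparse coding step (sketch): overcomplete dictionary,
    ReLU with learnable per-neuron bias, residual connection."""
    def __init__(self, d: int, expansion: int = 4):
        super().__init__()
        # Encoder bias acts as the learnable threshold: relu(Wx + b) zeroes
        # any code whose pre-activation falls below its neuron's threshold
        self.encode = nn.Linear(d, expansion * d, bias=True)
        self.decode = nn.Linear(expansion * d, d, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = F.relu(self.encode(x))   # sparse nonnegative codes in 4x dictionary space
        return x + self.decode(z)    # residual connection (absent in vanilla CRATE)
```

Structurally this is a standard MLP block, as the text notes; the sparse-coding derivation is what licenses reading `z` as dictionary coefficients.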
## Evaluation: MMLU
This model is evaluated against MMLU (Massive Multitask Language Understanding), a benchmark of 57 subjects spanning STEM, humanities, social sciences, and professional domains. MMLU tests the model's ability to answer multiple-choice questions requiring world knowledge and reasoning -- from abstract algebra and anatomy to US foreign policy and virology. It provides a broad signal for how much general knowledge the model has absorbed during pre-training.
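A common way to run a base model on MMLU-style multiple choice is to score each candidate answer by its average log-probability under the model and pick the highest. A hedged sketch (`model` and `tokenizer` stand in for the loaded checkpoint; the exact harness used for this model may differ):

```python
import torch

def score_choice(model, tokenizer, prompt: str, choice: str) -> float:
    """Average log-probability of the choice tokens given the prompt."""
    prompt_ids = tokenizer.encode(prompt)
    choice_ids = tokenizer.encode(choice)
    ids = torch.tensor([prompt_ids + choice_ids])
    with torch.no_grad():
        logits = model(ids)                    # (1, T, vocab)
    logprobs = torch.log_softmax(logits, dim=-1)
    total = 0.0
    for i, tok in enumerate(choice_ids):
        pos = len(prompt_ids) + i - 1          # logits at pos predict the token at pos + 1
        total += logprobs[0, pos, tok].item()
    return total / len(choice_ids)

# best = max("ABCD", key=lambda c: score_choice(model, tokenizer, question, c))
```

Averaging (rather than summing) log-probabilities avoids penalizing answer options that tokenize into more pieces.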
## Baseline for Self-Supervised Experiments
This checkpoint is the starting point for a multi-stage experimental pipeline:
```
crate-d12-base (this model)
├── → Code self-supervised (learn structural patterns from code)
│   └── → Mid-training (adapt to chat/instruction format)
│       └── → General self-supervised (broad knowledge via SmolTalk)
│           └── → Math self-supervised (reasoning via GSM8K)
│               └── → Chat SFT (final instruction tuning)
├── → Direct mid-training (comparison branch)
└── → Other experimental forks
```
The self-supervised stages use pseudo-labeling: the model generates candidate responses for unlabeled prompts, scores them by confidence (average log-probability) or task reward, filters to the highest-quality candidates, and trains on the result. This loop can be iterated multiple times, progressively improving the model's own training signal.
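One round of that loop can be sketched generically (an illustrative sketch; `generate` and `score` stand in for the model's sampling routine and its confidence scorer, e.g. average log-probability or task reward):

```python
def pseudo_label_round(prompts, generate, score, keep_frac=0.25):
    """One pseudo-labeling round: sample candidate responses for each prompt,
    score them (higher = better), and keep the top fraction as training pairs."""
    scored = []
    for p in prompts:
        for cand in generate(p):               # candidate responses for prompt p
            scored.append((score(p, cand), p, cand))
    scored.sort(key=lambda t: t[0], reverse=True)
    keep = max(1, int(len(scored) * keep_frac))
    return [(p, cand) for _, p, cand in scored[:keep]]
```

The returned (prompt, response) pairs become the fine-tuning set for the next iteration, so each round sharpens the model's own training signal.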
The hypothesis driving the pipeline order is that learning code structure first (syntax, nesting, logical flow) provides transferable structural priors that benefit subsequent natural language learning -- the model learns "systems of systems" thinking from code before encountering sentence structure and general knowledge.
## Model Details
| Parameter | Value |
|---|---|
| Architecture | CRATE-α |
| Layers | 12 |
| Hidden dim | 768 |
| Attention heads | 6 |
| Vocab size | 50304 |
| Max sequence length | 1024 |
| Window pattern | SSSL |
| ODL expansion | 4× (overcomplete dictionary) |
| Sparse activation | ReLU with learnable threshold |
| Training step | 20,000 |
| Validation BPB | 1.1131 |
| Smooth train loss | 3.7495 |
| Training time | 3.4 hours |
| Run name | 4090-crate-a |
| Batch size (tokens) | 65536 |
## Files
- `model.safetensors` -- model weights in safetensors format
- `config.json` -- model architecture config (reconstruct with `CRATEConfig(**config)`)
- `tokenizer.pkl` -- BPE tokenizer (pickle of tiktoken Encoding)
- `token_bytes.pt` -- token byte mappings
- `meta.json` -- full training metadata from the checkpoint
## Usage
```python
import torch
from nanochat.checkpoint_manager import build_model

model, tokenizer, meta = build_model(
    "path/to/downloaded/dir",
    step=20000,
    device=torch.device("cuda"),
    phase="eval",
)
```
## References
- Yu et al., "White-Box Transformers via Sparse Rate Reduction" (NeurIPS 2023) -- original CRATE
- Yang et al., "Scaling White-Box Transformers for Vision" (NeurIPS 2024) -- CRATE-α
- Hendrycks et al., "Measuring Massive Multitask Language Understanding" (ICLR 2021) -- MMLU
## License
This model is released under the MIT License.
Built on:
- nanochat-crate-a -- CRATE integration, self-supervised pipeline, SDPA/Flash Attention
- nanochat by Andrej Karpathy -- MIT License, Copyright (c) 2025
- CRATE (White-Box Transformers via Sparse Rate Reduction) by Ma-Lab-Berkeley -- MIT License, Copyright (c) 2023
- CRATE-α (Scaling White-Box Transformers for Vision) by UCSC-VLAA