---
tags:
- nanochat
- crate
- white-box
- sparse-coding
license: mit
---

# crate-d12-base

A **CRATE-α** (Coding RAte reduction TransformEr) language model trained with
[nanochat-crate-a](https://github.com/modularflow/nanochat-crate-a), a fork of
[nanochat](https://github.com/karpathy/nanochat) that integrates the CRATE
white-box transformer architecture, SDPA/Flash Attention, and a self-supervised
pseudo-labeling pipeline for domain-specific mid-training and fine-tuning.

This checkpoint serves as the **baseline** for a series of experiments exploring
self-supervised learning for mid-training and fine-tuning with the CRATE
architecture.

## What is CRATE?

CRATE is a **white-box transformer** -- unlike standard transformers, whose
architecture is heuristically designed, every layer of CRATE is mathematically
derived from a principled optimization objective. Each layer alternates between
two operations:

1. **MSSA (Multi-Head Subspace Self-Attention)** -- a *compression* step that
   performs gradient descent on the *coding rate reduction* objective. Q, K, and
   V share a single tied projection matrix, which means the attention operation
   compresses token representations into low-dimensional subspaces.

2. **ODL (Overcomplete Dictionary Learning)** -- a *sparsification* step that
   projects tokens into an overcomplete dictionary space (4× expansion),
   applies a sparse activation, and projects back. This encourages the model to
   learn sparse, interpretable representations at every layer.

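The alternation of these two steps can be sketched in PyTorch. This is an
illustrative sketch only -- module names, normalization placement, and head
bookkeeping are simplified relative to the actual nanochat-crate-a
implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CRATELayer(nn.Module):
    """One CRATE-alpha style block: MSSA compression, then ODL sparsification.

    Simplified sketch -- not the actual nanochat-crate-a module.
    """

    def __init__(self, d_model=768, n_heads=6, expansion=4):
        super().__init__()
        self.n_heads = n_heads
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)
        # MSSA: Q, K, and V share ONE tied projection (the subspace bases).
        self.subspace = nn.Linear(d_model, d_model, bias=False)
        # ODL: overcomplete dictionary (d -> 4d) and its decoder (4d -> d).
        self.encode = nn.Linear(d_model, expansion * d_model)  # bias = learnable threshold
        self.decode = nn.Linear(expansion * d_model, d_model)

    def forward(self, x):
        # Compression step: attention where query, key, and value are all
        # the SAME tied projection of the normalized input.
        z = self.subspace(self.ln1(x))
        B, T, d = z.shape
        h = z.view(B, T, self.n_heads, d // self.n_heads).transpose(1, 2)
        attn = F.scaled_dot_product_attention(h, h, h, is_causal=True)  # SDPA/Flash
        x = x + attn.transpose(1, 2).reshape(B, T, d)
        # Sparsification step: project into the 4x overcomplete dictionary,
        # apply ReLU (one-sided thresholding), project back, with residual.
        x = x + self.decode(F.relu(self.encode(self.ln2(x))))
        return x
```
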
The net effect is that each forward pass solves a structured optimization
problem: *compress* and *sparsify* the representation, layer by layer. The
resulting internal representations are significantly more interpretable than
those of standard transformers.

### Why ReLU Instead of Soft-Thresholding?

The original CRATE paper (NeurIPS 2023) used ISTA-style **soft-thresholding**
as the sparse activation: `S_lambda(x) = sign(x) * max(|x| - lambda, 0)`.
This is the theoretically "correct" proximal operator for L1-regularized sparse
coding, but it caused training instability at scale. The repository supports
both activations as configuration options.
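
The two activations are easy to compare directly. A standalone sketch (not the
repository's code; `lam` and `bias` are fixed scalars here, whereas CRATE-α
learns the bias per neuron):

```python
import torch

def soft_threshold(x, lam=0.1):
    # ISTA proximal operator for the L1 penalty:
    # shrink every entry toward zero by lam, zeroing the small ones.
    return torch.sign(x) * torch.clamp(x.abs() - lam, min=0.0)

def relu_threshold(x, bias=-0.1):
    # CRATE-alpha variant: one-sided shrinkage via ReLU,
    # with the bias acting as the threshold.
    return torch.relu(x + bias)

x = torch.tensor([-0.5, -0.05, 0.0, 0.05, 0.5])
print(soft_threshold(x))  # small entries zeroed, large ones shrunk by 0.1
print(relu_threshold(x))  # negatives removed entirely, positives shrunk
```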

CRATE-α (NeurIPS 2024) introduced three modifications that enable scaling:

| Change | Vanilla CRATE | CRATE-α |
|--------|---------------|---------|
| Dictionary | Complete (d × d) | Overcomplete (d × 4d) |
| Activation | Soft-threshold | **ReLU** with learnable bias |
| Sparse block | No residual | **Residual connection** |

**ReLU** works better for scaling because: (a) it has a well-behaved gradient
everywhere (no sign discontinuity), (b) the learnable threshold/bias allows
each neuron to adaptively set its own sparsity level during training, and
(c) ReLU is heavily optimized in GPU kernels. The resulting ODL block looks
structurally similar to a standard MLP -- but it is *derived from* sparse coding
principles rather than heuristically chosen, giving it a principled
interpretation as dictionary learning.
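
The bias-as-threshold design makes sparsity directly measurable: with a
negative bias, most dictionary activations are exactly zero. A quick
illustrative check (not the repo's code; sizes match the table above, the
batch and bias value are arbitrary):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d, expansion = 768, 4
encode = torch.nn.Linear(d, expansion * d)  # overcomplete dictionary encoder
with torch.no_grad():
    encode.bias.fill_(-0.5)  # a negative bias acts as an activation threshold

tokens = torch.randn(16, d)                  # a batch of token representations
acts = F.relu(encode(tokens))                # sparse dictionary coefficients
sparsity = (acts == 0).float().mean().item()
print(f"fraction of zero activations: {sparsity:.2f}")
```

During training the bias is learned per dictionary atom, so each atom settles
on its own sparsity level.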

## Evaluation: MMLU

This model is evaluated against **MMLU** (Massive Multitask Language
Understanding), a benchmark of 57 subjects spanning STEM, humanities, social
sciences, and professional domains. MMLU tests the model's ability to answer
multiple-choice questions requiring world knowledge and reasoning -- from
abstract algebra and anatomy to US foreign policy and virology. It provides a
broad signal for how much general knowledge the model has absorbed during
pre-training.

## Baseline for Self-Supervised Experiments

This checkpoint is the starting point for a multi-stage experimental pipeline:

```
crate-d12-base (this model)
├── Code self-supervised (learn structural patterns from code)
│   └── Mid-training (adapt to chat/instruction format)
│       └── General self-supervised (broad knowledge via SmolTalk)
│           └── Math self-supervised (reasoning via GSM8K)
│               └── Chat SFT (final instruction tuning)
├── Direct mid-training (comparison branch)
└── Other experimental forks
```

The self-supervised stages use **pseudo-labeling**: the model generates candidate
responses for unlabeled prompts, scores them by confidence (average log-probability)
or task reward, filters to the highest-quality candidates, and trains on the
result. This loop can be iterated multiple times, progressively improving the
model's own training signal.
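
The confidence-filtering step can be sketched as follows. Function and variable
names here are hypothetical -- the actual pipeline lives in nanochat-crate-a:

```python
def select_pseudo_labels(candidates, keep_fraction=0.25):
    """Keep the most confident model-generated candidates for self-training.

    `candidates` holds (prompt, response, token_logprobs) tuples, where
    token_logprobs are the model's per-token log-probabilities.
    """
    # Confidence = average log-probability per generated token.
    scored = [
        (sum(lps) / len(lps), prompt, response)
        for prompt, response, lps in candidates
    ]
    scored.sort(reverse=True)  # most confident first
    k = max(1, int(len(scored) * keep_fraction))
    return [(prompt, response) for _, prompt, response in scored[:k]]

# Two candidates for the same prompt; only the confident one survives.
cands = [
    ("q1", "a1", [-0.1, -0.2]),  # avg log-prob -0.15 (high confidence)
    ("q1", "a2", [-2.0, -3.0]),  # avg log-prob -2.5 (low confidence)
]
print(select_pseudo_labels(cands, keep_fraction=0.5))  # [('q1', 'a1')]
```

Iterating the loop re-scores fresh generations with the updated model, so the
training signal improves as the model does.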

The hypothesis driving the pipeline order is that learning **code structure
first** (syntax, nesting, logical flow) provides transferable structural priors
that benefit subsequent natural language learning -- the model learns "systems
of systems" thinking from code before encountering sentence structure and
general knowledge.

## Model Details

| Parameter | Value |
|-----------|-------|
| Architecture | CRATE-α |
| Layers | 12 |
| Hidden dim | 768 |
| Attention heads | 6 |
| Vocab size | 50304 |
| Max sequence length | 1024 |
| Window pattern | SSSL |
| ODL expansion | 4× (overcomplete dictionary) |
| Sparse activation | ReLU with learnable threshold |
| Training step | 20,000 |
| Validation BPB | 1.1131 |
| Smooth train loss | 3.7495 |
| Training time | 3.4 hours |
| Run name | 4090-crate-a |
| Batch size (tokens) | 65536 |

## Files

- `model.safetensors` -- model weights in safetensors format
- `config.json` -- model architecture config (reconstruct with `CRATEConfig(**config)`)
- `tokenizer.pkl` -- BPE tokenizer (pickle of a tiktoken `Encoding`)
- `token_bytes.pt` -- token byte mappings
- `meta.json` -- full training metadata from the checkpoint

## Usage

```python
import torch

from nanochat.checkpoint_manager import build_model

model, tokenizer, meta = build_model(
    "path/to/downloaded/dir", step=20000, device=torch.device("cuda"), phase="eval"
)
```

## References

- Yu et al., "White-Box Transformers via Sparse Rate Reduction" (NeurIPS 2023) -- original CRATE
- Yang et al., "Scaling White-Box Transformers for Vision" (NeurIPS 2024) -- CRATE-α
- Hendrycks et al., "Measuring Massive Multitask Language Understanding" (ICLR 2021) -- MMLU

## License

This model is released under the **MIT License**.

Built on:

- [nanochat-crate-a](https://github.com/modularflow/nanochat-crate-a) -- CRATE integration, self-supervised pipeline, SDPA/Flash Attention
- [nanochat](https://github.com/karpathy/nanochat) by Andrej Karpathy -- MIT License, Copyright (c) 2025
- [CRATE](https://github.com/Ma-Lab-Berkeley/CRATE) (White-Box Transformers via Sparse Rate Reduction) by Ma-Lab-Berkeley -- MIT License, Copyright (c) 2023
- [CRATE-α](https://github.com/UCSC-VLAA/CRATE-alpha) (Scaling White-Box Transformers for Vision) by UCSC-VLAA