---
tags:
- nanochat
- crate
- white-box
- sparse-coding
license: mit
---

# crate-d12-base

A **CRATE-α** (Coding RAte reduction TransformEr) language model trained with [nanochat-crate-a](https://github.com/modularflow/nanochat-crate-a), a fork of [nanochat](https://github.com/karpathy/nanochat) that integrates the CRATE white-box transformer architecture, SDPA/Flash Attention, and a self-supervised pseudo-labeling pipeline for domain-specific mid-training and fine-tuning.

This checkpoint serves as the **baseline** for a series of experiments exploring self-supervised learning for mid-training and fine-tuning with the CRATE architecture.

## What is CRATE?

CRATE is a **white-box transformer**: unlike standard transformers, whose architectures are heuristically designed, every layer of CRATE is mathematically derived from a principled optimization objective. Each layer alternates between two operations:

1. **MSSA (Multi-Head Subspace Self-Attention)** -- a *compression* step that performs gradient descent on the *coding rate reduction* objective. Q, K, and V share a single tied projection matrix, so the attention operation compresses token representations into low-dimensional subspaces.
2. **ODL (Overcomplete Dictionary Learning)** -- a *sparsification* step that projects tokens into an overcomplete dictionary space (4× expansion), applies a sparse activation, and projects back. This encourages the model to learn sparse, interpretable representations at every layer.

The net effect is that each forward pass solves a structured optimization problem: *compress* and *sparsify* the representation, layer by layer. The resulting internal representations are significantly more interpretable than those of standard transformers.

### Why ReLU Instead of Soft-Thresholding?

The original CRATE paper (NeurIPS 2023) used ISTA-style **soft-thresholding** as the sparse activation: `S_lambda(x) = sign(x) * max(|x| - lambda, 0)`.
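As a concrete illustration (a minimal NumPy sketch, not the repository's implementation), the operator zeroes out coefficients below the threshold and shrinks the rest toward zero:

```python
import numpy as np

def soft_threshold(x: np.ndarray, lam: float) -> np.ndarray:
    """ISTA-style soft-thresholding: S_lambda(x) = sign(x) * max(|x| - lambda, 0)."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

x = np.array([-1.5, -0.2, 0.0, 0.3, 2.0])
print(soft_threshold(x, 0.5))  # small entries -> 0, large entries shrink by lambda
```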
This is the theoretically "correct" proximal operator for L1-regularized sparse coding, but it caused training instability at scale. The repository supports either activation. CRATE-α (NeurIPS 2024) introduced three modifications that enable scaling:

| Change | Vanilla CRATE | CRATE-α |
|--------|---------------|---------|
| Dictionary | Complete (d × d) | Overcomplete (d × 4d) |
| Activation | Soft-threshold | **ReLU** with learnable bias |
| Sparse block | No residual | **Residual connection** |

**ReLU** works better for scaling because: (a) it has a well-behaved gradient everywhere (no sign discontinuity), (b) the learnable threshold/bias lets each neuron adaptively set its own sparsity level during training, and (c) ReLU is heavily optimized in GPU kernels. The resulting ODL block is structurally similar to a standard MLP -- but it is *derived from* sparse coding principles rather than heuristically chosen, giving it a principled interpretation as dictionary learning.

## Evaluation: MMLU

This model is evaluated on **MMLU** (Massive Multitask Language Understanding), a benchmark of 57 subjects spanning STEM, the humanities, social sciences, and professional domains. MMLU tests the model's ability to answer multiple-choice questions requiring world knowledge and reasoning -- from abstract algebra and anatomy to US foreign policy and virology. It provides a broad signal for how much general knowledge the model absorbed during pre-training.
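nanochat's actual evaluation harness is not reproduced here; as a rough sketch of how base models are commonly scored on MMLU, the model's log-probability of each answer letter is compared and the argmax is taken (`logprob_fn` below is a hypothetical stand-in for a real model call):

```python
# Hypothetical sketch of likelihood-based multiple-choice scoring, not
# nanochat's actual eval code: pick the answer letter the model assigns
# the highest log-probability, given the question as context.

def pick_answer(logprob_fn, question: str, choices: list[str]) -> int:
    """Return the index of the choice with the highest model log-probability."""
    letters = "ABCD"
    scores = [logprob_fn(question, letters[i]) for i in range(len(choices))]
    return max(range(len(choices)), key=scores.__getitem__)

# Stub scorer standing in for a real model; here it simply prefers "C".
fake_logprobs = {"A": -2.3, "B": -1.9, "C": -0.4, "D": -3.1}
idx = pick_answer(lambda q, letter: fake_logprobs[letter],
                  "Which of the following is a prime number?",
                  ["8", "9", "7", "6"])
print("ABCD"[idx])  # prints "C", the stub's highest-scoring letter
```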
## Baseline for Self-Supervised Experiments

This checkpoint is the starting point for a multi-stage experimental pipeline:

```
crate-d12-base (this model)
├── → Code self-supervised (learn structural patterns from code)
│    └── → Mid-training (adapt to chat/instruction format)
│         └── → General self-supervised (broad knowledge via SmolTalk)
│              └── → Math self-supervised (reasoning via GSM8K)
│                   └── → Chat SFT (final instruction tuning)
├── → Direct mid-training (comparison branch)
└── → Other experimental forks
```

The self-supervised stages use **pseudo-labeling**: the model generates candidate responses for unlabeled prompts, scores them by confidence (average log-probability) or task reward, filters to the highest-quality candidates, and trains on the result. This loop can be iterated multiple times, progressively improving the model's own training signal.

The hypothesis driving the pipeline order is that learning **code structure first** (syntax, nesting, logical flow) provides transferable structural priors that benefit subsequent natural-language learning -- the model learns "systems of systems" thinking from code before encountering sentence structure and general knowledge.
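One round of the pseudo-labeling loop described above can be sketched as follows. This is an illustrative outline, not the fork's actual code: `model.generate` is a hypothetical API assumed to return a response plus per-token log-probabilities.

```python
def pseudo_label_round(model, prompts, samples_per_prompt=4, keep_fraction=0.25):
    """One pseudo-labeling round: sample candidates, score each by confidence
    (average token log-probability), and keep only the top fraction as new
    (prompt, response) training pairs."""
    candidates = []
    for prompt in prompts:
        for _ in range(samples_per_prompt):
            response, token_logprobs = model.generate(prompt)       # hypothetical API
            confidence = sum(token_logprobs) / len(token_logprobs)  # avg log-prob
            candidates.append((confidence, prompt, response))
    candidates.sort(key=lambda c: c[0], reverse=True)
    keep = max(1, int(len(candidates) * keep_fraction))
    return [(p, r) for _, p, r in candidates[:keep]]

class StubModel:
    """Stand-in for a real model: alternates confident and unsure fake samples."""
    def __init__(self):
        self.i = 0
    def generate(self, prompt):
        self.i += 1
        if self.i % 2 == 0:
            return "confident answer", [-0.1, -0.2, -0.1]
        return "unsure answer", [-2.0, -3.0, -2.5]

data = pseudo_label_round(StubModel(), ["p1", "p2"],
                          samples_per_prompt=2, keep_fraction=0.5)
print(len(data))  # 2 high-confidence pairs kept out of 4 candidates
```

In the actual pipeline the returned pairs would feed the next training stage, and the whole round is repeated to iteratively sharpen the model's own training signal.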
## Model Details

| Parameter | Value |
|-----------|-------|
| Architecture | CRATE-α |
| Layers | 12 |
| Hidden dim | 768 |
| Attention heads | 6 |
| Vocab size | 50304 |
| Max sequence length | 1024 |
| Window pattern | SSSL |
| ODL expansion | 4× (overcomplete dictionary) |
| Sparse activation | ReLU with learnable threshold |
| Training step | 20,000 |
| Validation BPB | 1.1131 |
| Smooth train loss | 3.7495 |
| Training time | 3.4 hours |
| Run name | 4090-crate-a |
| Batch size (tokens) | 65536 |

## Files

- `model.safetensors` -- model weights in safetensors format
- `config.json` -- model architecture config (reconstruct with `CRATEConfig(**config)`)
- `tokenizer.pkl` -- BPE tokenizer (pickled tiktoken `Encoding`)
- `token_bytes.pt` -- token byte mappings
- `meta.json` -- full training metadata from the checkpoint

## Usage

```python
import torch

from nanochat.checkpoint_manager import build_model

model, tokenizer, meta = build_model(
    "path/to/downloaded/dir",
    step=20000,
    device=torch.device("cuda"),
    phase="eval",
)
```

## References

- Yu et al., "White-Box Transformers via Sparse Rate Reduction" (NeurIPS 2023) -- original CRATE
- Yang et al., "Scaling White-Box Transformers for Vision" (NeurIPS 2024) -- CRATE-α
- Hendrycks et al., "Measuring Massive Multitask Language Understanding" (ICLR 2021) -- MMLU

## License

This model is released under the **MIT License**. Built on:

- [nanochat-crate-a](https://github.com/modularflow/nanochat-crate-a) -- CRATE integration, self-supervised pipeline, SDPA/Flash Attention
- [nanochat](https://github.com/karpathy/nanochat) by Andrej Karpathy -- MIT License, Copyright (c) 2025
- [CRATE](https://github.com/Ma-Lab-Berkeley/CRATE) (White-Box Transformers via Sparse Rate Reduction) by Ma-Lab-Berkeley -- MIT License, Copyright (c) 2023
- [CRATE-α](https://github.com/UCSC-VLAA/CRATE-alpha) (Scaling White-Box Transformers for Vision) by UCSC-VLAA