Comprehensive model card: CRATE architecture, MMLU, ReLU scaling, experiment baseline
README.md
---
tags:
- nanochat
- crate
- white-box
- sparse-coding
license: mit
---

# crate-d12-base

A **CRATE-α** (Coding RAte reduction TransformEr) language model trained with
[nanochat](https://github.com/karpathy/nanochat). This checkpoint serves as the
**baseline** for a series of experiments exploring self-supervised learning for
mid-training and fine-tuning with the CRATE architecture.

## What is CRATE?

CRATE is a **white-box transformer** -- unlike standard transformers, whose
architecture is heuristically designed, every layer of CRATE is mathematically
derived from a principled optimization objective. Each layer alternates between
two operations:

1. **MSSA (Multi-Head Subspace Self-Attention)** -- a *compression* step that
   performs gradient descent on the *coding rate reduction* objective. Q, K, and
   V share a single tied projection matrix, which means the attention operation
   is compressing token representations into low-dimensional subspaces.

2. **ODL (Overcomplete Dictionary Learning)** -- a *sparsification* step that
   projects tokens into an overcomplete dictionary space (4× expansion),
   applies a sparse activation, and projects back. This encourages the model to
   learn sparse, interpretable representations at every layer.

The net effect is that each forward pass solves a structured optimization
problem: *compress* and *sparsify* the representation, layer by layer. The
resulting internal representations are significantly more interpretable than
those of standard transformers.

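The alternation above can be sketched in PyTorch. This is an illustrative toy layer, not the nanochat implementation: the real MSSA and ODL updates carry step-size scalings and tied dictionary weights (D and Dᵀ) that are omitted here, and all module names are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrateLayerSketch(nn.Module):
    """Toy sketch of one CRATE-style layer (illustrative only).

    MSSA: Q, K, and V all come from a single tied projection U (compression).
    ODL:  expand into a 4x overcomplete dictionary, apply ReLU, project back
          (sparsification). Step sizes and weight tying from the actual
          derivation are omitted for brevity.
    """
    def __init__(self, d: int = 768, heads: int = 6, expand: int = 4):
        super().__init__()
        assert d % heads == 0
        self.heads, self.hd = heads, d // heads
        self.U = nn.Linear(d, d, bias=False)    # tied projection: Q = K = V = U(x)
        self.out = nn.Linear(d, d, bias=False)
        self.D = nn.Linear(d, expand * d)       # into overcomplete dictionary space
        self.Dt = nn.Linear(expand * d, d)      # back to model dimension
        self.ln1, self.ln2 = nn.LayerNorm(d), nn.LayerNorm(d)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, d = x.shape
        # MSSA: compress token representations into per-head subspaces
        z = self.U(self.ln1(x)).view(B, T, self.heads, self.hd).transpose(1, 2)
        a = F.scaled_dot_product_attention(z, z, z, is_causal=True)
        x = x + self.out(a.transpose(1, 2).reshape(B, T, d))
        # ODL: sparsify in the overcomplete dictionary, with residual connection
        return x + self.Dt(F.relu(self.D(self.ln2(x))))
```

Note how the tied `U` makes attention a projection onto learned subspaces rather than three independent maps, and how the ODL step is structurally an MLP with a residual.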
### Why ReLU Instead of Soft-Thresholding?

The original CRATE paper (NeurIPS 2023) used ISTA-style **soft-thresholding**
as the sparse activation: \(S_\lambda(x) = \text{sign}(x) \cdot \max(|x| - \lambda, 0)\).
This is the theoretically "correct" proximal operator for L1-regularized sparse
coding, but it caused training instability at scale.

CRATE-α (NeurIPS 2024) introduced three modifications that enable scaling:

| Change | Vanilla CRATE | CRATE-α |
|--------|---------------|---------|
| Dictionary | Complete (d × d) | Overcomplete (d × 4d) |
| Activation | Soft-threshold | **ReLU** with learnable bias |
| Sparse block | No residual | **Residual connection** |

**ReLU** works better for scaling because: (a) it has a well-behaved gradient
everywhere (no sign discontinuity), (b) the learnable threshold/bias allows
each neuron to adaptively set its own sparsity level during training, and
(c) ReLU is heavily optimized in GPU kernels. The resulting ODL block looks
structurally similar to a standard MLP -- but it is *derived from* sparse-coding
principles rather than heuristically chosen, giving it a principled
interpretation as dictionary learning.

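The difference between the two activations is easy to see numerically. A small sketch (`soft_threshold` and `relu_with_bias` are illustrative helpers, not library functions):

```python
import torch

def soft_threshold(x: torch.Tensor, lam: float) -> torch.Tensor:
    """ISTA proximal operator for the L1 penalty: sign(x) * max(|x| - lam, 0).
    Symmetric around zero, with a fixed threshold lam."""
    return torch.sign(x) * torch.clamp(x.abs() - lam, min=0.0)

def relu_with_bias(x: torch.Tensor, bias: torch.Tensor) -> torch.Tensor:
    """CRATE-alpha style sparse activation: a per-neuron learnable threshold."""
    return torch.relu(x - bias)

x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])
soft_threshold(x, 1.0)                # tensor([-1., 0., 0., 0., 1.])
relu_with_bias(x, torch.tensor(1.0))  # tensor([0., 0., 0., 0., 1.])
```

Soft-thresholding shrinks both signs symmetrically; ReLU-with-bias zeros everything below the (learnable) threshold, which is the one-sided behavior that trains stably at scale.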
+
## Evaluation: MMLU
|
| 63 |
+
|
| 64 |
+
This model is evaluated against **MMLU** (Massive Multitask Language
|
| 65 |
+
Understanding), a benchmark of 57 subjects spanning STEM, humanities, social
|
| 66 |
+
sciences, and professional domains. MMLU tests the model's ability to answer
|
| 67 |
+
multiple-choice questions requiring world knowledge and reasoning -- from
|
| 68 |
+
abstract algebra and anatomy to US foreign policy and virology. It provides a
|
| 69 |
+
broad signal for how much general knowledge the model has absorbed during
|
| 70 |
+
pre-training.
|
| 71 |
+
|
| 72 |
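A common way to score a base LM on MMLU is to compare the next-token probabilities of the four answer letters after a prompt ending in `Answer:`. A minimal sketch -- the `model`/`tokenizer` call signatures here are assumptions for illustration, not nanochat's actual API:

```python
import torch

def score_choices(model, tokenizer, prompt, choices=("A", "B", "C", "D")):
    """Return the answer letter the model assigns the highest probability.

    `prompt` is a formatted MMLU question ending in "Answer:". The encode()
    and model-call interfaces are assumed for this sketch.
    """
    ids = torch.tensor([tokenizer.encode(prompt)])
    with torch.no_grad():
        logits = model(ids)[0, -1]                        # next-token logits
    letter_ids = [tokenizer.encode(" " + c)[-1] for c in choices]
    return choices[int(torch.argmax(logits[letter_ids]))]
```

Accuracy over the benchmark is then just the fraction of questions where the picked letter matches the gold answer.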
+
## Baseline for Self-Supervised Experiments
|
| 73 |
+
|
| 74 |
+
This checkpoint is the starting point for a multi-stage experimental pipeline:
|
| 75 |
+
|
| 76 |
+
```
|
| 77 |
+
crate-d12-base (this model)
|
| 78 |
+
\u251c\u2500\u2500 \u2192 Code self-supervised (learn structural patterns from code)
|
| 79 |
+
\u2502 \u2514\u2500\u2500 \u2192 Mid-training (adapt to chat/instruction format)
|
| 80 |
+
\u2502 \u2514\u2500\u2500 \u2192 General self-supervised (broad knowledge via SmolTalk)
|
| 81 |
+
\u2502 \u2514\u2500\u2500 \u2192 Math self-supervised (reasoning via GSM8K)
|
| 82 |
+
\u2502 \u2514\u2500\u2500 \u2192 Chat SFT (final instruction tuning)
|
| 83 |
+
\u251c\u2500\u2500 \u2192 Direct mid-training (comparison branch)
|
| 84 |
+
\u2514\u2500\u2500 \u2192 Other experimental forks
|
| 85 |
+
```
|
| 86 |
+
|
| 87 |
+
The self-supervised stages use **pseudo-labeling**: the model generates candidate
|
| 88 |
+
responses for unlabeled prompts, scores them by confidence (average log-probability)
|
| 89 |
+
or task reward, filters to the highest-quality candidates, and trains on the
|
| 90 |
+
result. This loop can be iterated multiple times, progressively improving the
|
| 91 |
+
model's own training signal.
|
| 92 |
+
|
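One round of that loop could be sketched as follows. This is illustrative only: `model.generate` returning a response together with its per-token log-probabilities is an assumed helper interface, not nanochat's actual API.

```python
def pseudo_label_round(model, prompts, k=8, keep_frac=0.25):
    """One pseudo-labeling round: sample k candidates per prompt, score each
    by average token log-probability, and keep the top fraction for training.

    `model.generate(prompt) -> (text, token_logprobs)` is an assumed helper.
    """
    scored = []
    for prompt in prompts:
        for _ in range(k):
            response, token_logprobs = model.generate(prompt)
            confidence = sum(token_logprobs) / max(len(token_logprobs), 1)
            scored.append((confidence, prompt, response))
    scored.sort(key=lambda t: t[0], reverse=True)    # most confident first
    n_keep = max(1, int(len(scored) * keep_frac))
    return [(p, r) for _, p, r in scored[:n_keep]]
```

Iterating this function, retraining between rounds, is what lets the model progressively sharpen its own training signal.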
The hypothesis driving the pipeline order is that learning **code structure
first** (syntax, nesting, logical flow) provides transferable structural priors
that benefit subsequent natural language learning -- the model learns "systems
of systems" thinking from code before encountering sentence structure and
general knowledge.

## Model Details

| Parameter | Value |
|-----------|-------|
| Architecture | CRATE-α |
| Layers | 12 |
| Hidden dim | 768 |
| Attention heads | 6 |
| Vocab size | 50304 |
| Max sequence length | 1024 |
| Window pattern | SSSL |
| ODL expansion | 4× (overcomplete dictionary) |
| Sparse activation | ReLU with learnable threshold |
| Training step | 20,000 |
| Validation BPB | 1.1131 |
| Smooth train loss | 3.7495 |

```python
import torch
from nanochat.checkpoint_manager import build_model

model, tokenizer, meta = build_model("path/to/downloaded/dir", step=20000, device=torch.device("cuda"), phase="eval")
```

## References

- Yu et al., "White-Box Transformers via Sparse Rate Reduction" (NeurIPS 2023) -- original CRATE
- Yang et al., "Scaling White-Box Transformers for Vision" (NeurIPS 2024) -- CRATE-α
- Hendrycks et al., "Measuring Massive Multitask Language Understanding" (ICLR 2021) -- MMLU

## License
|
| 142 |
|
| 143 |
This model is released under the **MIT License**.
|
|
|
|
Built on:
- [nanochat](https://github.com/karpathy/nanochat) by Andrej Karpathy -- MIT License, Copyright (c) 2025
- [CRATE](https://github.com/Ma-Lab-Berkeley/CRATE) (White-Box Transformers via Sparse Rate Reduction) by Ma-Lab-Berkeley -- MIT License, Copyright (c) 2023
- [CRATE-α](https://github.com/UCSC-VLAA/CRATE-alpha) (Scaling White-Box Transformers for Vision) by UCSC-VLAA