Comprehensive model card: CRATE architecture, MMLU, ReLU scaling, experiment baseline
README.md
---
tags:
- nanochat
- crate
- white-box
- sparse-coding
license: mit
---

# crate-d12-base

A **CRATE-α** (Coding RAte reduction TransformEr) language model trained with
[nanochat](https://github.com/karpathy/nanochat). This checkpoint serves as the
**baseline** for a series of experiments exploring self-supervised learning for
mid-training and fine-tuning with the CRATE architecture.

## What is CRATE?

CRATE is a **white-box transformer** -- unlike standard transformers, whose
architecture is heuristically designed, every layer of CRATE is mathematically
derived from a principled optimization objective. Each layer alternates between
two operations:

1. **MSSA (Multi-Head Subspace Self-Attention)** -- a *compression* step that
   performs gradient descent on the *coding rate reduction* objective. Q, K, and
   V share a single tied projection matrix, which means the attention operation
   is compressing token representations into low-dimensional subspaces.

2. **ODL (Overcomplete Dictionary Learning)** -- a *sparsification* step that
   projects tokens into an overcomplete dictionary space (4× expansion),
   applies a sparse activation, and projects back. This encourages the model to
   learn sparse, interpretable representations at every layer.

The net effect is that each forward pass solves a structured optimization
problem: *compress* and *sparsify* the representation, layer by layer. The
resulting internal representations are significantly more interpretable than
those of standard transformers.

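The alternation above can be sketched in PyTorch. This is an illustrative toy layer, not the nanochat implementation: the real MSSA and ODL updates carry step-size scalings and tied dictionary weights (D and Dᵀ) that are omitted here, and all module names are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrateLayerSketch(nn.Module):
    """Toy sketch of one CRATE-style layer (illustrative only).

    MSSA: Q, K, and V all come from a single tied projection U (compression).
    ODL:  expand into a 4x overcomplete dictionary, apply ReLU, project back
          (sparsification). Step sizes and weight tying from the actual
          derivation are omitted for brevity.
    """
    def __init__(self, d: int = 768, heads: int = 6, expand: int = 4):
        super().__init__()
        assert d % heads == 0
        self.heads, self.hd = heads, d // heads
        self.U = nn.Linear(d, d, bias=False)    # tied projection: Q = K = V = U(x)
        self.out = nn.Linear(d, d, bias=False)
        self.D = nn.Linear(d, expand * d)       # into overcomplete dictionary space
        self.Dt = nn.Linear(expand * d, d)      # back to model dimension
        self.ln1, self.ln2 = nn.LayerNorm(d), nn.LayerNorm(d)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, d = x.shape
        # MSSA: compress token representations into per-head subspaces
        z = self.U(self.ln1(x)).view(B, T, self.heads, self.hd).transpose(1, 2)
        a = F.scaled_dot_product_attention(z, z, z, is_causal=True)
        x = x + self.out(a.transpose(1, 2).reshape(B, T, d))
        # ODL: sparsify in the overcomplete dictionary, with residual connection
        return x + self.Dt(F.relu(self.D(self.ln2(x))))
```

Note how the tied `U` makes attention a projection onto learned subspaces rather than three independent maps, and how the ODL step is structurally an MLP with a residual.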
### Why ReLU Instead of Soft-Thresholding?

The original CRATE paper (NeurIPS 2023) used ISTA-style **soft-thresholding**
as the sparse activation: \(S_\lambda(x) = \text{sign}(x) \cdot \max(|x| - \lambda, 0)\).
This is the theoretically "correct" proximal operator for L1-regularized sparse
coding, but it caused training instability at scale.

CRATE-α (NeurIPS 2024) introduced three modifications that enable scaling:

| Change | Vanilla CRATE | CRATE-α |
|--------|---------------|---------|
| Dictionary | Complete (d × d) | Overcomplete (d × 4d) |
| Activation | Soft-threshold | **ReLU** with learnable bias |
| Sparse block | No residual | **Residual connection** |

**ReLU** works better for scaling because: (a) it has a well-behaved gradient
everywhere (no sign discontinuity), (b) the learnable threshold/bias allows
each neuron to adaptively set its own sparsity level during training, and
(c) ReLU is heavily optimized in GPU kernels. The resulting ODL block looks
structurally similar to a standard MLP -- but it is *derived from* sparse-coding
principles rather than heuristically chosen, giving it a principled
interpretation as dictionary learning.

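The difference between the two activations is easy to see numerically. A small sketch (`soft_threshold` and `relu_with_bias` are illustrative helpers, not library functions):

```python
import torch

def soft_threshold(x: torch.Tensor, lam: float) -> torch.Tensor:
    """ISTA proximal operator for the L1 penalty: sign(x) * max(|x| - lam, 0).
    Symmetric around zero, with a fixed threshold lam."""
    return torch.sign(x) * torch.clamp(x.abs() - lam, min=0.0)

def relu_with_bias(x: torch.Tensor, bias: torch.Tensor) -> torch.Tensor:
    """CRATE-alpha style sparse activation: a per-neuron learnable threshold."""
    return torch.relu(x - bias)

x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])
soft_threshold(x, 1.0)                # tensor([-1., 0., 0., 0., 1.])
relu_with_bias(x, torch.tensor(1.0))  # tensor([0., 0., 0., 0., 1.])
```

Soft-thresholding shrinks both signs symmetrically; ReLU-with-bias zeros everything below the (learnable) threshold, which is the one-sided behavior that trains stably at scale.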
+
## Evaluation: MMLU
|
| 63 |
+
|
| 64 |
+
This model is evaluated against **MMLU** (Massive Multitask Language
|
| 65 |
+
Understanding), a benchmark of 57 subjects spanning STEM, humanities, social
|
| 66 |
+
sciences, and professional domains. MMLU tests the model's ability to answer
|
| 67 |
+
multiple-choice questions requiring world knowledge and reasoning -- from
|
| 68 |
+
abstract algebra and anatomy to US foreign policy and virology. It provides a
|
| 69 |
+
broad signal for how much general knowledge the model has absorbed during
|
| 70 |
+
pre-training.
|
| 71 |
+
|
| 72 |
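A common way to score a base LM on MMLU is to compare the next-token probabilities of the four answer letters after a prompt ending in `Answer:`. A minimal sketch -- the `model`/`tokenizer` call signatures here are assumptions for illustration, not nanochat's actual API:

```python
import torch

def score_choices(model, tokenizer, prompt, choices=("A", "B", "C", "D")):
    """Return the answer letter the model assigns the highest probability.

    `prompt` is a formatted MMLU question ending in "Answer:". The encode()
    and model-call interfaces are assumed for this sketch.
    """
    ids = torch.tensor([tokenizer.encode(prompt)])
    with torch.no_grad():
        logits = model(ids)[0, -1]                        # next-token logits
    letter_ids = [tokenizer.encode(" " + c)[-1] for c in choices]
    return choices[int(torch.argmax(logits[letter_ids]))]
```

Accuracy over the benchmark is then just the fraction of questions where the picked letter matches the gold answer.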
+
## Baseline for Self-Supervised Experiments
|
| 73 |
+
|
| 74 |
+
This checkpoint is the starting point for a multi-stage experimental pipeline:
|
| 75 |
+
|
| 76 |
+
```
|
| 77 |
+
crate-d12-base (this model)
|
| 78 |
+
\u251c\u2500\u2500 \u2192 Code self-supervised (learn structural patterns from code)
|
| 79 |
+
\u2502 \u2514\u2500\u2500 \u2192 Mid-training (adapt to chat/instruction format)
|
| 80 |
+
\u2502 \u2514\u2500\u2500 \u2192 General self-supervised (broad knowledge via SmolTalk)
|
| 81 |
+
\u2502 \u2514\u2500\u2500 \u2192 Math self-supervised (reasoning via GSM8K)
|
| 82 |
+
\u2502 \u2514\u2500\u2500 \u2192 Chat SFT (final instruction tuning)
|
| 83 |
+
\u251c\u2500\u2500 \u2192 Direct mid-training (comparison branch)
|
| 84 |
+
\u2514\u2500\u2500 \u2192 Other experimental forks
|
| 85 |
+
```
|
| 86 |
+
|
| 87 |
+
The self-supervised stages use **pseudo-labeling**: the model generates candidate
|
| 88 |
+
responses for unlabeled prompts, scores them by confidence (average log-probability)
|
| 89 |
+
or task reward, filters to the highest-quality candidates, and trains on the
|
| 90 |
+
result. This loop can be iterated multiple times, progressively improving the
|
| 91 |
+
model's own training signal.
|
| 92 |
+
|
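One round of that loop could be sketched as follows. This is illustrative only: `model.generate` returning a response together with its per-token log-probabilities is an assumed helper interface, not nanochat's actual API.

```python
def pseudo_label_round(model, prompts, k=8, keep_frac=0.25):
    """One pseudo-labeling round: sample k candidates per prompt, score each
    by average token log-probability, and keep the top fraction for training.

    `model.generate(prompt) -> (text, token_logprobs)` is an assumed helper.
    """
    scored = []
    for prompt in prompts:
        for _ in range(k):
            response, token_logprobs = model.generate(prompt)
            confidence = sum(token_logprobs) / max(len(token_logprobs), 1)
            scored.append((confidence, prompt, response))
    scored.sort(key=lambda t: t[0], reverse=True)    # most confident first
    n_keep = max(1, int(len(scored) * keep_frac))
    return [(p, r) for _, p, r in scored[:n_keep]]
```

Iterating this function, retraining between rounds, is what lets the model progressively sharpen its own training signal.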
The hypothesis driving the pipeline order is that learning **code structure
first** (syntax, nesting, logical flow) provides transferable structural priors
that benefit subsequent natural language learning -- the model learns "systems
of systems" thinking from code before encountering sentence structure and
general knowledge.

## Model Details

| Parameter | Value |
|-----------|-------|
| Architecture | CRATE-α |
| Layers | 12 |
| Hidden dim | 768 |
| Attention heads | 6 |
| Vocab size | 50304 |
| Max sequence length | 1024 |
| Window pattern | SSSL |
| ODL expansion | 4× (overcomplete dictionary) |
| Sparse activation | ReLU with learnable threshold |
| Training step | 20,000 |
| Validation BPB | 1.1131 |
| Smooth train loss | 3.7495 |

```python
import torch
from nanochat.checkpoint_manager import build_model

model, tokenizer, meta = build_model("path/to/downloaded/dir", step=20000, device=torch.device("cuda"), phase="eval")
```

## References

- Yu et al., "White-Box Transformers via Sparse Rate Reduction" (NeurIPS 2023) -- original CRATE
- Yang et al., "Scaling White-Box Transformers for Vision" (NeurIPS 2024) -- CRATE-α
- Hendrycks et al., "Measuring Massive Multitask Language Understanding" (ICLR 2021) -- MMLU

## License
|
| 142 |
|
| 143 |
This model is released under the **MIT License**.
|
|
|
|
Built on:
- [nanochat](https://github.com/karpathy/nanochat) by Andrej Karpathy -- MIT License, Copyright (c) 2025
- [CRATE](https://github.com/Ma-Lab-Berkeley/CRATE) (White-Box Transformers via Sparse Rate Reduction) by Ma-Lab-Berkeley -- MIT License, Copyright (c) 2023
- [CRATE-α](https://github.com/UCSC-VLAA/CRATE-alpha) (Scaling White-Box Transformers for Vision) by UCSC-VLAA