throbbey committed
Commit 1f52a7e · verified · 1 Parent(s): caa22cd

Comprehensive model card: CRATE architecture, MMLU, ReLU scaling, experiment baseline

Files changed (1): README.md +98 -4

README.md CHANGED
@@ -2,25 +2,113 @@
  tags:
  - nanochat
  - crate
  license: mit
  ---

  # crate-d12-base

- A CRATE (Coding RAte reduction TransformEr) language model
- trained with [nanochat](https://github.com/karpathy/nanochat).

  ## Model Details

  | Parameter | Value |
  |-----------|-------|
- | Architecture | CRATE |
  | Layers | 12 |
  | Hidden dim | 768 |
  | Attention heads | 6 |
  | Vocab size | 50304 |
  | Max sequence length | 1024 |
  | Window pattern | SSSL |
  | Training step | 20,000 |
  | Validation BPB | 1.1131 |
  | Smooth train loss | 3.7495 |
@@ -44,6 +132,12 @@ from nanochat.checkpoint_manager import build_model
  model, tokenizer, meta = build_model("path/to/downloaded/dir", step=20000, device=torch.device("cuda"), phase="eval")
  ```

  ## License

  This model is released under the **MIT License**.
@@ -51,4 +145,4 @@ This model is released under the **MIT License**.
  Built on:
  - [nanochat](https://github.com/karpathy/nanochat) by Andrej Karpathy -- MIT License, Copyright (c) 2025
  - [CRATE](https://github.com/Ma-Lab-Berkeley/CRATE) (White-Box Transformers via Sparse Rate Reduction) by Ma-Lab-Berkeley -- MIT License, Copyright (c) 2023
- - [CRATE-alpha](https://github.com/UCSC-VLAA/CRATE-alpha) (Scaling White-Box Transformers for Vision) by UCSC-VLAA

  tags:
  - nanochat
  - crate
+ - white-box
+ - sparse-coding
  license: mit
  ---

  # crate-d12-base

+ A **CRATE-α** (Coding RAte reduction TransformEr) language model trained with
+ [nanochat](https://github.com/karpathy/nanochat). This checkpoint serves as the
+ **baseline** for a series of experiments exploring self-supervised learning for
+ mid-training and fine-tuning with the CRATE architecture.
+
+ ## What is CRATE?
+
+ CRATE is a **white-box transformer** -- unlike standard transformers, whose
+ architecture is heuristically designed, every layer of CRATE is mathematically
+ derived from a principled optimization objective. Each layer alternates between
+ two operations:
+
+ 1. **MSSA (Multi-Head Subspace Self-Attention)** -- a *compression* step that
+    performs gradient descent on the *coding rate reduction* objective. Q, K, and
+    V share a single tied projection matrix, so the attention operation
+    compresses token representations into low-dimensional subspaces.
+
+ 2. **ODL (Overcomplete Dictionary Learning)** -- a *sparsification* step that
+    projects tokens into an overcomplete dictionary space (4× expansion),
+    applies a sparse activation, and projects back. This encourages the model to
+    learn sparse, interpretable representations at every layer.
+
+ The net effect is that each forward pass solves a structured optimization
+ problem: *compress* and *sparsify* the representation, layer by layer. The
+ resulting internal representations are significantly more interpretable than
+ those of standard transformers.
+
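The compress-then-sparsify structure described above can be sketched in plain Python. This is a toy, stdlib-only illustration, not the nanochat implementation: the tiny dimensions, the random tied projection `U`, the dictionary `D`, and the fixed `bias` are all invented for the example.

```python
import math
import random

random.seed(0)

def matvec(M, x):
    # matrix-vector product for nested-list matrices
    return [sum(a * b for a, b in zip(row, x)) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

def softmax(xs):
    m = max(xs)
    es = [math.exp(v - m) for v in xs]
    s = sum(es)
    return [e / s for e in es]

d, n_dict = 4, 16  # toy sizes; the real model uses d=768 with a 4x dictionary
def rand_mat(r, c):
    return [[random.gauss(0, 1 / math.sqrt(c)) for _ in range(c)] for _ in range(r)]

U = rand_mat(d, d)       # single tied projection: Q = K = V = U @ x
D = rand_mat(d, n_dict)  # overcomplete dictionary (4x expansion)
bias = [0.05] * n_dict   # learnable threshold in the real model

def mssa(tokens):
    """Compression step (single-head toy): Q, K, V all use the same U."""
    proj = [matvec(U, t) for t in tokens]  # tied QKV projection
    out = []
    for q in proj:
        scores = softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                          for k in proj])
        out.append([sum(w * v[i] for w, v in zip(scores, proj)) for i in range(d)])
    # residual update, loosely standing in for the gradient step on coding rate
    return [[t_i + o_i for t_i, o_i in zip(t, o)] for t, o in zip(tokens, out)]

def odl(tokens):
    """Sparsification step: expand into the dictionary, ReLU(z - bias), project back."""
    out = []
    for t in tokens:
        z = matvec(transpose(D), t)                              # expand to 4d
        z = [max(z_i - b, 0.0) for z_i, b in zip(z, bias)]       # sparse activation
        back = matvec(D, z)                                      # project back to d
        out.append([t_i + b_i for t_i, b_i in zip(t, back)])     # residual (CRATE-alpha)
    return out

tokens = [[random.gauss(0, 1) for _ in range(d)] for _ in range(3)]
tokens = odl(mssa(tokens))  # one CRATE layer = compress, then sparsify
print(len(tokens), len(tokens[0]))  # 3 4
```

Shapes are preserved end to end, just as in a standard transformer block; only the derivation of each sub-layer differs.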
+ ### Why ReLU Instead of Soft-Thresholding?
+
+ The original CRATE paper (NeurIPS 2023) used ISTA-style **soft-thresholding**
+ as the sparse activation: \(S_\lambda(x) = \text{sign}(x) \cdot \max(|x| - \lambda, 0)\).
+ This is the theoretically "correct" proximal operator for L1-regularized sparse
+ coding, but it caused training instability at scale.
+
+ CRATE-α (NeurIPS 2024) introduced three modifications that enable scaling:
+
+ | Change | Vanilla CRATE | CRATE-α |
+ |--------|---------------|---------|
+ | Dictionary | Complete (d × d) | Overcomplete (d × 4d) |
+ | Activation | Soft-threshold | **ReLU** with learnable bias |
+ | Sparse block | No residual | **Residual connection** |
+
+ **ReLU** works better at scale because: (a) it has a well-behaved gradient
+ everywhere (no sign discontinuity), (b) the learnable threshold/bias lets
+ each neuron adaptively set its own sparsity level during training, and
+ (c) ReLU is heavily optimized in GPU kernels. The resulting ODL block is
+ structurally similar to a standard MLP -- but it is *derived from* sparse-coding
+ principles rather than chosen heuristically, giving it a principled
+ interpretation as dictionary learning.
+
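The two activations differ only in how they treat the negative half-space, which a minimal sketch makes concrete (the threshold values here are arbitrary):

```python
def soft_threshold(x, lam):
    """ISTA proximal operator for the L1 penalty: sign(x) * max(|x| - lam, 0)."""
    return (1 if x > 0 else -1) * max(abs(x) - lam, 0.0)

def relu_bias(x, b):
    """CRATE-alpha sparse activation: ReLU with a learnable bias/threshold."""
    return max(x - b, 0.0)

# Both zero out small activations, but soft-thresholding shrinks large values
# symmetrically, while ReLU discards the negative half-space entirely.
print([soft_threshold(v, 0.5) for v in (-2.0, -0.2, 0.3, 2.0)])  # [-1.5, -0.0, 0.0, 1.5]
print([relu_bias(v, 0.5) for v in (-2.0, -0.2, 0.3, 2.0)])       # [0.0, 0.0, 0.0, 1.5]
```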
+ ## Evaluation: MMLU
+
+ This model is evaluated on **MMLU** (Massive Multitask Language
+ Understanding), a benchmark of 57 subjects spanning STEM, the humanities,
+ the social sciences, and professional domains. MMLU tests the model's ability
+ to answer multiple-choice questions requiring world knowledge and reasoning --
+ from abstract algebra and anatomy to US foreign policy and virology. It
+ provides a broad signal for how much general knowledge the model absorbed
+ during pre-training.
+
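For a base (non-chat) model, multiple-choice benchmarks like MMLU are typically scored by likelihood: pick the answer choice whose text the model assigns the highest average token log-probability. A toy sketch of that selection rule, with invented helper names and made-up numbers (the source does not specify nanochat's exact harness):

```python
def pick_choice(logprobs_per_choice):
    """Pick the choice whose continuation has the highest average
    per-token log-probability (length-normalized likelihood scoring)."""
    avg = [sum(lp) / len(lp) for lp in logprobs_per_choice]
    return max(range(len(avg)), key=avg.__getitem__)

# Toy per-token log-probs for choices A-D (closer to 0 = more likely)
choices = [[-2.1, -3.0], [-0.4, -0.6], [-1.5], [-2.8, -2.2, -2.0]]
print("ABCD"[pick_choice(choices)])  # B
```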
+ ## Baseline for Self-Supervised Experiments
+
+ This checkpoint is the starting point for a multi-stage experimental pipeline:
+
+ ```
+ crate-d12-base (this model)
+ ├── → Code self-supervised (learn structural patterns from code)
+ │   └── → Mid-training (adapt to chat/instruction format)
+ │       └── → General self-supervised (broad knowledge via SmolTalk)
+ │           └── → Math self-supervised (reasoning via GSM8K)
+ │               └── → Chat SFT (final instruction tuning)
+ ├── → Direct mid-training (comparison branch)
+ └── → Other experimental forks
+ ```
+
+ The self-supervised stages use **pseudo-labeling**: the model generates candidate
+ responses for unlabeled prompts, scores them by confidence (average log-probability)
+ or task reward, filters to the highest-quality candidates, and trains on the
+ result. This loop can be iterated multiple times, progressively improving the
+ model's own training signal.
+
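A minimal sketch of the confidence-filtering step described above (the function names, threshold, and toy data are invented for illustration; the real pipeline may also filter by task reward):

```python
def avg_logprob(token_logprobs):
    """Confidence score: average per-token log-probability of a sampled response."""
    return sum(token_logprobs) / len(token_logprobs)

def filter_pseudo_labels(candidates, threshold=-1.0):
    """Keep only candidate responses the model is confident about.
    `candidates` maps response text -> per-token log-probs (toy data)."""
    return [text for text, lps in candidates.items() if avg_logprob(lps) >= threshold]

candidates = {
    "confident answer": [-0.2, -0.5, -0.3],
    "uncertain answer": [-2.5, -1.8, -3.1],
}
print(filter_pseudo_labels(candidates))  # ['confident answer']
```

The surviving responses become the training set for the next iteration of the loop.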
+ The hypothesis driving the pipeline order is that learning **code structure
+ first** (syntax, nesting, logical flow) provides transferable structural priors
+ that benefit subsequent natural-language learning -- the model learns "systems
+ of systems" thinking from code before encountering sentence structure and
+ general knowledge.

  ## Model Details

  | Parameter | Value |
  |-----------|-------|
+ | Architecture | CRATE-α |
  | Layers | 12 |
  | Hidden dim | 768 |
  | Attention heads | 6 |
  | Vocab size | 50304 |
  | Max sequence length | 1024 |
  | Window pattern | SSSL |
+ | ODL expansion | 4× (overcomplete dictionary) |
+ | Sparse activation | ReLU with learnable threshold |
  | Training step | 20,000 |
  | Validation BPB | 1.1131 |
  | Smooth train loss | 3.7495 |
 
  model, tokenizer, meta = build_model("path/to/downloaded/dir", step=20000, device=torch.device("cuda"), phase="eval")
  ```

+ ## References
+
+ - Yu et al., "White-Box Transformers via Sparse Rate Reduction" (NeurIPS 2023) -- original CRATE
+ - Yang et al., "Scaling White-Box Transformers for Vision" (NeurIPS 2024) -- CRATE-α
+ - Hendrycks et al., "Measuring Massive Multitask Language Understanding" (ICLR 2021) -- MMLU
+
  ## License

  This model is released under the **MIT License**.

  Built on:
  - [nanochat](https://github.com/karpathy/nanochat) by Andrej Karpathy -- MIT License, Copyright (c) 2025
  - [CRATE](https://github.com/Ma-Lab-Berkeley/CRATE) (White-Box Transformers via Sparse Rate Reduction) by Ma-Lab-Berkeley -- MIT License, Copyright (c) 2023
+ - [CRATE-α](https://github.com/UCSC-VLAA/CRATE-alpha) (Scaling White-Box Transformers for Vision) by UCSC-VLAA