throbbey committed on
Commit bd7abe2 · verified · 1 Parent(s): 1f52a7e

Fix Unicode characters, add nanochat-crate-a fork reference

Files changed (1): README.md (+26, -20)
README.md CHANGED
@@ -9,10 +9,15 @@ license: mit
 
 # crate-d12-base
 
-A **CRATE-\u03b1** (Coding RAte reduction TransformEr) language model trained with
-[nanochat](https://github.com/karpathy/nanochat). This checkpoint serves as the
-**baseline** for a series of experiments exploring self-supervised learning for
-mid-training and fine-tuning with the CRATE architecture.
+A **CRATE-α** (Coding RAte reduction TransformEr) language model trained with
+[nanochat-crate-a](https://github.com/modularflow/nanochat-crate-a), a fork of
+[nanochat](https://github.com/karpathy/nanochat) that integrates the CRATE
+white-box transformer architecture, SDPA/Flash Attention, and a self-supervised
+pseudo-labeling pipeline for domain-specific mid-training and fine-tuning.
+
+This checkpoint serves as the **baseline** for a series of experiments exploring
+self-supervised learning for mid-training and fine-tuning with the CRATE
+architecture.
 
 ## What is CRATE?
 
@@ -27,7 +32,7 @@ two operations:
    is compressing token representations into low-dimensional subspaces.
 
 2. **ODL (Overcomplete Dictionary Learning)** -- a *sparsification* step that
-   projects tokens into an overcomplete dictionary space (4\u00d7 expansion),
+   projects tokens into an overcomplete dictionary space (4× expansion),
    applies a sparse activation, and projects back. This encourages the model to
    learn sparse, interpretable representations at every layer.
 
@@ -39,15 +44,15 @@ those of standard transformers.
 ### Why ReLU Instead of Soft-Thresholding?
 
 The original CRATE paper (NeurIPS 2023) used ISTA-style **soft-thresholding**
-as the sparse activation: \(S_\lambda(x) = \text{sign}(x) \cdot \max(|x| - \lambda, 0)\).
+as the sparse activation: `S_lambda(x) = sign(x) * max(|x| - lambda, 0)`.
 This is the theoretically "correct" proximal operator for L1-regularized sparse
 coding, but it caused training instability at scale.
 
-CRATE-\u03b1 (NeurIPS 2024) introduced three modifications that enable scaling:
+CRATE-α (NeurIPS 2024) introduced three modifications that enable scaling:
 
-| Change | Vanilla CRATE | CRATE-\u03b1 |
+| Change | Vanilla CRATE | CRATE-α |
 |--------|--------------|------------|
-| Dictionary | Complete (d \u00d7 d) | Overcomplete (d \u00d7 4d) |
+| Dictionary | Complete (d × d) | Overcomplete (d × 4d) |
 | Activation | Soft-threshold | **ReLU** with learnable bias |
 | Sparse block | No residual | **Residual connection** |
 
@@ -75,13 +80,13 @@ This checkpoint is the starting point for a multi-stage experimental pipeline:
 
 ```
 crate-d12-base (this model)
-\u251c\u2500\u2500 \u2192 Code self-supervised (learn structural patterns from code)
-\u2502 \u2514\u2500\u2500 \u2192 Mid-training (adapt to chat/instruction format)
-\u2502 \u2514\u2500\u2500 \u2192 General self-supervised (broad knowledge via SmolTalk)
-\u2502 \u2514\u2500\u2500 \u2192 Math self-supervised (reasoning via GSM8K)
-\u2502 \u2514\u2500\u2500 \u2192 Chat SFT (final instruction tuning)
-\u251c\u2500\u2500 \u2192 Direct mid-training (comparison branch)
-\u2514\u2500\u2500 \u2192 Other experimental forks
+├── → Code self-supervised (learn structural patterns from code)
+│ └── → Mid-training (adapt to chat/instruction format)
+│ └── → General self-supervised (broad knowledge via SmolTalk)
+│ └── → Math self-supervised (reasoning via GSM8K)
+│ └── → Chat SFT (final instruction tuning)
+├── → Direct mid-training (comparison branch)
+└── → Other experimental forks
 ```
 
 The self-supervised stages use **pseudo-labeling**: the model generates candidate
@@ -100,14 +105,14 @@ general knowledge.
 
 | Parameter | Value |
 |-----------|-------|
-| Architecture | CRATE-\u03b1 |
+| Architecture | CRATE-α |
 | Layers | 12 |
 | Hidden dim | 768 |
 | Attention heads | 6 |
 | Vocab size | 50304 |
 | Max sequence length | 1024 |
 | Window pattern | SSSL |
-| ODL expansion | 4\u00d7 (overcomplete dictionary) |
+| ODL expansion | 4× (overcomplete dictionary) |
 | Sparse activation | ReLU with learnable threshold |
 | Training step | 20,000 |
 | Validation BPB | 1.1131 |
@@ -135,7 +140,7 @@ model, tokenizer, meta = build_model("path/to/downloaded/dir", step=20000, devic
 ## References
 
 - Yu et al., "White-Box Transformers via Sparse Rate Reduction" (NeurIPS 2023) -- original CRATE
-- Yang et al., "Scaling White-Box Transformers for Vision" (NeurIPS 2024) -- CRATE-\u03b1
+- Yang et al., "Scaling White-Box Transformers for Vision" (NeurIPS 2024) -- CRATE-α
 - Hendrycks et al., "Measuring Massive Multitask Language Understanding" (ICLR 2021) -- MMLU
 
 ## License
@@ -143,6 +148,7 @@ model, tokenizer, meta = build_model("path/to/downloaded/dir", step=20000, devic
 This model is released under the **MIT License**.
 
 Built on:
+- [nanochat-crate-a](https://github.com/modularflow/nanochat-crate-a) -- CRATE integration, self-supervised pipeline, SDPA/Flash Attention
 - [nanochat](https://github.com/karpathy/nanochat) by Andrej Karpathy -- MIT License, Copyright (c) 2025
 - [CRATE](https://github.com/Ma-Lab-Berkeley/CRATE) (White-Box Transformers via Sparse Rate Reduction) by Ma-Lab-Berkeley -- MIT License, Copyright (c) 2023
-- [CRATE-\u03b1](https://github.com/UCSC-VLAA/CRATE-alpha) (Scaling White-Box Transformers for Vision) by UCSC-VLAA
+- [CRATE-α](https://github.com/UCSC-VLAA/CRATE-alpha) (Scaling White-Box Transformers for Vision) by UCSC-VLAA
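
The activation change the README contrasts (soft-thresholding vs. ReLU with a learnable threshold) can be sketched in plain Python on scalar inputs. Function names here are illustrative, not taken from the nanochat-crate-a code:

```python
def soft_threshold(x: float, lam: float) -> float:
    # ISTA proximal operator for the L1 penalty, as in vanilla CRATE:
    # S_lambda(x) = sign(x) * max(|x| - lambda, 0)
    sign = 1.0 if x >= 0 else -1.0
    return sign * max(abs(x) - lam, 0.0)


def relu_threshold(x: float, bias: float) -> float:
    # CRATE-alpha replacement: a one-sided ReLU with a learnable
    # threshold (bias); negative coefficients are dropped entirely.
    return max(x - bias, 0.0)
```

Both operators zero out small coefficients; the ReLU variant additionally discards the negative half-space, which the CRATE-α work reports as the more stable choice at scale.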
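
The ODL step described above (lift into a 4× overcomplete dictionary space, sparsify, project back, residual) can be sketched in dependency-free Python. This is a minimal sketch: names are hypothetical, and a real implementation uses learned weight matrices and batched tensor operations:

```python
def odl_block(x, dictionary, bias):
    """Sketch of one overcomplete-dictionary sparsification step.

    x          -- one token representation, a list of d floats
    dictionary -- 4*d rows of d floats (the overcomplete dictionary)
    bias       -- scalar ReLU threshold (learnable in the real model)
    """
    d = len(x)
    assert len(dictionary) == 4 * d  # 4x expansion, as in CRATE-alpha

    # 1. project the token into the overcomplete dictionary space
    z = [sum(row[j] * x[j] for j in range(d)) for row in dictionary]
    # 2. sparsify with ReLU and the learnable threshold
    z = [max(v - bias, 0.0) for v in z]
    # 3. project back via the transposed dictionary, with the
    #    residual connection that CRATE-alpha adds to this block
    return [x[j] + sum(dictionary[k][j] * z[k] for k in range(4 * d))
            for j in range(d)]
```

With an all-zero dictionary the residual path makes the block an identity map, which is one reason the residual connection eases optimization.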
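
The architecture table can be mirrored as a small config object for downstream scripts; field names below are illustrative guesses, not the fork's actual config class:

```python
from dataclasses import dataclass


@dataclass
class CrateD12Config:
    # values copied from the model card table
    n_layer: int = 12
    n_embd: int = 768          # hidden dim; head dim = 768 / 6 = 128
    n_head: int = 6
    vocab_size: int = 50304
    max_seq_len: int = 1024
    window_pattern: str = "SSSL"
    odl_expansion: int = 4     # overcomplete dictionary factor


cfg = CrateD12Config()
assert cfg.n_embd % cfg.n_head == 0  # heads must divide the hidden dim
```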