---
tags:
- nanochat
- crate
- white-box
- sparse-coding
license: mit
---

# crate-d12-base

A **CRATE-α** (Coding RAte reduction TransformEr) language model trained with
[nanochat-crate-a](https://github.com/modularflow/nanochat-crate-a), a fork of
[nanochat](https://github.com/karpathy/nanochat) that integrates the CRATE
white-box transformer architecture, SDPA/Flash Attention, and a self-supervised
pseudo-labeling pipeline for domain-specific mid-training and fine-tuning.

This checkpoint serves as the **baseline** for a series of experiments exploring
self-supervised learning for mid-training and fine-tuning with the CRATE
architecture.

## What is CRATE?

CRATE is a **white-box transformer** -- unlike standard transformers where the
architecture is heuristically designed, every layer of CRATE is mathematically
derived from a principled optimization objective. Each layer alternates between
two operations:

1. **MSSA (Multi-Head Subspace Self-Attention)** -- a *compression* step that
   performs gradient descent on the *coding rate reduction* objective. Q, K, and
   V share a single tied projection matrix, which means the attention operation
   is compressing token representations into low-dimensional subspaces.

2. **ODL (Overcomplete Dictionary Learning)** -- a *sparsification* step that
   projects tokens into an overcomplete dictionary space (4× expansion),
   applies a sparse activation, and projects back. This encourages the model to
   learn sparse, interpretable representations at every layer.

The net effect is that each forward pass solves a structured optimization
problem: *compress* and *sparsify* the representation, layer by layer. The
resulting internal representations are significantly more interpretable than
those of standard transformers.
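In broad strokes, one layer can be sketched as follows. This NumPy sketch is purely illustrative -- the dimensions, scaling, and residual placement are simplifying assumptions, not the exact nanochat-crate-a implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mssa(X, U_heads):
    """Compression step: multi-head subspace self-attention where each
    head uses a single tied projection U as Q, K, and V simultaneously."""
    outs = []
    for U in U_heads:                 # U: (d, d_head)
        Z = X @ U                     # shared Q = K = V projection
        A = softmax(Z @ Z.T / np.sqrt(U.shape[1]))
        outs.append(A @ Z @ U.T)      # project back to model dim
    return X + sum(outs)              # residual around the compression

def odl(X, D, bias):
    """Sparsification step: lift into a 4x-overcomplete dictionary space,
    apply ReLU with a threshold/bias, project back with a residual."""
    Z = np.maximum(X @ D - bias, 0.0)  # sparse codes, shape (n, 4d)
    return X + Z @ D.T                 # residual reconstruction

# toy shapes: 8 tokens, model dim 16, 2 heads
rng = np.random.default_rng(0)
n, d, heads = 8, 16, 2
X = rng.standard_normal((n, d)) * 0.1
U_heads = [rng.standard_normal((d, d // heads)) * 0.1 for _ in range(heads)]
D = rng.standard_normal((d, 4 * d)) * 0.1

Y = odl(mssa(X, U_heads), D, bias=0.05)
print(Y.shape)  # (8, 16)
```

The key structural differences from a standard transformer block are visible even in this toy version: a single tied matrix per attention head instead of separate Q/K/V projections, and a dictionary `D` used for both the lift and the (transposed) projection back.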

### Why ReLU Instead of Soft-Thresholding?

The original CRATE paper (NeurIPS 2023) used ISTA-style **soft-thresholding**
as the sparse activation: `S_lambda(x) = sign(x) * max(|x| - lambda, 0)`.
This is the theoretically "correct" proximal operator for L1-regularized sparse
coding, but it caused training instability at scale. (The nanochat-crate-a
repository exposes both activations as options.)

CRATE-α (NeurIPS 2024) introduced three modifications that enable scaling:

| Change | Vanilla CRATE | CRATE-α |
|--------|---------------|---------|
| Dictionary | Complete (d × d) | Overcomplete (d × 4d) |
| Activation | Soft-threshold | **ReLU** with learnable bias |
| Sparse block | No residual | **Residual connection** |

**ReLU** works better for scaling because: (a) it has a well-behaved gradient
everywhere (no sign discontinuity), (b) the learnable threshold/bias allows
each neuron to adaptively set its own sparsity level during training, and
(c) ReLU is heavily optimized in GPU kernels. The resulting ODL block looks
structurally similar to a standard MLP -- but it is *derived from* sparse coding
principles rather than heuristically chosen, giving it a principled
interpretation as dictionary learning.
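For intuition, the two activations can be compared side by side. In this minimal sketch a scalar threshold stands in for the per-neuron learnable bias:

```python
import numpy as np

def soft_threshold(x, lam):
    """ISTA proximal operator for L1-regularized sparse coding:
    shrinks every value toward zero by lam, keeping the sign."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def relu_bias(x, b):
    """CRATE-alpha activation: ReLU with a (learnable) threshold/bias.
    Agrees with soft-thresholding on the positive side, drops the
    negative branch entirely -- so there is no sign discontinuity."""
    return np.maximum(x - b, 0.0)

x = np.array([-2.0, -0.3, 0.0, 0.3, 2.0])
st = soft_threshold(x, 0.5)  # -2.0 -> -1.5, 2.0 -> 1.5, small values -> 0
rb = relu_bias(x, 0.5)       # only 2.0 survives, as 1.5
```

Note that for inputs above the threshold the two functions coincide; the difference is entirely in how sub-threshold and negative values are handled.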

## Evaluation: MMLU

This model is evaluated against **MMLU** (Massive Multitask Language
Understanding), a benchmark of 57 subjects spanning STEM, humanities, social
sciences, and professional domains. MMLU tests the model's ability to answer
multiple-choice questions requiring world knowledge and reasoning -- from
abstract algebra and anatomy to US foreign policy and virology. It provides a
broad signal for how much general knowledge the model has absorbed during
pre-training.

## Baseline for Self-Supervised Experiments

This checkpoint is the starting point for a multi-stage experimental pipeline:

```
crate-d12-base (this model)
β”œβ”€β”€ β†’ Code self-supervised (learn structural patterns from code)
β”‚      └── β†’ Mid-training (adapt to chat/instruction format)
β”‚             └── β†’ General self-supervised (broad knowledge via SmolTalk)
β”‚                    └── β†’ Math self-supervised (reasoning via GSM8K)
β”‚                           └── β†’ Chat SFT (final instruction tuning)
β”œβ”€β”€ β†’ Direct mid-training (comparison branch)
└── β†’ Other experimental forks
```

The self-supervised stages use **pseudo-labeling**: the model generates candidate
responses for unlabeled prompts, scores them by confidence (average log-probability)
or task reward, filters to the highest-quality candidates, and trains on the
result. This loop can be iterated multiple times, progressively improving the
model's own training signal.
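The loop can be sketched in a few lines. Here `generate` and `avg_logprob` are hypothetical stand-ins for the model's sampling and scoring hooks, and the keep fraction and candidate count are illustrative:

```python
import random

def pseudo_label_round(prompts, generate, avg_logprob, keep_frac=0.25, k=4):
    """One round of self-supervised pseudo-labeling: sample k candidates
    per prompt, score each by average token log-probability, keep the
    best candidate per prompt, then keep only the top fraction of
    prompts overall as (prompt, response) training pairs."""
    scored = []
    for prompt in prompts:
        candidates = [generate(prompt) for _ in range(k)]
        best = max(candidates, key=lambda c: avg_logprob(prompt, c))
        scored.append((avg_logprob(prompt, best), prompt, best))
    scored.sort(reverse=True)                 # most confident first
    keep = scored[: max(1, int(len(scored) * keep_frac))]
    return [(p, c) for _, p, c in keep]

# toy stand-ins for the model hooks, for illustration only
random.seed(0)
generate = lambda p: p + " answer" + str(random.randint(0, 9))
avg_logprob = lambda p, c: -len(c) * 0.01     # toy confidence score

pairs = pseudo_label_round([f"q{i}" for i in range(8)], generate, avg_logprob)
print(len(pairs))  # 2 of 8 prompts survive the confidence filter
```

Iterating this round with the freshly trained model closes the loop: each pass should produce more confident candidates and therefore a cleaner training set.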

The hypothesis driving the pipeline order is that learning **code structure
first** (syntax, nesting, logical flow) provides transferable structural priors
that benefit subsequent natural language learning -- the model learns "systems
of systems" thinking from code before encountering sentence structure and
general knowledge.

## Model Details

| Parameter | Value |
|-----------|-------|
| Architecture | CRATE-α |
| Layers | 12 |
| Hidden dim | 768 |
| Attention heads | 6 |
| Vocab size | 50304 |
| Max sequence length | 1024 |
| Window pattern | SSSL |
| ODL expansion | 4× (overcomplete dictionary) |
| Sparse activation | ReLU with learnable threshold |
| Training step | 20,000 |
| Validation BPB | 1.1131 |
| Smooth train loss | 3.7495 |
| Training time | 3.4 hours |
| Run name | 4090-crate-a |
| Batch size (tokens) | 65536 |

## Files

- `model.safetensors` -- model weights in safetensors format
- `config.json` -- model architecture config (reconstruct with `CRATEConfig(**config)`)
- `tokenizer.pkl` -- BPE tokenizer (pickle of tiktoken Encoding)
- `token_bytes.pt` -- token byte mappings
- `meta.json` -- full training metadata from the checkpoint

## Usage

```python
import torch

from nanochat.checkpoint_manager import build_model

model, tokenizer, meta = build_model(
    "path/to/downloaded/dir",
    step=20000,
    device=torch.device("cuda"),
    phase="eval",
)
```

## References

- Yu et al., "White-Box Transformers via Sparse Rate Reduction" (NeurIPS 2023) -- original CRATE
- Yang et al., "Scaling White-Box Transformers for Vision" (NeurIPS 2024) -- CRATE-α
- Hendrycks et al., "Measuring Massive Multitask Language Understanding" (ICLR 2021) -- MMLU

## License

This model is released under the **MIT License**.

Built on:
- [nanochat-crate-a](https://github.com/modularflow/nanochat-crate-a) -- CRATE integration, self-supervised pipeline, SDPA/Flash Attention
- [nanochat](https://github.com/karpathy/nanochat) by Andrej Karpathy -- MIT License, Copyright (c) 2025
- [CRATE](https://github.com/Ma-Lab-Berkeley/CRATE) (White-Box Transformers via Sparse Rate Reduction) by Ma-Lab-Berkeley -- MIT License, Copyright (c) 2023
- [CRATE-α](https://github.com/UCSC-VLAA/CRATE-alpha) (Scaling White-Box Transformers for Vision) by UCSC-VLAA