---
tags:
- nanochat
- crate
- white-box
- sparse-coding
license: mit
---
# crate-d12-base
A **CRATE-α** (Coding RAte reduction TransformEr) language model trained with
[nanochat-crate-a](https://github.com/modularflow/nanochat-crate-a), a fork of
[nanochat](https://github.com/karpathy/nanochat) that integrates the CRATE
white-box transformer architecture, SDPA/Flash Attention, and a self-supervised
pseudo-labeling pipeline for domain-specific mid-training and fine-tuning.
This checkpoint serves as the **baseline** for a series of experiments exploring
self-supervised learning for mid-training and fine-tuning with the CRATE
architecture.
## What is CRATE?
CRATE is a **white-box transformer** -- unlike standard transformers where the
architecture is heuristically designed, every layer of CRATE is mathematically
derived from a principled optimization objective. Each layer alternates between
two operations:
1. **MSSA (Multi-Head Subspace Self-Attention)** -- a *compression* step that
performs gradient descent on the *coding rate reduction* objective. Q, K, and
V share a single tied projection matrix, which means the attention operation
is compressing token representations into low-dimensional subspaces.
2. **ODL (Overcomplete Dictionary Learning)** -- a *sparsification* step that
projects tokens into an overcomplete dictionary space (4× expansion),
applies a sparse activation, and projects back. This encourages the model to
learn sparse, interpretable representations at every layer.
The net effect is that each forward pass solves a structured optimization
problem: *compress* and *sparsify* the representation, layer by layer. The
resulting internal representations are significantly more interpretable than
those of standard transformers.
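The two operations above can be sketched as a minimal PyTorch layer. This is an illustrative sketch only -- module and parameter names are hypothetical, not the actual nanochat-crate-a API:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MSSA(nn.Module):
    """Compression step: Q, K, and V share one tied subspace projection U."""
    def __init__(self, d, heads):
        super().__init__()
        self.heads = heads
        self.U = nn.Linear(d, d, bias=False)    # single tied projection
        self.out = nn.Linear(d, d, bias=False)

    def forward(self, x):
        B, T, d = x.shape
        h = self.heads
        # The same projected tensor serves as query, key, and value
        u = self.U(x).view(B, T, h, d // h).transpose(1, 2)
        attn = F.scaled_dot_product_attention(u, u, u)  # SDPA/Flash path
        return self.out(attn.transpose(1, 2).reshape(B, T, d))

class ODL(nn.Module):
    """Sparsification step: 4x overcomplete dictionary + ReLU with bias."""
    def __init__(self, d, expansion=4):
        super().__init__()
        self.D = nn.Linear(d, expansion * d, bias=True)  # bias = learnable threshold
        self.D_T = nn.Linear(expansion * d, d, bias=False)

    def forward(self, x):
        return self.D_T(F.relu(self.D(x)))  # sparse codes, projected back

class CRATELayer(nn.Module):
    def __init__(self, d=768, heads=6):
        super().__init__()
        self.mssa, self.odl = MSSA(d, heads), ODL(d)
        self.n1, self.n2 = nn.LayerNorm(d), nn.LayerNorm(d)

    def forward(self, x):
        x = x + self.mssa(self.n1(x))  # compress
        x = x + self.odl(self.n2(x))   # sparsify (residual form, per CRATE-alpha)
        return x

x = torch.randn(2, 16, 768)
y = CRATELayer()(x)
print(y.shape)  # torch.Size([2, 16, 768])
```

Stacking 12 such layers with the dimensions from the Model Details table below yields the overall shape of this checkpoint.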
### Why ReLU Instead of Soft-Thresholding?
The original CRATE paper (NeurIPS 2023) used ISTA-style **soft-thresholding**
as the sparse activation: `S_lambda(x) = sign(x) * max(|x| - lambda, 0)`.
This is the theoretically "correct" proximal operator for L1-regularized sparse
coding, but it caused training instability at scale. Both activations remain
available as options in the repository.
CRATE-α (NeurIPS 2024) introduced three modifications that enable scaling:
| Change | Vanilla CRATE | CRATE-α |
|--------|---------------|---------|
| Dictionary | Complete (d × d) | Overcomplete (d × 4d) |
| Activation | Soft-threshold | **ReLU** with learnable bias |
| Sparse block | No residual | **Residual connection** |
**ReLU** works better for scaling because: (a) it has a well-behaved gradient
everywhere (no sign discontinuity), (b) the learnable threshold/bias allows
each neuron to adaptively set its own sparsity level during training, and
(c) ReLU is heavily optimized in GPU kernels. The resulting ODL block looks
structurally similar to a standard MLP -- but it is *derived from* sparse coding
principles rather than heuristically chosen, giving it a principled
interpretation as dictionary learning.
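The difference between the two activations fits in a few lines -- a standalone sketch, not the repository's implementation (in the model the threshold/bias is a learned parameter, fixed here at 1.0 for illustration):

```python
import torch

def soft_threshold(x, lam):
    # ISTA proximal operator for the L1 penalty:
    # S_lambda(x) = sign(x) * max(|x| - lambda, 0); shrinks both signs toward 0
    return torch.sign(x) * torch.clamp(x.abs() - lam, min=0.0)

def relu_with_bias(x, bias):
    # CRATE-alpha replacement: one-sided thresholding; in the model, `bias`
    # is learnable, so each neuron adapts its own sparsity level
    return torch.relu(x - bias)

x = torch.tensor([-2.0, -0.5, 0.5, 2.0])
soft = soft_threshold(x, 1.0)    # keeps large values of either sign
sparse = relu_with_bias(x, 1.0)  # keeps only values above the threshold
```

Both zero out small activations; ReLU drops the `sign` discontinuity and has a constant gradient of 1 everywhere above the threshold.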
## Evaluation: MMLU
This model is evaluated on **MMLU** (Massive Multitask Language
Understanding), a benchmark of 57 subjects spanning STEM, humanities, social
sciences, and professional domains. MMLU tests the model's ability to answer
multiple-choice questions requiring world knowledge and reasoning -- from
abstract algebra and anatomy to US foreign policy and virology. It provides a
broad signal for how much general knowledge the model has absorbed during
pre-training.
## Baseline for Self-Supervised Experiments
This checkpoint is the starting point for a multi-stage experimental pipeline:
```
crate-d12-base (this model)
β”œβ”€β”€ β†’ Code self-supervised (learn structural patterns from code)
β”‚ └── β†’ Mid-training (adapt to chat/instruction format)
β”‚ └── β†’ General self-supervised (broad knowledge via SmolTalk)
β”‚ └── β†’ Math self-supervised (reasoning via GSM8K)
β”‚ └── β†’ Chat SFT (final instruction tuning)
β”œβ”€β”€ β†’ Direct mid-training (comparison branch)
└── β†’ Other experimental forks
```
The self-supervised stages use **pseudo-labeling**: the model generates candidate
responses for unlabeled prompts, scores them by confidence (average log-probability)
or task reward, filters to the highest-quality candidates, and trains on the
result. This loop can be iterated multiple times, progressively improving the
model's own training signal.
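One round of this loop can be sketched in plain Python. The `generate` and `score` callables here are stand-ins for the model's actual sampling and scoring code, not functions from the repository:

```python
def average_logprob(token_logprobs):
    # Confidence score: mean per-token log-probability of a candidate response
    return sum(token_logprobs) / len(token_logprobs)

def pseudo_label(prompts, generate, score, keep_top=0.5):
    """One pseudo-labeling round: sample candidates, score, filter, and
    return new training pairs. `generate(prompt)` yields
    (response, token_logprobs) pairs."""
    scored = []
    for prompt in prompts:
        for response, logprobs in generate(prompt):
            scored.append((score(logprobs), prompt, response))
    scored.sort(key=lambda t: t[0], reverse=True)         # most confident first
    kept = scored[: max(1, int(len(scored) * keep_top))]  # quality filter
    return [(p, r) for _, p, r in kept]                   # train on these next

# Toy demo: two candidates per prompt, one clearly more confident
fake_generate = lambda p: [(p + " -> A", [-0.1, -0.2]),
                           (p + " -> B", [-2.0, -3.0])]
pairs = pseudo_label(["q1"], fake_generate, average_logprob)
# pairs == [("q1", "q1 -> A")]
```

Iterating the loop re-runs `generate` with the freshly trained model, so the filtered training signal can improve round over round.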
The hypothesis driving the pipeline order is that learning **code structure
first** (syntax, nesting, logical flow) provides transferable structural priors
that benefit subsequent natural language learning -- the model learns "systems
of systems" thinking from code before encountering sentence structure and
general knowledge.
## Model Details
| Parameter | Value |
|-----------|-------|
| Architecture | CRATE-α |
| Layers | 12 |
| Hidden dim | 768 |
| Attention heads | 6 |
| Vocab size | 50304 |
| Max sequence length | 1024 |
| Window pattern | SSSL |
| ODL expansion | 4× (overcomplete dictionary) |
| Sparse activation | ReLU with learnable threshold |
| Training step | 20,000 |
| Validation BPB | 1.1131 |
| Smooth train loss | 3.7495 |
| Training time | 3.4 hours |
| Run name | 4090-crate-a |
| Batch size (tokens) | 65536 |
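As a plain dict, the architecture rows of the table map to something like the following. Field names here are illustrative, not the actual `CRATEConfig` schema -- the authoritative version is `config.json` in this repo:

```python
# Illustrative config mirroring the table above (hypothetical field names)
config = {
    "n_layer": 12,
    "n_embd": 768,
    "n_head": 6,
    "vocab_size": 50304,
    "sequence_len": 1024,
    "odl_expansion": 4,   # overcomplete dictionary: 768 -> 3072 -> 768
}
head_dim = config["n_embd"] // config["n_head"]  # 768 / 6 = 128 per head
```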
## Files
- `model.safetensors` -- model weights in safetensors format
- `config.json` -- model architecture config (reconstruct with `CRATEConfig(**config)`)
- `tokenizer.pkl` -- BPE tokenizer (pickle of tiktoken Encoding)
- `token_bytes.pt` -- token byte mappings
- `meta.json` -- full training metadata from the checkpoint
## Usage
```python
import torch
from nanochat.checkpoint_manager import build_model

model, tokenizer, meta = build_model(
    "path/to/downloaded/dir", step=20000, device=torch.device("cuda"), phase="eval"
)
```
## References
- Yu et al., "White-Box Transformers via Sparse Rate Reduction" (NeurIPS 2023) -- original CRATE
- Yang et al., "Scaling White-Box Transformers for Vision" (NeurIPS 2024) -- CRATE-Ξ±
- Hendrycks et al., "Measuring Massive Multitask Language Understanding" (ICLR 2021) -- MMLU
## License
This model is released under the **MIT License**.
Built on:
- [nanochat-crate-a](https://github.com/modularflow/nanochat-crate-a) -- CRATE integration, self-supervised pipeline, SDPA/Flash Attention
- [nanochat](https://github.com/karpathy/nanochat) by Andrej Karpathy -- MIT License, Copyright (c) 2025
- [CRATE](https://github.com/Ma-Lab-Berkeley/CRATE) (White-Box Transformers via Sparse Rate Reduction) by Ma-Lab-Berkeley -- MIT License, Copyright (c) 2023
- [CRATE-α](https://github.com/UCSC-VLAA/CRATE-alpha) (Scaling White-Box Transformers for Vision) by UCSC-VLAA