---
tags:
- nanochat
- crate
- white-box
- sparse-coding
license: mit
---

# crate-d12-base

A **CRATE-α** (Coding RAte reduction TransformEr) language model trained with
[nanochat-crate-a](https://github.com/modularflow/nanochat-crate-a), a fork of
[nanochat](https://github.com/karpathy/nanochat) that integrates the CRATE
white-box transformer architecture, SDPA/Flash Attention, and a self-supervised
pseudo-labeling pipeline for domain-specific mid-training and fine-tuning.

This checkpoint serves as the **baseline** for a series of experiments exploring
self-supervised learning for mid-training and fine-tuning with the CRATE
architecture.

## What is CRATE?

CRATE is a **white-box transformer** -- unlike standard transformers, whose
architecture is heuristically designed, every layer of CRATE is mathematically
derived from a principled optimization objective. Each layer alternates between
two operations:

1. **MSSA (Multi-Head Subspace Self-Attention)** -- a *compression* step that
   performs gradient descent on the *coding rate reduction* objective. Q, K, and
   V share a single tied projection matrix, which means the attention operation
   compresses token representations into low-dimensional subspaces.

2. **ODL (Overcomplete Dictionary Learning)** -- a *sparsification* step that
   projects tokens into an overcomplete dictionary space (4× expansion),
   applies a sparse activation, and projects back. This encourages the model to
   learn sparse, interpretable representations at every layer.

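The alternation of these two steps can be sketched in PyTorch. This is an
illustrative sketch only -- module names, normalization placement, and head
bookkeeping are simplified relative to the actual nanochat-crate-a
implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CRATELayer(nn.Module):
    """One CRATE-alpha style block: MSSA compression, then ODL sparsification.

    Simplified sketch -- not the actual nanochat-crate-a module.
    """

    def __init__(self, d_model=768, n_heads=6, expansion=4):
        super().__init__()
        self.n_heads = n_heads
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)
        # MSSA: Q, K, and V share ONE tied projection (the subspace bases).
        self.subspace = nn.Linear(d_model, d_model, bias=False)
        # ODL: overcomplete dictionary (d -> 4d) and its decoder (4d -> d).
        self.encode = nn.Linear(d_model, expansion * d_model)  # bias = learnable threshold
        self.decode = nn.Linear(expansion * d_model, d_model)

    def forward(self, x):
        # Compression step: attention where query, key, and value are all
        # the SAME tied projection of the normalized input.
        z = self.subspace(self.ln1(x))
        B, T, d = z.shape
        h = z.view(B, T, self.n_heads, d // self.n_heads).transpose(1, 2)
        attn = F.scaled_dot_product_attention(h, h, h, is_causal=True)  # SDPA/Flash
        x = x + attn.transpose(1, 2).reshape(B, T, d)
        # Sparsification step: project into the 4x overcomplete dictionary,
        # apply ReLU (one-sided thresholding), project back, with residual.
        x = x + self.decode(F.relu(self.encode(self.ln2(x))))
        return x
```
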
The net effect is that each forward pass solves a structured optimization
problem: *compress* and *sparsify* the representation, layer by layer. The
resulting internal representations are significantly more interpretable than
those of standard transformers.

### Why ReLU Instead of Soft-Thresholding?

The original CRATE paper (NeurIPS 2023) used ISTA-style **soft-thresholding**
as the sparse activation: `S_lambda(x) = sign(x) * max(|x| - lambda, 0)`.
This is the theoretically "correct" proximal operator for L1-regularized sparse
coding, but it caused training instability at scale. The repository supports
both activations as configuration options.
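
The two activations are easy to compare directly. A standalone sketch (not the
repository's code; `lam` and `bias` are fixed scalars here, whereas CRATE-α
learns the bias per neuron):

```python
import torch

def soft_threshold(x, lam=0.1):
    # ISTA proximal operator for the L1 penalty:
    # shrink every entry toward zero by lam, zeroing the small ones.
    return torch.sign(x) * torch.clamp(x.abs() - lam, min=0.0)

def relu_threshold(x, bias=-0.1):
    # CRATE-alpha variant: one-sided shrinkage via ReLU,
    # with the bias acting as the threshold.
    return torch.relu(x + bias)

x = torch.tensor([-0.5, -0.05, 0.0, 0.05, 0.5])
print(soft_threshold(x))  # small entries zeroed, large ones shrunk by 0.1
print(relu_threshold(x))  # negatives removed entirely, positives shrunk
```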

CRATE-α (NeurIPS 2024) introduced three modifications that enable scaling:

| Change | Vanilla CRATE | CRATE-α |
|--------|---------------|---------|
| Dictionary | Complete (d × d) | Overcomplete (d × 4d) |
| Activation | Soft-threshold | **ReLU** with learnable bias |
| Sparse block | No residual | **Residual connection** |

**ReLU** works better for scaling because: (a) it has a well-behaved gradient
everywhere (no sign discontinuity), (b) the learnable threshold/bias allows
each neuron to adaptively set its own sparsity level during training, and
(c) ReLU is heavily optimized in GPU kernels. The resulting ODL block looks
structurally similar to a standard MLP -- but it is *derived from* sparse coding
principles rather than heuristically chosen, giving it a principled
interpretation as dictionary learning.
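
The bias-as-threshold design makes sparsity directly measurable: with a
negative bias, most dictionary activations are exactly zero. A quick
illustrative check (not the repo's code; sizes match the table above, the
batch and bias value are arbitrary):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d, expansion = 768, 4
encode = torch.nn.Linear(d, expansion * d)  # overcomplete dictionary encoder
with torch.no_grad():
    encode.bias.fill_(-0.5)  # a negative bias acts as an activation threshold

tokens = torch.randn(16, d)                  # a batch of token representations
acts = F.relu(encode(tokens))                # sparse dictionary coefficients
sparsity = (acts == 0).float().mean().item()
print(f"fraction of zero activations: {sparsity:.2f}")
```

During training the bias is learned per dictionary atom, so each atom settles
on its own sparsity level.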

## Evaluation: MMLU

This model is evaluated against **MMLU** (Massive Multitask Language
Understanding), a benchmark of 57 subjects spanning STEM, humanities, social
sciences, and professional domains. MMLU tests the model's ability to answer
multiple-choice questions requiring world knowledge and reasoning -- from
abstract algebra and anatomy to US foreign policy and virology. It provides a
broad signal for how much general knowledge the model has absorbed during
pre-training.

## Baseline for Self-Supervised Experiments

This checkpoint is the starting point for a multi-stage experimental pipeline:

```
crate-d12-base (this model)
├── Code self-supervised (learn structural patterns from code)
│   └── Mid-training (adapt to chat/instruction format)
│       └── General self-supervised (broad knowledge via SmolTalk)
│           └── Math self-supervised (reasoning via GSM8K)
│               └── Chat SFT (final instruction tuning)
├── Direct mid-training (comparison branch)
└── Other experimental forks
```

The self-supervised stages use **pseudo-labeling**: the model generates candidate
responses for unlabeled prompts, scores them by confidence (average log-probability)
or task reward, filters to the highest-quality candidates, and trains on the
result. This loop can be iterated multiple times, progressively improving the
model's own training signal.
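
The confidence-filtering step can be sketched as follows. Function and variable
names here are hypothetical -- the actual pipeline lives in nanochat-crate-a:

```python
def select_pseudo_labels(candidates, keep_fraction=0.25):
    """Keep the most confident model-generated candidates for self-training.

    `candidates` holds (prompt, response, token_logprobs) tuples, where
    token_logprobs are the model's per-token log-probabilities.
    """
    # Confidence = average log-probability per generated token.
    scored = [
        (sum(lps) / len(lps), prompt, response)
        for prompt, response, lps in candidates
    ]
    scored.sort(reverse=True)  # most confident first
    k = max(1, int(len(scored) * keep_fraction))
    return [(prompt, response) for _, prompt, response in scored[:k]]

# Two candidates for the same prompt; only the confident one survives.
cands = [
    ("q1", "a1", [-0.1, -0.2]),  # avg log-prob -0.15 (high confidence)
    ("q1", "a2", [-2.0, -3.0]),  # avg log-prob -2.5 (low confidence)
]
print(select_pseudo_labels(cands, keep_fraction=0.5))  # [('q1', 'a1')]
```

Iterating the loop re-scores fresh generations with the updated model, so the
training signal improves as the model does.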

The hypothesis driving the pipeline order is that learning **code structure
first** (syntax, nesting, logical flow) provides transferable structural priors
that benefit subsequent natural language learning -- the model learns "systems
of systems" thinking from code before encountering sentence structure and
general knowledge.

## Model Details

| Parameter | Value |
|-----------|-------|
| Architecture | CRATE-α |
| Layers | 12 |
| Hidden dim | 768 |
| Attention heads | 6 |
| Vocab size | 50304 |
| Max sequence length | 1024 |
| Window pattern | SSSL |
| ODL expansion | 4× (overcomplete dictionary) |
| Sparse activation | ReLU with learnable threshold |
| Training step | 20,000 |
| Validation BPB | 1.1131 |
| Smooth train loss | 3.7495 |
| Training time | 3.4 hours |
| Run name | 4090-crate-a |
| Batch size (tokens) | 65536 |

## Files

- `model.safetensors` -- model weights in safetensors format
- `config.json` -- model architecture config (reconstruct with `CRATEConfig(**config)`)
- `tokenizer.pkl` -- BPE tokenizer (pickle of a tiktoken `Encoding`)
- `token_bytes.pt` -- token byte mappings
- `meta.json` -- full training metadata from the checkpoint

## Usage

```python
import torch

from nanochat.checkpoint_manager import build_model

model, tokenizer, meta = build_model(
    "path/to/downloaded/dir", step=20000, device=torch.device("cuda"), phase="eval"
)
```

## References

- Yu et al., "White-Box Transformers via Sparse Rate Reduction" (NeurIPS 2023) -- original CRATE
- Yang et al., "Scaling White-Box Transformers for Vision" (NeurIPS 2024) -- CRATE-α
- Hendrycks et al., "Measuring Massive Multitask Language Understanding" (ICLR 2021) -- MMLU

## License

This model is released under the **MIT License**.

Built on:

- [nanochat-crate-a](https://github.com/modularflow/nanochat-crate-a) -- CRATE integration, self-supervised pipeline, SDPA/Flash Attention
- [nanochat](https://github.com/karpathy/nanochat) by Andrej Karpathy -- MIT License, Copyright (c) 2025
- [CRATE](https://github.com/Ma-Lab-Berkeley/CRATE) (White-Box Transformers via Sparse Rate Reduction) by Ma-Lab-Berkeley -- MIT License, Copyright (c) 2023
- [CRATE-α](https://github.com/UCSC-VLAA/CRATE-alpha) (Scaling White-Box Transformers for Vision) by UCSC-VLAA