# crate-d12-base
A CRATE-α (Coding RAte reduction TransformEr) language model trained with nanochat-crate-a, a fork of nanochat that integrates the CRATE white-box transformer architecture, SDPA/Flash Attention, and a self-supervised pseudo-labeling pipeline for domain-specific mid-training and fine-tuning.
This checkpoint serves as the baseline for a series of experiments exploring self-supervised learning for mid-training and fine-tuning with the CRATE architecture.
## What is CRATE?
CRATE is a white-box transformer -- unlike standard transformers, whose architecture is heuristically designed, every layer of CRATE is mathematically derived from a principled optimization objective. Each layer alternates between two operations:
- **MSSA (Multi-Head Subspace Self-Attention)** -- a compression step that performs gradient descent on the coding rate reduction objective. Q, K, and V share a single tied projection matrix, which means the attention operation compresses token representations into low-dimensional subspaces.
- **ODL (Overcomplete Dictionary Learning)** -- a sparsification step that projects tokens into an overcomplete dictionary space (4× expansion), applies a sparse activation, and projects back. This encourages the model to learn sparse, interpretable representations at every layer.
The net effect is that each forward pass solves a structured optimization problem: compress and sparsify the representation, layer by layer. The resulting internal representations are significantly more interpretable than those of standard transformers.
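The MSSA step above can be sketched as ordinary scaled-dot-product attention in which Q, K, and V all come from one tied projection. This is a simplified illustrative sketch (names and dimensions are assumptions, not the repo's actual implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MSSA(nn.Module):
    """Subspace self-attention sketch: Q, K, and V all come from one tied matrix U."""
    def __init__(self, d: int, heads: int = 6):
        super().__init__()
        assert d % heads == 0
        self.heads = heads
        self.U = nn.Linear(d, d, bias=False)  # the single shared projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, d = x.shape
        # Project once, split into heads: (B, heads, T, d // heads)
        u = self.U(x).view(B, T, self.heads, d // self.heads).transpose(1, 2)
        # SDPA with q = k = v = U @ x: compression into learned subspaces
        out = F.scaled_dot_product_attention(u, u, u)
        out = out.transpose(1, 2).reshape(B, T, d)
        return x + out  # residual connection
```

Because the projection is tied, attention here acts as projection onto a set of learned subspaces rather than a general query-key match.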
## Why ReLU Instead of Soft-Thresholding?
The original CRATE paper (NeurIPS 2023) used ISTA-style soft-thresholding as the sparse activation: `S_lambda(x) = sign(x) * max(|x| - lambda, 0)`. This is the theoretically "correct" proximal operator for L1-regularized sparse coding, but it caused training instability at scale. The repository exposes both activations as options.
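The soft-thresholding operator is a one-liner in PyTorch (a minimal sketch; `soft_threshold` is an illustrative name, not a function from the repo):

```python
import torch

def soft_threshold(x: torch.Tensor, lam: float) -> torch.Tensor:
    # S_lambda(x) = sign(x) * max(|x| - lambda, 0): shrink toward zero by lam,
    # zeroing anything with magnitude below lam
    return torch.sign(x) * torch.clamp(x.abs() - lam, min=0.0)

out = soft_threshold(torch.tensor([-2.0, 0.5, 3.0]), 1.0)  # -> [-1.0, 0.0, 2.0]
```

Note the sign discontinuity at zero: the operator shrinks positive and negative values symmetrically, which is exactly the part ReLU discards.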
CRATE-α (NeurIPS 2024) introduced three modifications that enable scaling:
| Change | Vanilla CRATE | CRATE-α |
|---|---|---|
| Dictionary | Complete (d × d) | Overcomplete (d × 4d) |
| Activation | Soft-threshold | ReLU with learnable bias |
| Sparse block | No residual | Residual connection |
ReLU works better for scaling because: (a) it has a well-behaved gradient everywhere (no sign discontinuity), (b) the learnable threshold/bias allows each neuron to adaptively set its own sparsity level during training, and (c) ReLU is heavily optimized in GPU kernels. The resulting ODL block looks structurally similar to a standard MLP -- but it is derived from sparse coding principles rather than heuristically chosen, giving it a principled interpretation as dictionary learning.
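With those three changes, the ODL block can be sketched as follows (an illustrative PyTorch sketch with assumed names, showing the 4× overcomplete dictionary, ReLU with learnable bias, and the residual connection):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ODL(nn.Module):
    """CRATE-alpha sparse coding step (sketch): overcomplete dictionary,
    ReLU with learnable per-neuron bias, residual connection."""
    def __init__(self, d: int, expansion: int = 4):
        super().__init__()
        # Encoder bias acts as the learnable threshold: relu(Wx + b) zeroes
        # any code whose pre-activation falls below its neuron's threshold
        self.encode = nn.Linear(d, expansion * d, bias=True)
        self.decode = nn.Linear(expansion * d, d, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = F.relu(self.encode(x))   # sparse nonnegative codes in 4x dictionary space
        return x + self.decode(z)    # residual connection (absent in vanilla CRATE)
```

Structurally this is a standard MLP block, as the text notes; the sparse-coding derivation is what licenses reading `z` as dictionary coefficients.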
## Evaluation: MMLU
This model is evaluated against MMLU (Massive Multitask Language Understanding), a benchmark of 57 subjects spanning STEM, humanities, social sciences, and professional domains. MMLU tests the model's ability to answer multiple-choice questions requiring world knowledge and reasoning -- from abstract algebra and anatomy to US foreign policy and virology. It provides a broad signal for how much general knowledge the model has absorbed during pre-training.
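A common way to run a base model on MMLU-style multiple choice is to score each candidate answer by its average log-probability under the model and pick the highest. A hedged sketch (`model` and `tokenizer` stand in for the loaded checkpoint; the exact harness used for this model may differ):

```python
import torch

def score_choice(model, tokenizer, prompt: str, choice: str) -> float:
    """Average log-probability of the choice tokens given the prompt."""
    prompt_ids = tokenizer.encode(prompt)
    choice_ids = tokenizer.encode(choice)
    ids = torch.tensor([prompt_ids + choice_ids])
    with torch.no_grad():
        logits = model(ids)                    # (1, T, vocab)
    logprobs = torch.log_softmax(logits, dim=-1)
    total = 0.0
    for i, tok in enumerate(choice_ids):
        pos = len(prompt_ids) + i - 1          # logits at pos predict the token at pos + 1
        total += logprobs[0, pos, tok].item()
    return total / len(choice_ids)

# best = max("ABCD", key=lambda c: score_choice(model, tokenizer, question, c))
```

Averaging (rather than summing) log-probabilities avoids penalizing answer options that tokenize into more pieces.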
## Baseline for Self-Supervised Experiments
This checkpoint is the starting point for a multi-stage experimental pipeline:
```
crate-d12-base (this model)
├── → Code self-supervised (learn structural patterns from code)
│   └── → Mid-training (adapt to chat/instruction format)
│       └── → General self-supervised (broad knowledge via SmolTalk)
│           └── → Math self-supervised (reasoning via GSM8K)
│               └── → Chat SFT (final instruction tuning)
├── → Direct mid-training (comparison branch)
└── → Other experimental forks
```
The self-supervised stages use pseudo-labeling: the model generates candidate responses for unlabeled prompts, scores them by confidence (average log-probability) or task reward, filters to the highest-quality candidates, and trains on the result. This loop can be iterated multiple times, progressively improving the model's own training signal.
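One round of that loop can be sketched generically (an illustrative sketch; `generate` and `score` stand in for the model's sampling routine and its confidence scorer, e.g. average log-probability or task reward):

```python
def pseudo_label_round(prompts, generate, score, keep_frac=0.25):
    """One pseudo-labeling round: sample candidate responses for each prompt,
    score them (higher = better), and keep the top fraction as training pairs."""
    scored = []
    for p in prompts:
        for cand in generate(p):               # candidate responses for prompt p
            scored.append((score(p, cand), p, cand))
    scored.sort(key=lambda t: t[0], reverse=True)
    keep = max(1, int(len(scored) * keep_frac))
    return [(p, cand) for _, p, cand in scored[:keep]]
```

The returned (prompt, response) pairs become the fine-tuning set for the next iteration, so each round sharpens the model's own training signal.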
The hypothesis driving the pipeline order is that learning code structure first (syntax, nesting, logical flow) provides transferable structural priors that benefit subsequent natural language learning -- the model learns "systems of systems" thinking from code before encountering sentence structure and general knowledge.
## Model Details
| Parameter | Value |
|---|---|
| Architecture | CRATE-α |
| Layers | 12 |
| Hidden dim | 768 |
| Attention heads | 6 |
| Vocab size | 50304 |
| Max sequence length | 1024 |
| Window pattern | SSSL |
| ODL expansion | 4× (overcomplete dictionary) |
| Sparse activation | ReLU with learnable threshold |
| Training step | 20,000 |
| Validation BPB | 1.1131 |
| Smooth train loss | 3.7495 |
| Training time | 3.4 hours |
| Run name | 4090-crate-a |
| Batch size (tokens) | 65536 |
## Files
- `model.safetensors` -- model weights in safetensors format
- `config.json` -- model architecture config (reconstruct with `CRATEConfig(**config)`)
- `tokenizer.pkl` -- BPE tokenizer (pickle of tiktoken Encoding)
- `token_bytes.pt` -- token byte mappings
- `meta.json` -- full training metadata from the checkpoint
## Usage
```python
import torch
from nanochat.checkpoint_manager import build_model

model, tokenizer, meta = build_model(
    "path/to/downloaded/dir",
    step=20000,
    device=torch.device("cuda"),
    phase="eval",
)
```
## References
- Yu et al., "White-Box Transformers via Sparse Rate Reduction" (NeurIPS 2023) -- original CRATE
- Yang et al., "Scaling White-Box Transformers for Vision" (NeurIPS 2024) -- CRATE-α
- Hendrycks et al., "Measuring Massive Multitask Language Understanding" (ICLR 2021) -- MMLU
## License
This model is released under the MIT License.
Built on:
- nanochat-crate-a -- CRATE integration, self-supervised pipeline, SDPA/Flash Attention
- nanochat by Andrej Karpathy -- MIT License, Copyright (c) 2025
- CRATE (White-Box Transformers via Sparse Rate Reduction) by Ma-Lab-Berkeley -- MIT License, Copyright (c) 2023
- CRATE-α (Scaling White-Box Transformers for Vision) by UCSC-VLAA