Fix Unicode characters, add nanochat-crate-a fork reference
README.md
CHANGED
@@ -9,10 +9,15 @@ license: mit

# crate-d12-base

-A **CRATE-
-[nanochat](https://github.com/
+A **CRATE-α** (Coding RAte reduction TransformEr) language model trained with
+[nanochat-crate-a](https://github.com/modularflow/nanochat-crate-a), a fork of
+[nanochat](https://github.com/karpathy/nanochat) that integrates the CRATE
+white-box transformer architecture, SDPA/Flash Attention, and a self-supervised
+pseudo-labeling pipeline for domain-specific mid-training and fine-tuning.
+
+This checkpoint serves as the **baseline** for a series of experiments exploring
+self-supervised learning for mid-training and fine-tuning with the CRATE
+architecture.

## What is CRATE?
@@ -27,7 +32,7 @@ two operations:
is compressing token representations into low-dimensional subspaces.

2. **ODL (Overcomplete Dictionary Learning)** -- a *sparsification* step that
-projects tokens into an overcomplete dictionary space (4
+projects tokens into an overcomplete dictionary space (4× expansion),
applies a sparse activation, and projects back. This encourages the model to
learn sparse, interpretable representations at every layer.
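The ODL step described above can be sketched in a few lines. This is a minimal NumPy illustration, not the model's actual code: the function name, the untied up/down projections, and the fixed bias value are all assumptions made for clarity.

```python
import numpy as np

def odl_block(x, D_up, D_down, bias):
    """Sketch of an ODL (overcomplete dictionary) sparsification step.

    x:      (n_tokens, d)  token representations
    D_up:   (d, 4*d)       projection into the 4x-overcomplete dictionary
    D_down: (4*d, d)       projection back to model width
    bias:   scalar         learnable threshold (fixed here for illustration)
    """
    z = np.maximum(x @ D_up + bias, 0.0)  # expand, then ReLU-style sparse activation
    return x + z @ D_down                 # project back, with a residual connection

rng = np.random.default_rng(0)
d = 8
x = rng.standard_normal((4, d))
D_up = rng.standard_normal((d, 4 * d)) / np.sqrt(d)
D_down = rng.standard_normal((4 * d, d)) / np.sqrt(4 * d)
out = odl_block(x, D_up, D_down, bias=-0.5)
assert out.shape == x.shape  # the block preserves the token dimension
```

The negative bias makes the ReLU act as a one-sided threshold, so the intermediate code `z` is sparse by construction.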
@@ -39,15 +44,15 @@ those of standard transformers.
### Why ReLU Instead of Soft-Thresholding?

The original CRATE paper (NeurIPS 2023) used ISTA-style **soft-thresholding**
-as the sparse activation:
+as the sparse activation: `S_lambda(x) = sign(x) * max(|x| - lambda, 0)`.
This is the theoretically "correct" proximal operator for L1-regularized sparse
coding, but it caused training instability at scale.

-CRATE-
+CRATE-α (NeurIPS 2024) introduced three modifications that enable scaling:

-| Change | Vanilla CRATE | CRATE-
+| Change | Vanilla CRATE | CRATE-α |
|--------|--------------|------------|
-| Dictionary | Complete (d
+| Dictionary | Complete (d × d) | Overcomplete (d × 4d) |
| Activation | Soft-threshold | **ReLU** with learnable bias |
| Sparse block | No residual | **Residual connection** |
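For concreteness, the two activations compared above can be sketched as follows (NumPy; the learnable bias `b` is a simplified stand-in for the model's actual parameterization):

```python
import numpy as np

def soft_threshold(x, lam):
    # ISTA proximal operator for L1: shrinks toward zero,
    # producing exact zeros inside the [-lam, lam] band.
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def relu_bias(x, b):
    # CRATE-alpha-style activation: one-sided threshold via a
    # (normally learnable) bias, no shrinkage of surviving values' sign pattern.
    return np.maximum(x + b, 0.0)

x = np.array([-2.0, -0.3, 0.0, 0.4, 1.5])
print(soft_threshold(x, 0.5))  # two-sided: -2.0 shrinks to -1.5, 1.5 shrinks to 1.0
print(relu_bias(x, -0.5))      # one-sided: only 1.5 survives (as 1.0)
```

Both produce sparse outputs, but the ReLU variant discards the negative half-space entirely, which is where the overcomplete (4×) dictionary compensates by providing enough atoms to represent both signs.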
@@ -75,13 +80,13 @@ This checkpoint is the starting point for a multi-stage experimental pipeline:

```
crate-d12-base (this model)
+├── → Code self-supervised (learn structural patterns from code)
+│   ├── → Mid-training (adapt to chat/instruction format)
+│   ├── → General self-supervised (broad knowledge via SmolTalk)
+│   ├── → Math self-supervised (reasoning via GSM8K)
+│   └── → Chat SFT (final instruction tuning)
+├── → Direct mid-training (comparison branch)
+└── → Other experimental forks
```

The self-supervised stages use **pseudo-labeling**: the model generates candidate
@@ -100,14 +105,14 @@ general knowledge.

| Parameter | Value |
|-----------|-------|
-| Architecture | CRATE-
+| Architecture | CRATE-α |
| Layers | 12 |
| Hidden dim | 768 |
| Attention heads | 6 |
| Vocab size | 50304 |
| Max sequence length | 1024 |
| Window pattern | SSSL |
-| ODL expansion | 4
+| ODL expansion | 4× (overcomplete dictionary) |
| Sparse activation | ReLU with learnable threshold |
| Training step | 20,000 |
| Validation BPB | 1.1131 |
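Validation BPB is presumably bits per byte, the tokenizer-independent loss metric nanochat reports: cross-entropy converted to bits and normalized by the raw byte length of the evaluated text. A hedged sketch of that conversion (the numbers below are illustrative, not this model's actual eval counts):

```python
import math

def bits_per_byte(mean_nll_nats, n_tokens, n_bytes):
    """Convert mean per-token negative log-likelihood (nats) to bits per byte."""
    total_bits = mean_nll_nats * n_tokens / math.log(2)  # nats -> bits
    return total_bits / n_bytes                          # normalize by raw text size

# Illustrative numbers only:
bpb = bits_per_byte(mean_nll_nats=3.2, n_tokens=1000, n_bytes=4150)
```

Because the denominator is bytes rather than tokens, BPB stays comparable across models with different vocabularies.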
@@ -135,7 +140,7 @@ model, tokenizer, meta = build_model("path/to/downloaded/dir", step=20000, devic
## References

- Yu et al., "White-Box Transformers via Sparse Rate Reduction" (NeurIPS 2023) -- original CRATE
-- Yang et al., "Scaling White-Box Transformers for Vision" (NeurIPS 2024) -- CRATE-
+- Yang et al., "Scaling White-Box Transformers for Vision" (NeurIPS 2024) -- CRATE-α
- Hendrycks et al., "Measuring Massive Multitask Language Understanding" (ICLR 2021) -- MMLU

## License
@@ -143,6 +148,7 @@ model, tokenizer, meta = build_model("path/to/downloaded/dir", step=20000, devic
This model is released under the **MIT License**.

Built on:
+- [nanochat-crate-a](https://github.com/modularflow/nanochat-crate-a) -- CRATE integration, self-supervised pipeline, SDPA/Flash Attention
- [nanochat](https://github.com/karpathy/nanochat) by Andrej Karpathy -- MIT License, Copyright (c) 2025
- [CRATE](https://github.com/Ma-Lab-Berkeley/CRATE) (White-Box Transformers via Sparse Rate Reduction) by Ma-Lab-Berkeley -- MIT License, Copyright (c) 2023
-- [CRATE-
+- [CRATE-α](https://github.com/UCSC-VLAA/CRATE-alpha) (Scaling White-Box Transformers for Vision) by UCSC-VLAA