Fix Unicode characters, add nanochat-crate-a fork reference
README.md
CHANGED
@@ -9,10 +9,15 @@ license: mit

# crate-d12-base

-A **CRATE-
-[nanochat](https://github.com/
+A **CRATE-α** (Coding RAte reduction TransformEr) language model trained with
+[nanochat-crate-a](https://github.com/modularflow/nanochat-crate-a), a fork of
+[nanochat](https://github.com/karpathy/nanochat) that integrates the CRATE
+white-box transformer architecture, SDPA/Flash Attention, and a self-supervised
+pseudo-labeling pipeline for domain-specific mid-training and fine-tuning.
+
+This checkpoint serves as the **baseline** for a series of experiments exploring
+self-supervised learning for mid-training and fine-tuning with the CRATE
+architecture.

## What is CRATE?
@@ -27,7 +32,7 @@ two operations:
is compressing token representations into low-dimensional subspaces.

2. **ODL (Overcomplete Dictionary Learning)** -- a *sparsification* step that
-projects tokens into an overcomplete dictionary space (4
+projects tokens into an overcomplete dictionary space (4× expansion),
applies a sparse activation, and projects back. This encourages the model to
learn sparse, interpretable representations at every layer.
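The ODL step described above can be sketched in a few lines. This is a minimal NumPy illustration, not the model's actual code: the function name, the untied up/down projections, and the fixed bias value are all assumptions made for clarity.

```python
import numpy as np

def odl_block(x, D_up, D_down, bias):
    """Sketch of an ODL (overcomplete dictionary) sparsification step.

    x:      (n_tokens, d)  token representations
    D_up:   (d, 4*d)       projection into the 4x-overcomplete dictionary
    D_down: (4*d, d)       projection back to model width
    bias:   scalar         learnable threshold (fixed here for illustration)
    """
    z = np.maximum(x @ D_up + bias, 0.0)  # expand, then ReLU-style sparse activation
    return x + z @ D_down                 # project back, with a residual connection

rng = np.random.default_rng(0)
d = 8
x = rng.standard_normal((4, d))
D_up = rng.standard_normal((d, 4 * d)) / np.sqrt(d)
D_down = rng.standard_normal((4 * d, d)) / np.sqrt(4 * d)
out = odl_block(x, D_up, D_down, bias=-0.5)
assert out.shape == x.shape  # the block preserves the token dimension
```

The negative bias makes the ReLU act as a one-sided threshold, so the intermediate code `z` is sparse by construction.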
@@ -39,15 +44,15 @@ those of standard transformers.
### Why ReLU Instead of Soft-Thresholding?

The original CRATE paper (NeurIPS 2023) used ISTA-style **soft-thresholding**
-as the sparse activation:
+as the sparse activation: `S_lambda(x) = sign(x) * max(|x| - lambda, 0)`.
This is the theoretically "correct" proximal operator for L1-regularized sparse
coding, but it caused training instability at scale.

-CRATE-
+CRATE-α (NeurIPS 2024) introduced three modifications that enable scaling:

-| Change | Vanilla CRATE | CRATE-
+| Change | Vanilla CRATE | CRATE-α |
|--------|--------------|------------|
-| Dictionary | Complete (d
+| Dictionary | Complete (d × d) | Overcomplete (d × 4d) |
| Activation | Soft-threshold | **ReLU** with learnable bias |
| Sparse block | No residual | **Residual connection** |
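For concreteness, the two activations compared above can be sketched as follows (NumPy; the learnable bias `b` is a simplified stand-in for the model's actual parameterization):

```python
import numpy as np

def soft_threshold(x, lam):
    # ISTA proximal operator for L1: shrinks toward zero,
    # producing exact zeros inside the [-lam, lam] band.
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def relu_bias(x, b):
    # CRATE-alpha-style activation: one-sided threshold via a
    # (normally learnable) bias, no shrinkage of surviving values' sign pattern.
    return np.maximum(x + b, 0.0)

x = np.array([-2.0, -0.3, 0.0, 0.4, 1.5])
print(soft_threshold(x, 0.5))  # two-sided: -2.0 shrinks to -1.5, 1.5 shrinks to 1.0
print(relu_bias(x, -0.5))      # one-sided: only 1.5 survives (as 1.0)
```

Both produce sparse outputs, but the ReLU variant discards the negative half-space entirely, which is where the overcomplete (4×) dictionary compensates by providing enough atoms to represent both signs.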
@@ -75,13 +80,13 @@ This checkpoint is the starting point for a multi-stage experimental pipeline:

```
crate-d12-base (this model)
+├── → Code self-supervised (learn structural patterns from code)
+│   ├── → Mid-training (adapt to chat/instruction format)
+│   ├── → General self-supervised (broad knowledge via SmolTalk)
+│   ├── → Math self-supervised (reasoning via GSM8K)
+│   └── → Chat SFT (final instruction tuning)
+├── → Direct mid-training (comparison branch)
+└── → Other experimental forks
```

The self-supervised stages use **pseudo-labeling**: the model generates candidate
@@ -100,14 +105,14 @@ general knowledge.

| Parameter | Value |
|-----------|-------|
-| Architecture | CRATE-
+| Architecture | CRATE-α |
| Layers | 12 |
| Hidden dim | 768 |
| Attention heads | 6 |
| Vocab size | 50304 |
| Max sequence length | 1024 |
| Window pattern | SSSL |
-| ODL expansion | 4
+| ODL expansion | 4× (overcomplete dictionary) |
| Sparse activation | ReLU with learnable threshold |
| Training step | 20,000 |
| Validation BPB | 1.1131 |
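Validation BPB is presumably bits per byte, the tokenizer-independent loss metric nanochat reports: cross-entropy converted to bits and normalized by the raw byte length of the evaluated text. A hedged sketch of that conversion (the numbers below are illustrative, not this model's actual eval counts):

```python
import math

def bits_per_byte(mean_nll_nats, n_tokens, n_bytes):
    """Convert mean per-token negative log-likelihood (nats) to bits per byte."""
    total_bits = mean_nll_nats * n_tokens / math.log(2)  # nats -> bits
    return total_bits / n_bytes                          # normalize by raw text size

# Illustrative numbers only:
bpb = bits_per_byte(mean_nll_nats=3.2, n_tokens=1000, n_bytes=4150)
```

Because the denominator is bytes rather than tokens, BPB stays comparable across models with different vocabularies.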
@@ -135,7 +140,7 @@ model, tokenizer, meta = build_model("path/to/downloaded/dir", step=20000, devic
## References

- Yu et al., "White-Box Transformers via Sparse Rate Reduction" (NeurIPS 2023) -- original CRATE
-- Yang et al., "Scaling White-Box Transformers for Vision" (NeurIPS 2024) -- CRATE-
+- Yang et al., "Scaling White-Box Transformers for Vision" (NeurIPS 2024) -- CRATE-α
- Hendrycks et al., "Measuring Massive Multitask Language Understanding" (ICLR 2021) -- MMLU

## License
@@ -143,6 +148,7 @@ model, tokenizer, meta = build_model("path/to/downloaded/dir", step=20000, devic
This model is released under the **MIT License**.

Built on:
+- [nanochat-crate-a](https://github.com/modularflow/nanochat-crate-a) -- CRATE integration, self-supervised pipeline, SDPA/Flash Attention
- [nanochat](https://github.com/karpathy/nanochat) by Andrej Karpathy -- MIT License, Copyright (c) 2025
- [CRATE](https://github.com/Ma-Lab-Berkeley/CRATE) (White-Box Transformers via Sparse Rate Reduction) by Ma-Lab-Berkeley -- MIT License, Copyright (c) 2023
-- [CRATE-
+- [CRATE-α](https://github.com/UCSC-VLAA/CRATE-alpha) (Scaling White-Box Transformers for Vision) by UCSC-VLAA