Add README
README.md
ADDED
@@ -0,0 +1,44 @@
# CPU-1 Ablation Study: Ready-to-use Checkpoints

Repo: `Cukinator/cpu1-ablations-final`
Source (compact 2-bit): `Cukinator/cpu1-ablation-checkpoints`

Each checkpoint is a standard PyTorch `.pt` file with float32 weights, so no
unpacking is needed. The checkpoints are compatible with `train_ablation.py` from the
[1.58bits repo](https://github.com/Cukinator/1.58bits).

## Quick start

```python
import sys

import torch

# Make train_ablation.py from the 1.58bits repo importable.
sys.path.insert(0, "/path/to/1.58bits")
from train_ablation import build_ablation_model, generate

# Each checkpoint is a dict holding the model config and its float32 weights.
ckpt = torch.load("run_02/model.pt", map_location="cpu")
model = build_ablation_model(ckpt["config"])
model.load_state_dict(ckpt["state_dict"])
model.eval()

text = generate(model, "The quick brown fox", 128, ckpt["config"], torch.device("cpu"))
print(text)
```
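Before building a model, it can help to confirm that a downloaded file has the layout the quick start relies on. The sketch below is illustrative only: `summarize_checkpoint` is a hypothetical helper, not part of `train_ablation.py`, and it assumes the checkpoint is a dict with `"config"` and `"state_dict"` keys whose `state_dict` values are tensors.

```python
from collections import Counter


def summarize_checkpoint(ckpt):
    """Return (element_count, elements_per_dtype) for a loaded checkpoint.

    Hypothetical helper. Assumes the layout used in the quick start:
    a dict with "config" and "state_dict" keys, where "state_dict"
    maps parameter names to tensors.
    """
    missing = {"config", "state_dict"} - set(ckpt)
    if missing:
        raise KeyError(f"checkpoint is missing keys: {sorted(missing)}")

    total = 0
    per_dtype = Counter()
    for tensor in ckpt["state_dict"].values():
        n = tensor.numel()  # number of elements in this tensor
        total += n
        per_dtype[str(tensor.dtype)] += n
    return total, dict(per_dtype)


# Usage, assuming PyTorch is installed and the file exists locally:
#   import torch
#   ckpt = torch.load("run_02/model.pt", map_location="cpu")
#   n_elements, per_dtype = summarize_checkpoint(ckpt)
#   print(f"{n_elements:,} elements, by dtype: {per_dtype}")
```

If the weights are stored as float32 as described, the dtype summary should contain only `torch.float32`.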

## Ablation chain

| Run | Architecture | Val Loss | Perplexity |
|-----|--------------|----------|------------|
| run_01 | Transformer + BPE + FP16 (baseline) | 4.66 | 106.1 |
| run_02a | Transformer + Byte + 4 heads (no LBD) | 2.31 | 10.1 |
| run_02 | Transformer + Byte + LocalByteDecoder | 1.72 | 5.56 |
| run_03 | MLGRU + Byte + FP16 | 1.87 | 6.49 |
| run_04 | MLGRU + Byte + Ternary | 5.57 | 261.7 |
| run_05 | + FPResidual | 5.55 | 257.7 |
| run_05b | MLGRU kernel strict | 5.59 | 268.8 |
| run_06 | + BolmoPatchEmbedding | 5.56 | 258.8 |
| run_07 | + DeleteGate (CPU-1 complete) | 5.56 | 258.8 |
| run_10 | + learned per-channel decay | 5.53 | 253.1 |
| run_13 | Small 10M BPE model | 30.5 | — |

> **Note**: the ternary runs (04–10) were trained on only 2 tokens per parameter
> (the ablation budget). Their high perplexity reflects under-training, not an
> architecture failure. The FP16 runs (01–03) are valid references.
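
The Perplexity column appears to be the exponential of the validation loss, with the loss measured in nats. The README does not state this, so treat it as an assumption. The sketch below recomputes perplexity from the rounded losses in the table and compares it with the reported values; small differences are expected because the losses are rounded to two decimals. run_13 is left out because no perplexity is reported for it.

```python
import math

# (run, val_loss, reported_perplexity), copied from the ablation table.
RUNS = [
    ("run_01", 4.66, 106.1),
    ("run_02a", 2.31, 10.1),
    ("run_02", 1.72, 5.56),
    ("run_03", 1.87, 6.49),
    ("run_04", 5.57, 261.7),
    ("run_05", 5.55, 257.7),
    ("run_05b", 5.59, 268.8),
    ("run_06", 5.56, 258.8),
    ("run_07", 5.56, 258.8),
    ("run_10", 5.53, 253.1),
]


def perplexity(val_loss):
    """Perplexity under the assumption ppl = exp(loss), loss in nats."""
    return math.exp(val_loss)


def relative_difference(val_loss, reported):
    """Relative gap between the recomputed and reported perplexity."""
    return abs(perplexity(val_loss) - reported) / reported


for run, loss, reported in RUNS:
    print(f"{run:8s} exp({loss:.2f}) = {perplexity(loss):7.2f}  "
          f"reported {reported:7.2f}  "
          f"diff {100 * relative_difference(loss, reported):.2f}%")
```

Note that token-level perplexity is not directly comparable between BPE and byte-level runs, since the two predict units of different sizes.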