Cukinator/cpu1-ablations-final


Cukinator committed on May 14

Commit b93a0df · verified · 1 Parent(s): 1ceac71

Add README

Files changed (1)

README.md +35 -15


@@ -1,12 +1,32 @@
 # CPU-1 Ablation Study — Ready-to-use Checkpoints
 Repo: `Cukinator/cpu1-ablations-final`
 Source (compact 2-bit): `Cukinator/cpu1-ablation-checkpoints`
-Each checkpoint is a standard PyTorch `.pt` file with float32 weights — no
-unpacking needed. Compatible with `train_ablation.py` from the
 [1.58bits repo](https://github.com/Cukinator/1.58bits).
 ## Quick start
 ```python
@@ -25,19 +45,19 @@ print(text)
 ## Ablation chain
-| Run | Architecture | Val Loss | Perplexity |
-|-----|-------------|----------|-----------|
-| run_01 | Transformer + BPE + FP16 (baseline) | 4.66 | 106.1 |
-| run_02a | Transformer + Byte + 4 heads (no LBD) | 2.31 | 10.1 |
-| run_02 | Transformer + Byte + LocalByteDecoder | 1.72 | 5.56 |
-| run_03 | MLGRU + Byte + FP16 | 1.87 | 6.49 |
-| run_04 | MLGRU + Byte + Ternary | 5.57 | 261.7 |
-| run_05 | + FPResidual | 5.55 | 257.7 |
-| run_05b | MLGRU kernel strict | 5.59 | 268.8 |
-| run_06 | + BolmoPatchEmbedding | 5.56 | 258.8 |
-| run_07 | + DeleteGate (CPU-1 complete) | 5.56 | 258.8 |
-| run_10 | + learned per-channel decay | 5.53 | 253.1 |
-| run_13 | Small 10M BPE model | 30.5 | — |
 > **Note**: ternary runs (04–10) were trained with only 2 tokens/param
 > (ablation budget). High perplexity reflects under-training, not architecture

+---
+language:
+  - en
+license: apache-2.0
+tags:
+  - pytorch
+  - text-generation
+  - 1.58bit
+  - ternary
+  - byte-level
+  - mlgru
+  - ablation
+library_name: pytorch
+pipeline_tag: text-generation
+model_type: custom
+---
 # CPU-1 Ablation Study — Ready-to-use Checkpoints
 Repo: `Cukinator/cpu1-ablations-final`
 Source (compact 2-bit): `Cukinator/cpu1-ablation-checkpoints`
+Each checkpoint is a standard PyTorch `.pt` file with **float32 weights** —
+no unpacking needed. Compatible with `train_ablation.py` from the
 [1.58bits repo](https://github.com/Cukinator/1.58bits).
+11 ablation runs, most adding one component to the previous run.
+Most runs are **36M to 39M parameters**; the BPE baseline run_01 is 54.7M and run_13 is a small 12.5M variant.
 ## Quick start
 ```python
 ## Ablation chain
+| Run | Architecture | Params | Val Loss | Perplexity |
+|-----|-------------|--------|----------|-----------|
+| run_01 | Transformer + BPE + FP16 (baseline) | 54.7M | 4.66 | 106.1 |
+| run_02a | Transformer + Byte + 4 heads (no LBD) | 38.5M | 2.31 | 10.1 |
+| run_02 | Transformer + Byte + LocalByteDecoder | 38.8M | 1.72 | 5.56 |
+| run_03 | MLGRU + Byte + FP16 | 38.8M | 1.87 | 6.49 |
+| run_04 | MLGRU + Byte + Ternary | 38.9M | 5.57 | 261.7 |
+| run_05 | + FPResidual | 39.0M | 5.55 | 257.7 |
+| run_05b | MLGRU kernel strict | 35.8M | 5.59 | 268.8 |
+| run_06 | + BolmoPatchEmbedding | 39.0M | 5.56 | 258.8 |
+| run_07 | + DeleteGate (CPU-1 complete) | 39.0M | 5.56 | 258.8 |
+| run_10 | + learned per-channel decay | 39.4M | 5.53 | 253.1 |
+| run_13 | Small 10M BPE model | 12.5M | 30.5 | — |
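
The Perplexity column appears to be the exponential of the Val Loss column. The sketch below checks that relationship for a few rows. It assumes the reported loss is mean cross-entropy in nats per predicted unit, which the README does not state; the `perplexity` helper and the `val_loss` dictionary are illustrative names and not part of `train_ablation.py`.

```python
import math

# Validation losses copied from the table above.
# Assumption: mean cross-entropy in nats per predicted unit
# (byte or BPE token, depending on the run).
val_loss = {"run_02": 1.72, "run_03": 1.87, "run_04": 5.57}


def perplexity(loss_nats: float) -> float:
    """Convert a mean cross-entropy loss (in nats) to perplexity."""
    return math.exp(loss_nats)


for run, loss in val_loss.items():
    print(f"{run}: val_loss={loss:.2f} -> perplexity ~ {perplexity(loss):.2f}")
```

Small differences from the table are expected because the losses shown are rounded to two decimals. If the assumption holds, byte-level runs report perplexity per byte and the BPE baseline reports it per token, so the two are not directly comparable.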
 > **Note**: ternary runs (04–10) were trained with only 2 tokens/param
 > (ablation budget). High perplexity reflects under-training, not architecture