Update README.md

README.md
A compact **instruction-tuned** language model using **Looped Transformer + Value Residual Learning**.
Trained with ChatML format for conversational AI and tool-calling capabilities.

**Most compute-efficient model in its weight class**: trained on only ~2B tokens, it outperforms models trained on 20–150x more data.

## Quick Start
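The Quick Start code itself is not shown in this diff. As an illustration of the ChatML format the model card mentions, here is a minimal, hypothetical prompt builder (the helper name and messages are illustrative only, not the repository's actual Quick Start code):

```python
# Hypothetical sketch: render chat messages in ChatML format.
# Not the repository's actual Quick Start code.

def build_chatml_prompt(messages):
    """Render a list of {role, content} dicts as a ChatML prompt string."""
    parts = []
    for msg in messages:
        parts.append(f"<|im_start|>{msg['role']}\n{msg['content']}<|im_end|>\n")
    # Leave the assistant turn open so the model generates the reply.
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)

prompt = build_chatml_prompt([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is a looped transformer?"},
])
print(prompt)
```

The resulting string can be tokenized and passed to `model.generate` as in any transformers-style workflow.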

---

## Benchmark Comparison

### Zero-Shot Performance vs All Sub-200M Models

| Model | Params | Training Data | HellaSwag | ARC-Challenge | PIQA | WinoGrande | MMLU | GSM8K |
|-------|--------|---------------|-----------|---------------|------|------------|------|-------|
| **Jeeves** | **95M** | **~2B tokens** | **33.5%** | **26.8%** | **64.8%** | **52.4%** | **25.3%** | **1.7%** |
| Cerebras-GPT | 111M | ~2.6B tokens | 26.8% | 16.6% | 59.4% | 48.8% | – | – |
| OPT | 125M | 180B tokens | 29.2% | 22.9% | ~62% | 51.6% | 26.0% | 0.2% |
| GPT-Neo | 125M | 300B tokens | 30.3% | 22.9% | – | 51.8% | 26.0% | 0.3% |
| SmolLM | 135M | 600B tokens | 41.2% | – | 68.4% | 51.3% | 30.2% | 1.0% |
| SmolLM2 | 135M | 2T tokens | 42.1% | – | 68.4% | 51.3% | 31.5% | 1.4% |
| GPT-2 | 137M | ~40B tokens | 31.5% | – | – | 50.4% | 25.8% | 0.7% |
| Pythia | 160M | 300B tokens | 29.3% | 18.1% | 62.7% | 51.9% | – | – |

### Models Jeeves Outperforms (with fewer parameters & less data)

**vs Cerebras-GPT 111M** (17% more params, similar data budget):
- Jeeves wins on ALL shared benchmarks: HellaSwag +6.7pp, ARC-Challenge +10.2pp, PIQA +5.4pp, WinoGrande +3.6pp

**vs OPT-125M** (32% more params, 90x more training data):
- Jeeves wins: HellaSwag +4.3pp, ARC-Challenge +3.9pp, PIQA +2.8pp, WinoGrande +0.8pp, GSM8K +1.5pp

**vs GPT-Neo 125M** (32% more params, 150x more training data):
- Jeeves wins: HellaSwag +3.2pp, WinoGrande +0.6pp, GSM8K +1.4pp

**vs GPT-2 137M** (44% more params, 20x more training data):
- Jeeves wins: HellaSwag +2.0pp, WinoGrande +2.0pp, GSM8K +1.0pp

**vs Pythia 160M** (68% more params, 150x more training data):
- Jeeves wins on ALL shared benchmarks: HellaSwag +4.2pp, ARC-Challenge +8.7pp, PIQA +2.1pp, WinoGrande +0.5pp

### Models That Beat Jeeves

**SmolLM-135M** and **SmolLM2-135M** outperform Jeeves on HellaSwag, PIQA, and MMLU, but were trained on **600B and 2T tokens** respectively (300–1000x more data) using **64 H100 GPUs**. Jeeves was trained on ~2B tokens.

### Training Efficiency

| Model | Params | Training Tokens | HellaSwag per B tokens |
|-------|--------|-----------------|------------------------|
| **Jeeves** | **95M** | **~2B** | **16.75** |
| OPT-125M | 125M | 180B | 0.16 |
| GPT-Neo 125M | 125M | 300B | 0.10 |
| SmolLM2-135M | 135M | 2,000B | 0.02 |
| Pythia 160M | 160M | 300B | 0.10 |

Jeeves achieves **100–800x better benchmark-per-token efficiency** than comparable models, demonstrating that architectural innovation (looped transformers + value residual learning) can dramatically reduce the data and compute needed to reach competitive performance.
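The efficiency column above is simply the HellaSwag score divided by training tokens in billions. A quick sketch reproducing the table's numbers:

```python
# Reproduce the "HellaSwag per B tokens" column: score / training tokens (in billions).
models = {
    "Jeeves":       (33.5, 2),      # (HellaSwag %, training tokens in B)
    "OPT-125M":     (29.2, 180),
    "GPT-Neo 125M": (30.3, 300),
    "SmolLM2-135M": (42.1, 2000),
    "Pythia 160M":  (29.3, 300),
}
efficiency = {name: score / tokens_b for name, (score, tokens_b) in models.items()}
for name, value in efficiency.items():
    print(f"{name}: {value:.2f}")
```

Note this is a rough proxy: it ignores that benchmark scores do not scale linearly with training tokens.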

---

## Architecture

Jeeves uses a **Looped Transformer**: a single middle block is run multiple times with input injection, giving an effective depth much larger than the unique parameter count.

```
Input → [Early Layers 0-10] → [Loop Block 11 × 6 iters] → [Late Layers 12-21] → Output
                                   +----------+ (input injection)
```

Each loop iteration reuses the **same weights**, so the model gets 27 effective layers of processing with only 22 unique layer parameter sets.
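The control flow can be sketched in plain Python. This toy forward pass uses identity stand-ins for the real layers and only illustrates the loop structure; the exact injection point (here, adding the block's saved input at each iteration) is an assumption, and value residual learning is not modeled:

```python
# Toy sketch of the looped forward pass (control flow only, not the real model).
NUM_LAYERS, LOOP_LAYER, LOOP_ITERS = 22, 11, 6

def forward(x, layers):
    """layers: list of 22 callables, one per unique layer parameter set."""
    applications = 0
    for i in range(LOOP_LAYER):                  # early layers 0-10
        x = layers[i](x)
        applications += 1
    loop_input = x                               # saved for input injection
    for _ in range(LOOP_ITERS):                  # block 11, same weights, 6 times
        x = layers[LOOP_LAYER](x + loop_input)   # assumed injection: add saved input
        applications += 1
    for i in range(LOOP_LAYER + 1, NUM_LAYERS):  # late layers 12-21
        x = layers[i](x)
        applications += 1
    return x, applications

# 22 unique "layers" (identity stand-ins) yield 27 effective layer applications.
_, depth = forward(0.0, [lambda x: x] * NUM_LAYERS)
print(depth)  # 27
```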

| Component | Value |
|---|---|
| Parameters | 96.3M (unique) |
| Effective depth | 27 layers (via looping) |
| Unique layers | 22 |
| Loop config | block[11] × 6 iterations |
| Value residual | ✓ |
| Hidden dim | 576 |
| FFN dim | 1,536 |

## Limitations

- **96M parameters**: this is a small research model, not a production system
- SmolLM/SmolLM2 (135M) achieve higher absolute scores with 300–1000x more training data
- May hallucinate facts, especially for complex math or rare knowledge
- Repetition in longer outputs is common at this scale
- Best suited for simple Q&A, short-form generation, and research into efficient architectures