Anurich committed on
Commit 08c5f05 · verified · 1 Parent(s): a0147f6

Update README.md

Files changed (1)
  1. README.md +43 -42
README.md CHANGED
@@ -16,7 +16,7 @@ license: apache-2.0
  A compact **instruction-tuned** language model using **Looped Transformer + Value Residual Learning**.
  Trained with ChatML format for conversational AI and tool-calling capabilities.
 
- **#1 in its weight class** — outperforms all comparable models under 200M parameters.
 
  ## Quick Start
 
@@ -75,58 +75,59 @@ print(tokenizer.decode(outputs[0], skip_special_tokens=False))
 
  ---
 
- ## Benchmark Comparison: Jeeves 95M vs Other Language Models
 
- ### Zero-Shot Performance
 
- | Model | Params | HellaSwag | ARC-Easy | ARC-Challenge | PIQA | WinoGrande | MMLU | TruthfulQA | GSM8K |
- |-------|--------|-----------|----------|---------------|------|------------|------|------------|-------|
- | **Jeeves** | **95M** | **33.5%** | **47.1%** | **26.8%** | **64.8%** | **52.4%** | **25.3%** | **25.1%** | **1.7%** |
- | Cerebras-GPT | 111M | 26.8% | 38.0% | 16.6% | 59.4% | 48.8% | - | - | - |
- | GPT-2 | 137M | 31.5% | 22.0%* | - | - | 50.4% | 25.8% | 40.7% | 0.7% |
- | Pythia | 160M | 29.3% | 45.2% | 18.1% | 62.7% | 51.9% | - | - | - |
- | Cerebras-GPT | 256M | 27.4% | 41.0% | 17.0% | 61.3% | 51.1% | - | - | - |
- | Pythia | 410M | 33.3% | 50.4% | 21.3% | 66.8% | 53.0% | - | - | - |
- | **Larger Models** | | | | | | | | | |
- | LLaMA | 7B | 76.1% | 70.1% | 47.6% | 76.5% | 70.1% | - | - | - |
- | GPT-3.5 | 175B | 85.5% | 85.2% | - | - | 81.6% | 70.0% | 47.0% | 57.1% |
- | GPT-4 | ~1.7T | 95.3% | 96.3% | - | - | 87.5% | 86.4% | 59.0% | 97.0% |
 
- ### Models Jeeves Outperforms
 
- **vs Cerebras-GPT 111M** (17% more params) — Jeeves wins on ALL benchmarks:
- - HellaSwag +6.7pp, ARC-Easy +9.1pp, ARC-Challenge +10.2pp, PIQA +5.4pp, WinoGrande +3.6pp
 
- **vs GPT-2 137M** (44% more params) — Jeeves wins on 4/6 comparable benchmarks:
- - HellaSwag +2.0pp, ARC-Easy +25.1pp, WinoGrande +2.0pp, GSM8K +1.0pp
 
- **vs Pythia 160M** (68% more params) — Jeeves wins on ALL benchmarks:
- - HellaSwag +4.2pp, ARC-Easy +1.9pp, ARC-Challenge +8.7pp, PIQA +2.1pp, WinoGrande +0.5pp
 
- ### Key Strengths
 
- - **PIQA (64.8%)** — strongest benchmark; better than all models under 256M params
- - **WinoGrande (52.4%)** — excellent commonsense reasoning for its size
- - **ARC-Easy (47.1%)** — beats Cerebras-111M by a large margin
- - **Parameter Efficiency** — achieves 111M-level performance with only 95M params (14% fewer)
- - **Punches above weight** — consistently beats models 1.7–4.3× larger
 
- ### Competition Standing (sub-200M class)
 
- | Opponent | Result |
- |---|---|
- | Cerebras-GPT 111M | 🥇 Jeeves wins all |
- | GPT-2 137M | 🥇 Jeeves wins 4/6 |
- | Pythia 160M | 🥇 Jeeves wins all |
 
- **Jeeves 95M is the strongest model in its weight class.**
 
  ---
 
  ## Architecture
 
- Jeeves uses a **Looped Transformer** — a single middle block is run multiple times
- with input injection, giving effective depth much larger than the unique parameter count.
 
  ```
  Input → [Early Layers 0-10] → [Loop Block 11 × 6 iters] → [Late Layers 12-21] → Output
@@ -134,15 +135,14 @@ Input → [Early Layers 0-10] → [Loop Block 11 × 6 iters] → [Late Layers 12
  +----------+ (input injection)
  ```
 
- Each loop iteration reuses the **same weights**, so the model gets 27 effective layers
- of processing with only 22 unique layer parameter sets.
 
  | Component | Value |
  |---|---|
- | Parameters | 96.3M |
  | Unique layers | 22 |
- | Effective depth | 27 (via looping) |
- | Loop config | block[11] × 6 |
  | Value residual | ✅ |
  | Hidden dim | 576 |
  | FFN dim | 1,536 |
@@ -180,6 +180,7 @@ of processing with only 22 unique layer parameter sets.
  ## Limitations
 
  - **96M parameters** — this is a small research model, not a production system
  - May hallucinate facts, especially for complex math or rare knowledge
  - Repetition in longer outputs is common at this scale
  - Best suited for simple Q&A, short-form generation, and research into efficient architectures
 
  A compact **instruction-tuned** language model using **Looped Transformer + Value Residual Learning**.
  Trained with ChatML format for conversational AI and tool-calling capabilities.
 
+ **Most compute-efficient model in its weight class** — trained on only ~2B tokens, it outperforms models trained on 20–150x more data.
 
  ## Quick Start
 
 
  ---
 
+ ## Benchmark Comparison
 
+ ### Zero-Shot Performance vs Other Sub-200M Models
 
+ | Model | Params | Training Data | HellaSwag | ARC-Challenge | PIQA | WinoGrande | MMLU | GSM8K |
+ |-------|--------|---------------|-----------|---------------|------|------------|------|-------|
+ | **Jeeves** | **95M** | **~2B tokens** | **33.5%** | **26.8%** | **64.8%** | **52.4%** | **25.3%** | **1.7%** |
+ | Cerebras-GPT | 111M | ~2.6B tokens | 26.8% | 16.6% | 59.4% | 48.8% | — | — |
+ | OPT | 125M | 180B tokens | 29.2% | 22.9% | ~62% | 51.6% | 26.0% | 0.2% |
+ | GPT-Neo | 125M | 300B tokens | 30.3% | 22.9% | — | 51.8% | 26.0% | 0.3% |
+ | SmolLM | 135M | 600B tokens | 41.2% | — | 68.4% | 51.3% | 30.2% | 1.0% |
+ | SmolLM2 | 135M | 2T tokens | 42.1% | — | 68.4% | 51.3% | 31.5% | 1.4% |
+ | GPT-2 | 137M | ~40B tokens | 31.5% | — | — | 50.4% | 25.8% | 0.7% |
+ | Pythia | 160M | 300B tokens | 29.3% | 18.1% | 62.7% | 51.9% | — | — |
 
+ ### Models Jeeves Outperforms (with fewer parameters & less data)
 
+ **vs Cerebras-GPT 111M** (17% more params, similar data budget):
+ - Jeeves wins on ALL shared benchmarks: HellaSwag +6.7pp, ARC-Challenge +10.2pp, PIQA +5.4pp, WinoGrande +3.6pp
 
+ **vs OPT-125M** (32% more params, 90x more training data):
+ - Jeeves wins: HellaSwag +4.3pp, ARC-Challenge +3.9pp, PIQA +2.8pp, WinoGrande +0.8pp, GSM8K +1.5pp
 
+ **vs GPT-Neo 125M** (32% more params, 150x more training data):
+ - Jeeves wins: HellaSwag +3.2pp, WinoGrande +0.6pp, GSM8K +1.4pp
 
+ **vs GPT-2 137M** (44% more params, 20x more training data):
+ - Jeeves wins: HellaSwag +2.0pp, WinoGrande +2.0pp, GSM8K +1.0pp
 
+ **vs Pythia 160M** (68% more params, 150x more training data):
+ - Jeeves wins on ALL shared benchmarks: HellaSwag +4.2pp, ARC-Challenge +8.7pp, PIQA +2.1pp, WinoGrande +0.5pp
 
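The percentage-point margins above can be re-derived directly from the zero-shot table. A minimal sanity-check sketch (scores copied from the table; OPT-125M's PIQA entry is listed as "~62%" and is taken here as 62.0):

```python
# Zero-shot scores (%) copied from the benchmark table above.
# OPT-125M's PIQA entry is approximate ("~62%"), taken here as 62.0.
jeeves = {"HellaSwag": 33.5, "ARC-Challenge": 26.8, "PIQA": 64.8,
          "WinoGrande": 52.4, "GSM8K": 1.7}
opt_125m = {"HellaSwag": 29.2, "ARC-Challenge": 22.9, "PIQA": 62.0,
            "WinoGrande": 51.6, "GSM8K": 0.2}

# Margin of Jeeves over OPT-125M in percentage points (pp) per shared benchmark.
margins = {task: round(jeeves[task] - opt_125m[task], 1) for task in opt_125m}
print(margins)
# {'HellaSwag': 4.3, 'ARC-Challenge': 3.9, 'PIQA': 2.8, 'WinoGrande': 0.8, 'GSM8K': 1.5}
```

Swapping in any other baseline row from the table reproduces the margins quoted for that model.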
+ ### Models That Beat Jeeves
 
+ **SmolLM-135M** and **SmolLM2-135M** outperform Jeeves on HellaSwag, PIQA, and MMLU — but were trained on **600B and 2T tokens** respectively (300–1000x more data) using **64 H100 GPUs**. Jeeves was trained on ~2B tokens.
+
+ ### Training Efficiency
+
+ | Model | Params | Training Tokens | HellaSwag % per B training tokens |
+ |-------|--------|-----------------|-----------------------------------|
+ | **Jeeves** | **95M** | **~2B** | **16.75** |
+ | OPT-125M | 125M | 180B | 0.16 |
+ | GPT-Neo 125M | 125M | 300B | 0.10 |
+ | SmolLM2-135M | 135M | 2,000B | 0.02 |
+ | Pythia 160M | 160M | 300B | 0.10 |
 
+ Jeeves achieves **100–800x better benchmark-per-token efficiency** than comparable models, demonstrating that architectural innovation (looped transformers + value residual learning) can dramatically reduce the data and compute needed to reach competitive performance.
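The efficiency column is simply HellaSwag score divided by training tokens; a quick check of the numbers (values copied from the tables above) recovers both the column and the quoted 100–800x range:

```python
# HellaSwag score (%) and training-token count (billions), from the tables above.
models = {
    "Jeeves":       (33.5, 2),
    "OPT-125M":     (29.2, 180),
    "GPT-Neo 125M": (30.3, 300),
    "SmolLM2-135M": (42.1, 2000),
    "Pythia 160M":  (29.3, 300),
}

# HellaSwag percentage points per billion training tokens.
efficiency = {name: score / tokens for name, (score, tokens) in models.items()}
print(efficiency["Jeeves"])  # 16.75

# How many times more benchmark-per-token efficient Jeeves is than each baseline.
ratios = {name: efficiency["Jeeves"] / eff
          for name, eff in efficiency.items() if name != "Jeeves"}
# Roughly 103x (vs OPT-125M) up to 796x (vs SmolLM2-135M): the 100-800x range above.
```

Note this is a crude metric: benchmark scores do not scale linearly with tokens, so it overstates the gap somewhat, but the ordering is robust.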
 
  ---
 
  ## Architecture
 
+ Jeeves uses a **Looped Transformer** — a single middle block is run multiple times with input injection, giving effective depth much larger than the unique parameter count.
 
  ```
  Input → [Early Layers 0-10] → [Loop Block 11 × 6 iters] → [Late Layers 12-21] → Output
 
  +----------+ (input injection)
  ```
 
+ Each loop iteration reuses the **same weights**, so the model gets 27 effective layers of processing with only 22 unique layer parameter sets.
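The control flow can be sketched in plain Python. This is an illustrative sketch, not the model code: the layers are stand-in callables, and the injection rule (adding the saved pre-loop activation back in at every iteration) is an assumption about how the input injection works here.

```python
def run_looped_stack(x, layers, loop_idx=11, loop_iters=6):
    """Apply a layer stack, repeating layers[loop_idx] with input injection.

    `layers` is a list of single-argument callables standing in for
    transformer blocks. Returns (output, effective_depth).
    """
    depth = 0
    # Early layers 0..loop_idx-1, applied once each.
    for layer in layers[:loop_idx]:
        x = layer(x)
        depth += 1
    injected = x  # saved pre-loop activation, fed back into every iteration
    # The loop block reuses the SAME weights for all iterations.
    for _ in range(loop_iters):
        x = layers[loop_idx](x + injected)  # input injection (assumed additive)
        depth += 1
    # Late layers loop_idx+1..end, applied once each.
    for layer in layers[loop_idx + 1:]:
        x = layer(x)
        depth += 1
    return x, depth

# 22 unique stand-in layers; block[11] looped 6 times.
layers = [lambda v: v + 1 for _ in range(22)]
_, depth = run_looped_stack(0, layers)
print(depth)  # 27
```

With 22 unique layers and block 11 run 6 times, the effective depth is 21 single-pass layers plus 6 loop iterations, i.e. the 27 effective layers quoted above.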
 
  | Component | Value |
  |---|---|
+ | Parameters | 96.3M (unique) |
+ | Effective depth | 27 layers (via looping) |
  | Unique layers | 22 |
+ | Loop config | block[11] × 6 iterations |
  | Value residual | ✅ |
  | Hidden dim | 576 |
  | FFN dim | 1,536 |
 
  ## Limitations
 
  - **96M parameters** — this is a small research model, not a production system
+ - SmolLM/SmolLM2 (135M) achieve higher absolute scores with 300–1000x more training data
  - May hallucinate facts, especially for complex math or rare knowledge
  - Repetition in longer outputs is common at this scale
  - Best suited for simple Q&A, short-form generation, and research into efficient architectures