Anurich committed on
Commit 08c5f05 · verified · 1 Parent(s): a0147f6

Update README.md

Files changed (1)
  1. README.md +43 -42
README.md CHANGED
@@ -16,7 +16,7 @@ license: apache-2.0
  A compact **instruction-tuned** language model using **Looped Transformer + Value Residual Learning**.
  Trained with ChatML format for conversational AI and tool-calling capabilities.
 
- **#1 in its weight class** — outperforms all comparable models under 200M parameters.
 
  ## Quick Start
 
@@ -75,58 +75,59 @@ print(tokenizer.decode(outputs[0], skip_special_tokens=False))
 
  ---
 
- ## Benchmark Comparison: Jeeves 95M vs Other Language Models
 
- ### Zero-Shot Performance
 
- | Model | Params | HellaSwag | ARC-Easy | ARC-Challenge | PIQA | WinoGrande | MMLU | TruthfulQA | GSM8K |
- |-------|--------|-----------|----------|---------------|------|------------|------|------------|-------|
- | **Jeeves** | **95M** | **33.5%** | **47.1%** | **26.8%** | **64.8%** | **52.4%** | **25.3%** | **25.1%** | **1.7%** |
- | Cerebras-GPT | 111M | 26.8% | 38.0% | 16.6% | 59.4% | 48.8% | - | - | - |
- | GPT-2 | 137M | 31.5% | 22.0%* | - | - | 50.4% | 25.8% | 40.7% | 0.7% |
- | Pythia | 160M | 29.3% | 45.2% | 18.1% | 62.7% | 51.9% | - | - | - |
- | Cerebras-GPT | 256M | 27.4% | 41.0% | 17.0% | 61.3% | 51.1% | - | - | - |
- | Pythia | 410M | 33.3% | 50.4% | 21.3% | 66.8% | 53.0% | - | - | - |
- | **Larger Models** | | | | | | | | | |
- | LLaMA | 7B | 76.1% | 70.1% | 47.6% | 76.5% | 70.1% | - | - | - |
- | GPT-3.5 | 175B | 85.5% | 85.2% | - | - | 81.6% | 70.0% | 47.0% | 57.1% |
- | GPT-4 | ~1.7T | 95.3% | 96.3% | - | - | 87.5% | 86.4% | 59.0% | 97.0% |
 
- ### Models Jeeves Outperforms
 
- **vs Cerebras-GPT 111M** (17% more params) — Jeeves wins on ALL benchmarks:
- - HellaSwag +6.7pp, ARC-Easy +9.1pp, ARC-Challenge +10.2pp, PIQA +5.4pp, WinoGrande +3.6pp
 
- **vs GPT-2 137M** (44% more params) — Jeeves wins on 4/6 comparable benchmarks:
- - HellaSwag +2.0pp, ARC-Easy +25.1pp, WinoGrande +2.0pp, GSM8K +1.0pp
 
- **vs Pythia 160M** (68% more params) — Jeeves wins on ALL benchmarks:
- - HellaSwag +4.2pp, ARC-Easy +1.9pp, ARC-Challenge +8.7pp, PIQA +2.1pp, WinoGrande +0.5pp
 
- ### Key Strengths
 
- - **PIQA (64.8%)** — strongest benchmark; better than all models under 256M params
- - **WinoGrande (52.4%)** — excellent commonsense reasoning for its size
- - **ARC-Easy (47.1%)** — beats Cerebras-111M by a large margin
- - **Parameter Efficiency** — achieves 111M-level performance with only 95M params (14% fewer)
- - **Punches above weight** — consistently beats models 1.7–4.3× larger
 
- ### Competition Standing (sub-200M class)
 
- | Opponent | Result |
- |---|---|
- | Cerebras-GPT 111M | 🥇 Jeeves wins all |
- | GPT-2 137M | 🥇 Jeeves wins 4/6 |
- | Pythia 160M | 🥇 Jeeves wins all |
 
- **Jeeves 95M is the strongest model in its weight class.**
 
  ---
 
  ## Architecture
 
- Jeeves uses a **Looped Transformer** — a single middle block is run multiple times
- with input injection, giving effective depth much larger than the unique parameter count.
 
  ```
  Input → [Early Layers 0-10] → [Loop Block 11 × 6 iters] → [Late Layers 12-21] → Output
@@ -134,15 +135,14 @@ Input → [Early Layers 0-10] → [Loop Block 11 × 6 iters] → [Late Layers 12
  +----------+ (input injection)
  ```
 
- Each loop iteration reuses the **same weights**, so the model gets 27 effective layers
- of processing with only 22 unique layer parameter sets.
 
  | Component | Value |
  |---|---|
- | Parameters | 96.3M |
  | Unique layers | 22 |
- | Effective depth | 27 (via looping) |
- | Loop config | block[11] × 6 |
  | Value residual | ✅ |
  | Hidden dim | 576 |
  | FFN dim | 1,536 |
@@ -180,6 +180,7 @@ of processing with only 22 unique layer parameter sets.
  ## Limitations
 
  - **96M parameters** — this is a small research model, not a production system
  - May hallucinate facts, especially for complex math or rare knowledge
  - Repetition in longer outputs is common at this scale
  - Best suited for simple Q&A, short-form generation, and research into efficient architectures
 
  A compact **instruction-tuned** language model using **Looped Transformer + Value Residual Learning**.
  Trained with ChatML format for conversational AI and tool-calling capabilities.
 
+ **Most compute-efficient model in its weight class** — trained on only ~2B tokens, it outperforms models trained on 20–150x more data.
 
  ## Quick Start
 
 
  ---
 
+ ## Benchmark Comparison
 
+ ### Zero-Shot Performance vs Other Sub-200M Models
 
+ | Model | Params | Training Data | HellaSwag | ARC-Challenge | PIQA | WinoGrande | MMLU | GSM8K |
+ |-------|--------|---------------|-----------|---------------|------|------------|------|-------|
+ | **Jeeves** | **95M** | **~2B tokens** | **33.5%** | **26.8%** | **64.8%** | **52.4%** | **25.3%** | **1.7%** |
+ | Cerebras-GPT | 111M | ~2.6B tokens | 26.8% | 16.6% | 59.4% | 48.8% | — | — |
+ | OPT | 125M | 180B tokens | 29.2% | 22.9% | ~62% | 51.6% | 26.0% | 0.2% |
+ | GPT-Neo | 125M | 300B tokens | 30.3% | 22.9% | — | 51.8% | 26.0% | 0.3% |
+ | SmolLM | 135M | 600B tokens | 41.2% | — | 68.4% | 51.3% | 30.2% | 1.0% |
+ | SmolLM2 | 135M | 2T tokens | 42.1% | — | 68.4% | 51.3% | 31.5% | 1.4% |
+ | GPT-2 | 137M | ~40B tokens | 31.5% | — | — | 50.4% | 25.8% | 0.7% |
+ | Pythia | 160M | 300B tokens | 29.3% | 18.1% | 62.7% | 51.9% | — | — |
 
+ ### Models Jeeves Outperforms (with fewer parameters & less data)
 
+ **vs Cerebras-GPT 111M** (17% more params, similar data budget):
+ - Jeeves wins on ALL shared benchmarks: HellaSwag +6.7pp, ARC-Challenge +10.2pp, PIQA +5.4pp, WinoGrande +3.6pp
 
+ **vs OPT-125M** (32% more params, 90x more training data):
+ - Jeeves wins: HellaSwag +4.3pp, ARC-Challenge +3.9pp, PIQA +2.8pp, WinoGrande +0.8pp, GSM8K +1.5pp
 
+ **vs GPT-Neo 125M** (32% more params, 150x more training data):
+ - Jeeves wins: HellaSwag +3.2pp, WinoGrande +0.6pp, GSM8K +1.4pp
 
+ **vs GPT-2 137M** (44% more params, 20x more training data):
+ - Jeeves wins: HellaSwag +2.0pp, WinoGrande +2.0pp, GSM8K +1.0pp
 
+ **vs Pythia 160M** (68% more params, 150x more training data):
+ - Jeeves wins on ALL shared benchmarks: HellaSwag +4.2pp, ARC-Challenge +8.7pp, PIQA +2.1pp, WinoGrande +0.5pp
 
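The percentage-point margins above can be re-derived directly from the zero-shot table. A minimal sanity-check sketch (scores copied from the table; OPT-125M's PIQA entry is listed as "~62%" and is taken here as 62.0):

```python
# Zero-shot scores (%) copied from the benchmark table above.
# OPT-125M's PIQA entry is approximate ("~62%"), taken here as 62.0.
jeeves = {"HellaSwag": 33.5, "ARC-Challenge": 26.8, "PIQA": 64.8,
          "WinoGrande": 52.4, "GSM8K": 1.7}
opt_125m = {"HellaSwag": 29.2, "ARC-Challenge": 22.9, "PIQA": 62.0,
            "WinoGrande": 51.6, "GSM8K": 0.2}

# Margin of Jeeves over OPT-125M in percentage points (pp) per shared benchmark.
margins = {task: round(jeeves[task] - opt_125m[task], 1) for task in opt_125m}
print(margins)
# {'HellaSwag': 4.3, 'ARC-Challenge': 3.9, 'PIQA': 2.8, 'WinoGrande': 0.8, 'GSM8K': 1.5}
```

Swapping in any other baseline row from the table reproduces the margins quoted for that model.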
+ ### Models That Beat Jeeves
 
+ **SmolLM-135M** and **SmolLM2-135M** outperform Jeeves on HellaSwag, PIQA, and MMLU — but were trained on **600B and 2T tokens** respectively (300–1000x more data) using **64 H100 GPUs**. Jeeves was trained on ~2B tokens.
+
+ ### Training Efficiency
+
+ | Model | Params | Training Tokens | HellaSwag % per B training tokens |
+ |-------|--------|-----------------|-----------------------------------|
+ | **Jeeves** | **95M** | **~2B** | **16.75** |
+ | OPT-125M | 125M | 180B | 0.16 |
+ | GPT-Neo 125M | 125M | 300B | 0.10 |
+ | SmolLM2-135M | 135M | 2,000B | 0.02 |
+ | Pythia 160M | 160M | 300B | 0.10 |
 
+ Jeeves achieves **100–800x better benchmark-per-token efficiency** than comparable models, demonstrating that architectural innovation (looped transformers + value residual learning) can dramatically reduce the data and compute needed to reach competitive performance.
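The efficiency column is simply HellaSwag score divided by training tokens; a quick check of the numbers (values copied from the tables above) recovers both the column and the quoted 100–800x range:

```python
# HellaSwag score (%) and training-token count (billions), from the tables above.
models = {
    "Jeeves":       (33.5, 2),
    "OPT-125M":     (29.2, 180),
    "GPT-Neo 125M": (30.3, 300),
    "SmolLM2-135M": (42.1, 2000),
    "Pythia 160M":  (29.3, 300),
}

# HellaSwag percentage points per billion training tokens.
efficiency = {name: score / tokens for name, (score, tokens) in models.items()}
print(efficiency["Jeeves"])  # 16.75

# How many times more benchmark-per-token efficient Jeeves is than each baseline.
ratios = {name: efficiency["Jeeves"] / eff
          for name, eff in efficiency.items() if name != "Jeeves"}
# Roughly 103x (vs OPT-125M) up to 796x (vs SmolLM2-135M): the 100-800x range above.
```

Note this is a crude metric: benchmark scores do not scale linearly with tokens, so it overstates the gap somewhat, but the ordering is robust.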
 
  ---
 
  ## Architecture
 
+ Jeeves uses a **Looped Transformer** — a single middle block is run multiple times with input injection, giving effective depth much larger than the unique parameter count.
 
  ```
  Input → [Early Layers 0-10] → [Loop Block 11 × 6 iters] → [Late Layers 12-21] → Output
 
  +----------+ (input injection)
  ```
 
+ Each loop iteration reuses the **same weights**, so the model gets 27 effective layers of processing with only 22 unique layer parameter sets.
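The control flow can be sketched in plain Python. This is an illustrative sketch, not the model code: the layers are stand-in callables, and the injection rule (adding the saved pre-loop activation back in at every iteration) is an assumption about how the input injection works here.

```python
def run_looped_stack(x, layers, loop_idx=11, loop_iters=6):
    """Apply a layer stack, repeating layers[loop_idx] with input injection.

    `layers` is a list of single-argument callables standing in for
    transformer blocks. Returns (output, effective_depth).
    """
    depth = 0
    # Early layers 0..loop_idx-1, applied once each.
    for layer in layers[:loop_idx]:
        x = layer(x)
        depth += 1
    injected = x  # saved pre-loop activation, fed back into every iteration
    # The loop block reuses the SAME weights for all iterations.
    for _ in range(loop_iters):
        x = layers[loop_idx](x + injected)  # input injection (assumed additive)
        depth += 1
    # Late layers loop_idx+1..end, applied once each.
    for layer in layers[loop_idx + 1:]:
        x = layer(x)
        depth += 1
    return x, depth

# 22 unique stand-in layers; block[11] looped 6 times.
layers = [lambda v: v + 1 for _ in range(22)]
_, depth = run_looped_stack(0, layers)
print(depth)  # 27
```

With 22 unique layers and block 11 run 6 times, the effective depth is 21 single-pass layers plus 6 loop iterations, i.e. the 27 effective layers quoted above.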
 
  | Component | Value |
  |---|---|
+ | Parameters | 96.3M (unique) |
+ | Effective depth | 27 layers (via looping) |
  | Unique layers | 22 |
+ | Loop config | block[11] × 6 iterations |
  | Value residual | ✅ |
  | Hidden dim | 576 |
  | FFN dim | 1,536 |
 
  ## Limitations
 
  - **96M parameters** — this is a small research model, not a production system
+ - SmolLM/SmolLM2 (135M) achieve higher absolute scores with 300–1000x more training data
  - May hallucinate facts, especially for complex math or rare knowledge
  - Repetition in longer outputs is common at this scale
  - Best suited for simple Q&A, short-form generation, and research into efficient architectures