Jeeves (96M) – Looped Transformer
A compact instruction-tuned language model using Looped Transformer + Value Residual Learning. Trained with ChatML format for conversational AI and tool-calling capabilities.
Most compute-efficient model in its weight class: trained on only ~2B tokens, it outperforms models trained on 20–150x more data.
Quick Start
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Anurich/Jeeves-Small-95M", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("Anurich/Jeeves-Small-95M", trust_remote_code=True)
model.eval()

# Use ChatML format (recommended for best results)
prompt = "<|im_start|>user\nWhat is photosynthesis?<|im_end|>\n<|im_start|>assistant\n"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=200, do_sample=True, temperature=0.7, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=False))
```
Note: `trust_remote_code=True` is required.
Chat Format (ChatML)
This model was fine-tuned using ChatML format. For best results, structure prompts like:
```
<|im_start|>user
Your question here<|im_end|>
<|im_start|>assistant
```
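For programmatic use, the ChatML structure above can be rendered from a list of role/content messages. The helper below is a hypothetical illustration (not part of the model repo), assuming the same `<|im_start|>`/`<|im_end|>` markers used throughout this card:

```python
# Hypothetical helper: render a list of {"role", "content"} messages into the
# ChatML prompt format this model was fine-tuned on.
def build_chatml_prompt(messages, add_generation_prompt=True):
    prompt = ""
    for msg in messages:
        prompt += f"<|im_start|>{msg['role']}\n{msg['content']}<|im_end|>\n"
    if add_generation_prompt:
        # Open an assistant turn so the model completes it
        prompt += "<|im_start|>assistant\n"
    return prompt

print(build_chatml_prompt([{"role": "user", "content": "What is photosynthesis?"}]))
```

This produces exactly the prompt string used in the Quick Start example above.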
Multi-turn Conversation
```python
conversation = """<|im_start|>user
What is the speed of light?<|im_end|>
<|im_start|>assistant
The speed of light is approximately 299,792 kilometers per second.<|im_end|>
<|im_start|>user
How long does it take light to reach Earth from the Sun?<|im_end|>
<|im_start|>assistant
"""
inputs = tokenizer(conversation, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=False))
```
Example Outputs
| Prompt | Response |
|---|---|
| What is photosynthesis? | Photosynthesis is the process by which plants and other organisms use sunlight, water, and carbon dioxide to produce energy and produce oxygen. |
| What is the speed of light? | The speed of light is approximately 299,792 kilometers per second. |
| What are the three states of matter? | The three states of matter are: 1. Solid 2. Liquid 3. Gas. |
| How does a vaccine work? | A vaccine is a biological agent that is designed to protect the body from harmful pathogens, such as bacteria, viruses, and parasites. |
Benchmark Comparison
Zero-Shot Performance vs All Sub-200M Models
| Model | Params | Training Data | HellaSwag | ARC-Challenge | PIQA | WinoGrande | MMLU | GSM8K |
|---|---|---|---|---|---|---|---|---|
| Jeeves | 95M | ~2B tokens | 33.5% | 26.8% | 64.8% | 52.4% | 25.3% | 1.7% |
| Cerebras-GPT | 111M | ~2.6B tokens | 26.8% | 16.6% | 59.4% | 48.8% | – | – |
| OPT | 125M | 180B tokens | 29.2% | 22.9% | ~62% | 51.6% | 26.0% | 0.2% |
| GPT-Neo | 125M | 300B tokens | 30.3% | 22.9% | – | 51.8% | 26.0% | 0.3% |
| SmolLM | 135M | 600B tokens | 41.2% | – | 68.4% | 51.3% | 30.2% | 1.0% |
| SmolLM2 | 135M | 2T tokens | 42.1% | – | 68.4% | 51.3% | 31.5% | 1.4% |
| GPT-2 | 137M | ~40B tokens | 31.5% | – | – | 50.4% | 25.8% | 0.7% |
| Pythia | 160M | 300B tokens | 29.3% | 18.1% | 62.7% | 51.9% | – | – |
Models Jeeves Outperforms (with fewer parameters & less data)
vs Cerebras-GPT 111M (17% more params, similar data budget):
- Jeeves wins on ALL shared benchmarks: HellaSwag +6.7pp, ARC-Challenge +10.2pp, PIQA +5.4pp, WinoGrande +3.6pp
vs OPT-125M (32% more params, 90x more training data):
- Jeeves wins: HellaSwag +4.3pp, ARC-Challenge +3.9pp, PIQA +2.8pp, WinoGrande +0.8pp, GSM8K +1.5pp
vs GPT-Neo 125M (32% more params, 150x more training data):
- Jeeves wins: HellaSwag +3.2pp, WinoGrande +0.6pp, GSM8K +1.4pp
vs GPT-2 137M (44% more params, 20x more training data):
- Jeeves wins: HellaSwag +2.0pp, WinoGrande +2.0pp, GSM8K +1.0pp
vs Pythia 160M (68% more params, 150x more training data):
- Jeeves wins on ALL shared benchmarks: HellaSwag +4.2pp, ARC-Challenge +8.7pp, PIQA +2.1pp, WinoGrande +0.5pp
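The percentage-point deltas listed above follow directly from the zero-shot table. A quick arithmetic cross-check (scores in percent, taken from the table):

```python
# Cross-check the per-benchmark deltas against the zero-shot table above.
jeeves = {"hellaswag": 33.5, "arc_c": 26.8, "piqa": 64.8, "winogrande": 52.4}
cerebras = {"hellaswag": 26.8, "arc_c": 16.6, "piqa": 59.4, "winogrande": 48.8}
pythia = {"hellaswag": 29.3, "arc_c": 18.1, "piqa": 62.7, "winogrande": 51.9}

def deltas(a, b):
    # Percentage-point difference per shared benchmark, rounded to one decimal
    return {k: round(a[k] - b[k], 1) for k in a}

print(deltas(jeeves, cerebras))  # {'hellaswag': 6.7, 'arc_c': 10.2, 'piqa': 5.4, 'winogrande': 3.6}
print(deltas(jeeves, pythia))    # {'hellaswag': 4.2, 'arc_c': 8.7, 'piqa': 2.1, 'winogrande': 0.5}
```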
Models That Beat Jeeves
SmolLM-135M and SmolLM2-135M outperform Jeeves on HellaSwag, PIQA, and MMLU, but were trained on 600B and 2T tokens respectively (300–1000x more data) using 64 H100 GPUs. Jeeves was trained on ~2B tokens.
Training Efficiency
| Model | Params | Training Tokens | HellaSwag per B tokens |
|---|---|---|---|
| Jeeves | 95M | ~2B | 16.75 |
| OPT-125M | 125M | 180B | 0.16 |
| GPT-Neo 125M | 125M | 300B | 0.10 |
| SmolLM2-135M | 135M | 2,000B | 0.02 |
| Pythia 160M | 160M | 300B | 0.10 |
Jeeves achieves 100β800x better benchmark-per-token efficiency than comparable models, demonstrating that architecture innovation (looped transformers + value residual learning) can dramatically reduce the data and compute needed to reach competitive performance.
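The "HellaSwag per B tokens" column is simply the zero-shot HellaSwag score divided by training tokens in billions, which can be reproduced from the tables above:

```python
# Reproduce the efficiency column: HellaSwag score (%) / training tokens (B).
models = [
    ("Jeeves", 33.5, 2),
    ("OPT-125M", 29.2, 180),
    ("GPT-Neo 125M", 30.3, 300),
    ("SmolLM2-135M", 42.1, 2000),
    ("Pythia 160M", 29.3, 300),
]
efficiency = {name: round(score / tokens_b, 2) for name, score, tokens_b in models}
print(efficiency)  # Jeeves: 16.75, OPT-125M: 0.16, GPT-Neo 125M: 0.1, ...
```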
Architecture
Jeeves uses a Looped Transformer: a single middle block is run multiple times with input injection, giving an effective depth much larger than the unique parameter count.
```
Input → [Early Layers 0-10] → [Loop Block 11 × 6 iters] → [Late Layers 12-21] → Output
                                     ↑          |
                                     +----------+  (input injection)
```
Each loop iteration reuses the same weights, so the model gets 27 effective layers of processing with only 22 unique layer parameter sets.
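The control flow described above can be sketched schematically in pure Python. This is a toy illustration of the loop structure, not the actual implementation; the layer functions and the additive form of input injection here are stand-ins:

```python
# Schematic sketch of the looped forward pass: layers 0-10 run once, the shared
# block at index 11 runs 6 times with the pre-loop hidden state injected back,
# then layers 12-21 run once. Layer bodies are toy stand-ins.
def looped_forward(x, layers, loop_idx=11, loop_iters=6):
    calls = 0
    for layer in layers[:loop_idx]:           # early layers 0-10
        x = layer(x); calls += 1
    loop_input = x                            # saved for input injection
    for _ in range(loop_iters):               # same weights reused each iteration
        x = layers[loop_idx](x + loop_input); calls += 1
    for layer in layers[loop_idx + 1:]:       # late layers 12-21
        x = layer(x); calls += 1
    return x, calls

# 22 unique "layers" yield 27 effective layer applications: 11 + 6 + 10.
layers = [lambda v: v + 1 for _ in range(22)]
_, effective_depth = looped_forward(0, layers)
print(effective_depth)  # 27
```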
| Component | Value |
|---|---|
| Parameters | 96.3M (unique) |
| Effective depth | 27 layers (via looping) |
| Unique layers | 22 |
| Loop config | block[11] × 6 iterations |
| Value residual | ✓ |
| Hidden dim | 576 |
| FFN dim | 1,536 |
| Attention heads | 9 (Q) / 3 (KV) |
| Vocab size | 32,000 |
| Max seq length | 1,024 |
Key Innovations
- Looped Transformer (arXiv 2311.12424) – weight sharing via block looping for parameter efficiency
- Value Residual Learning (arXiv 2410.17897) – first-layer value residuals prevent representation collapse in deep/looped networks
- Input Injection – adds the pre-loop hidden state back during each loop iteration for training stability
- Grouped Query Attention – 9 query heads with 3 key-value heads for efficient inference
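The grouped-query attention layout follows directly from the head counts: with 9 query heads and 3 key-value heads, each KV head is shared by a group of 3 query heads. A minimal sketch of that mapping:

```python
# Grouped-query attention head layout: each KV head serves a contiguous group
# of query heads (group size = n_q_heads / n_kv_heads = 3 here).
n_q_heads, n_kv_heads = 9, 3
group_size = n_q_heads // n_kv_heads
mapping = {q: q // group_size for q in range(n_q_heads)}
print(mapping)  # {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1, 6: 2, 7: 2, 8: 2}
```

Sharing KV heads this way shrinks the KV cache by 3x relative to full multi-head attention, which is the main inference-efficiency win.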
Training Pipeline
| Stage | Data | Details |
|---|---|---|
| Pre-training | ~2B tokens | FineWeb-Edu, Cosmopedia, Python-Edu, OpenWebMath, StarCoder |
| Chat SFT | ChatML conversations | Instruction tuning for conversational ability |
| Tool SFT | Function-calling data | JSON tool calls with `<\|tool_call\|>` and `<\|tool_result\|>` markers |
Special Tokens
| Token | ID | Purpose |
|---|---|---|
| `<pad>` | 0 | Padding |
| `<s>` | 1 | Beginning of sequence |
| `</s>` | 2 | End of sequence |
| `<\|im_start\|>` | – | ChatML turn start |
| `<\|im_end\|>` | – | ChatML turn end |
| `<\|tool_call\|>` | – | Tool call marker |
| `<\|tool_result\|>` | – | Tool result marker |
Limitations
- 96M parameters β this is a small research model, not a production system
- SmolLM/SmolLM2 (135M) achieve higher absolute scores with 300–1000x more training data
- May hallucinate facts, especially for complex math or rare knowledge
- Repetition in longer outputs is common at this scale
- Best suited for simple Q&A, short-form generation, and research into efficient architectures
License
Apache 2.0
Citation
```bibtex
@misc{jeeves2026,
  title={Jeeves: Efficient Language Modeling with Looped Transformers and Value Residual Learning},
  author={Anurich},
  year={2026},
  url={https://huggingface.co/Anurich/Jeeves-Small-95M}
}
```