
Jeeves (96M) β€” Looped Transformer

A compact instruction-tuned language model using Looped Transformer + Value Residual Learning. Trained with ChatML format for conversational AI and tool-calling capabilities.

Most compute-efficient model in its weight class: trained on only ~2B tokens, it outperforms models trained on 20–150x more data.

Quick Start

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Anurich/Jeeves-Small-95M", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("Anurich/Jeeves-Small-95M", trust_remote_code=True)
model.eval()

# Use ChatML format (recommended for best results)
prompt = "<|im_start|>user\nWhat is photosynthesis?<|im_end|>\n<|im_start|>assistant\n"

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=200, do_sample=True, temperature=0.7, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=False))

Note: trust_remote_code=True is required.

Chat Format (ChatML)

This model was fine-tuned using ChatML format. For best results, structure prompts like:

<|im_start|>user
Your question here<|im_end|>
<|im_start|>assistant

Multi-turn Conversation

conversation = """<|im_start|>user
What is the speed of light?<|im_end|>
<|im_start|>assistant
The speed of light is approximately 299,792 kilometers per second.<|im_end|>
<|im_start|>user
How long does it take light to reach Earth from the Sun?<|im_end|>
<|im_start|>assistant
"""

inputs = tokenizer(conversation, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=False))

Example Outputs

Prompt: What is photosynthesis?
Response: Photosynthesis is the process by which plants and other organisms use sunlight, water, and carbon dioxide to produce energy and produce oxygen.

Prompt: What is the speed of light?
Response: The speed of light is approximately 299,792 kilometers per second.

Prompt: What are the three states of matter?
Response: The three states of matter are: 1. Solid 2. Liquid 3. Gas.

Prompt: How does a vaccine work?
Response: A vaccine is a biological agent that is designed to protect the body from harmful pathogens, such as bacteria, viruses, and parasites.

Benchmark Comparison

Zero-Shot Performance vs All Sub-200M Models

Model Params Training Data HellaSwag ARC-Challenge PIQA WinoGrande MMLU GSM8K
Jeeves 95M ~2B tokens 33.5% 26.8% 64.8% 52.4% 25.3% 1.7%
Cerebras-GPT 111M ~2.6B tokens 26.8% 16.6% 59.4% 48.8% β€” β€”
OPT 125M 180B tokens 29.2% 22.9% ~62% 51.6% 26.0% 0.2%
GPT-Neo 125M 300B tokens 30.3% 22.9% β€” 51.8% 26.0% 0.3%
SmolLM 135M 600B tokens 41.2% β€” 68.4% 51.3% 30.2% 1.0%
SmolLM2 135M 2T tokens 42.1% β€” 68.4% 51.3% 31.5% 1.4%
GPT-2 137M ~40B tokens 31.5% β€” β€” 50.4% 25.8% 0.7%
Pythia 160M 300B tokens 29.3% 18.1% 62.7% 51.9% β€” β€”

Models Jeeves Outperforms (with fewer parameters & less data)

vs Cerebras-GPT 111M (17% more params, similar data budget):

  • Jeeves wins on ALL shared benchmarks: HellaSwag +6.7pp, ARC-Challenge +10.2pp, PIQA +5.4pp, WinoGrande +3.6pp

vs OPT-125M (32% more params, 90x more training data):

  • Jeeves wins: HellaSwag +4.3pp, ARC-Challenge +3.9pp, PIQA +2.8pp, WinoGrande +0.8pp, GSM8K +1.5pp

vs GPT-Neo 125M (32% more params, 150x more training data):

  • Jeeves wins: HellaSwag +3.2pp, WinoGrande +0.6pp, GSM8K +1.4pp

vs GPT-2 137M (44% more params, 20x more training data):

  • Jeeves wins: HellaSwag +2.0pp, WinoGrande +2.0pp, GSM8K +1.0pp

vs Pythia 160M (68% more params, 150x more training data):

  • Jeeves wins on ALL shared benchmarks: HellaSwag +4.2pp, ARC-Challenge +8.7pp, PIQA +2.1pp, WinoGrande +0.5pp

Models That Beat Jeeves

SmolLM-135M and SmolLM2-135M outperform Jeeves on HellaSwag, PIQA, and MMLU β€” but were trained on 600B and 2T tokens respectively (300–1000x more data) using 64 H100 GPUs. Jeeves was trained on ~2B tokens.

Training Efficiency

Model Params Training Tokens HellaSwag per B tokens
Jeeves 95M ~2B 16.75
OPT-125M 125M 180B 0.16
GPT-Neo 125M 125M 300B 0.10
SmolLM2-135M 135M 2,000B 0.02
Pythia 160M 160M 300B 0.10
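The last column is simply each model's HellaSwag score divided by its training tokens in billions; a quick check of the table's arithmetic:

```python
# Reproduce the "HellaSwag per B tokens" column: benchmark score / training tokens (B).
models = {
    "Jeeves": (33.5, 2),
    "OPT-125M": (29.2, 180),
    "GPT-Neo 125M": (30.3, 300),
    "SmolLM2-135M": (42.1, 2000),
    "Pythia 160M": (29.3, 300),
}
for name, (score, tokens_b) in models.items():
    print(f"{name}: {score / tokens_b:.2f}")  # Jeeves: 16.75, OPT-125M: 0.16, ...
```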

Jeeves achieves 100–800x better benchmark-per-token efficiency than comparable models, demonstrating that architecture innovation (looped transformers + value residual learning) can dramatically reduce the data and compute needed to reach competitive performance.


Architecture

Jeeves uses a Looped Transformer β€” a single middle block is run multiple times with input injection, giving effective depth much larger than the unique parameter count.

Input β†’ [Early Layers 0-10] β†’ [Loop Block 11 Γ— 6 iters] β†’ [Late Layers 12-21] β†’ Output
                                      ↑          |
                                      +----------+  (input injection)

Each loop iteration reuses the same weights, so the model gets 27 effective layers of processing with only 22 unique layer parameter sets.
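The depth arithmetic from the diagram above, spelled out:

```python
# Unique vs. effective depth under the loop schedule described above.
early = 11        # layers 0-10
loop_blocks = 1   # layer 11, the looped block
loop_iters = 6    # iterations of the loop block
late = 10         # layers 12-21

unique_layers = early + loop_blocks + late
effective_layers = early + loop_blocks * loop_iters + late
print(unique_layers, effective_layers)  # 22 27
```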

Component Value
Parameters 96.3M (unique)
Effective depth 27 layers (via looping)
Unique layers 22
Loop config block[11] Γ— 6 iterations
Value residual βœ…
Hidden dim 576
FFN dim 1,536
Attention heads 9 (Q) / 3 (KV)
Vocab size 32,000
Max seq length 1,024
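A quick consistency check on the attention configuration in the table, assuming the usual convention that head dimension is hidden dim divided by query-head count:

```python
# With grouped query attention, each of the 3 KV heads is shared by a group of query heads.
hidden_dim = 576
q_heads = 9
kv_heads = 3

head_dim = hidden_dim // q_heads   # per-head dimension
group_size = q_heads // kv_heads   # query heads per KV head
print(head_dim, group_size)  # 64 3
```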

Key Innovations

  • Looped Transformer (arXiv 2311.12424) β€” weight sharing via block looping for parameter efficiency
  • Value Residual Learning (arXiv 2410.17897) β€” first-layer value residuals prevent representation collapse in deep/looped networks
  • Input Injection β€” adds pre-loop hidden state back during each loop iteration for training stability
  • Grouped Query Attention β€” 9 query heads with 3 key-value heads for efficient inference
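A minimal sketch of the value-residual idea, using plain lists; the fixed mixing weight `lam` is illustrative (the paper learns the mix), and the function name is hypothetical:

```python
# Sketch: mix a deeper layer's attention values with the first layer's values,
# so later layers retain access to early representations (arXiv 2410.17897).
def value_residual(v_layer, v_first, lam=0.5):
    return [lam * a + (1.0 - lam) * b for a, b in zip(v_layer, v_first)]

v_first = [1.0, 2.0]   # values computed at the first layer
v_deep = [0.0, 0.0]    # a deeper layer's (possibly collapsed) values
print(value_residual(v_deep, v_first))  # [0.5, 1.0]
```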

Training Pipeline

Stage Data Details
Pre-training ~2B tokens FineWeb-Edu, Cosmopedia, Python-Edu, OpenWebMath, StarCoder
Chat SFT ChatML conversations Instruction tuning for conversational ability
Tool SFT Function-calling data JSON tool calls with <|tool_call|> and <|tool_result|> markers

Special Tokens

Token ID Purpose
<pad> 0 Padding
<s> 1 Beginning of sequence
</s> 2 End of sequence
<|im_start|> — Start of a chat turn (ChatML)
<|im_end|> — End of a chat turn (ChatML)
<|tool_call|> — Marks a JSON tool call
<|tool_result|> — Marks a tool result
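The tool markers wrap JSON payloads, per the training pipeline above. An illustrative turn might look like the following; the JSON schema and turn placement are assumptions, not documented by this card:

```python
import json

# Illustrative only: the payload schema and turn layout are assumptions based on
# the <|tool_call|>/<|tool_result|> markers described in the training pipeline.
call = {"name": "get_weather", "arguments": {"city": "Paris"}}
result = {"temperature_c": 18}

turn = (
    "<|im_start|>assistant\n"
    f"<|tool_call|>{json.dumps(call)}<|im_end|>\n"
    "<|im_start|>user\n"
    f"<|tool_result|>{json.dumps(result)}<|im_end|>\n"
    "<|im_start|>assistant\n"
)
print(turn)
```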

Limitations

  • 96M parameters β€” this is a small research model, not a production system
  • SmolLM/SmolLM2 (135M) achieve higher absolute scores with 300–1000x more training data
  • May hallucinate facts, especially for complex math or rare knowledge
  • Repetition in longer outputs is common at this scale
  • Best suited for simple Q&A, short-form generation, and research into efficient architectures

License

Apache 2.0

Citation

@misc{jeeves2026,
  title={Jeeves: Efficient Language Modeling with Looped Transformers and Value Residual Learning},
  author={Anurich},
  year={2026},
  url={https://huggingface.co/Anurich/Jeeves-Small-95M}
}