Jeeves (96M) – Looped Transformer
A compact instruction-tuned language model using Looped Transformer + Value Residual Learning. Trained with ChatML format for conversational AI and tool-calling capabilities.
Most compute-efficient model in its weight class: trained on only ~2B tokens, it outperforms models trained on 20–150x more data.
Quick Start
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Anurich/Jeeves-Small-95M", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("Anurich/Jeeves-Small-95M", trust_remote_code=True)
model.eval()

# Use ChatML format (recommended for best results)
prompt = "<|im_start|>user\nWhat is photosynthesis?<|im_end|>\n<|im_start|>assistant\n"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=200, do_sample=True, temperature=0.7, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=False))
```
Note: `trust_remote_code=True` is required.
Chat Format (ChatML)
This model was fine-tuned using ChatML format. For best results, structure prompts like:
```
<|im_start|>user
Your question here<|im_end|>
<|im_start|>assistant
```
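For programmatic use, the ChatML structure above can be rendered from a list of role/content messages. The helper below is a hypothetical illustration (not part of the model repo), assuming the same `<|im_start|>`/`<|im_end|>` markers used throughout this card:

```python
# Hypothetical helper: render a list of {"role", "content"} messages into the
# ChatML prompt format this model was fine-tuned on.
def build_chatml_prompt(messages, add_generation_prompt=True):
    prompt = ""
    for msg in messages:
        prompt += f"<|im_start|>{msg['role']}\n{msg['content']}<|im_end|>\n"
    if add_generation_prompt:
        # Open an assistant turn so the model completes it
        prompt += "<|im_start|>assistant\n"
    return prompt

print(build_chatml_prompt([{"role": "user", "content": "What is photosynthesis?"}]))
```

This produces exactly the prompt string used in the Quick Start example above.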
Multi-turn Conversation
```python
conversation = """<|im_start|>user
What is the speed of light?<|im_end|>
<|im_start|>assistant
The speed of light is approximately 299,792 kilometers per second.<|im_end|>
<|im_start|>user
How long does it take light to reach Earth from the Sun?<|im_end|>
<|im_start|>assistant
"""
inputs = tokenizer(conversation, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=False))
```
Example Outputs
| Prompt | Response |
|---|---|
| What is photosynthesis? | Photosynthesis is the process by which plants and other organisms use sunlight, water, and carbon dioxide to produce energy and produce oxygen. |
| What is the speed of light? | The speed of light is approximately 299,792 kilometers per second. |
| What are the three states of matter? | The three states of matter are: 1. Solid 2. Liquid 3. Gas. |
| How does a vaccine work? | A vaccine is a biological agent that is designed to protect the body from harmful pathogens, such as bacteria, viruses, and parasites. |
Benchmark Comparison
Zero-Shot Performance vs All Sub-200M Models
| Model | Params | Training Data | HellaSwag | ARC-Challenge | PIQA | WinoGrande | MMLU | GSM8K |
|---|---|---|---|---|---|---|---|---|
| Jeeves | 95M | ~2B tokens | 33.5% | 26.8% | 64.8% | 52.4% | 25.3% | 1.7% |
| Cerebras-GPT | 111M | ~2.6B tokens | 26.8% | 16.6% | 59.4% | 48.8% | – | – |
| OPT | 125M | 180B tokens | 29.2% | 22.9% | ~62% | 51.6% | 26.0% | 0.2% |
| GPT-Neo | 125M | 300B tokens | 30.3% | 22.9% | – | 51.8% | 26.0% | 0.3% |
| SmolLM | 135M | 600B tokens | 41.2% | – | 68.4% | 51.3% | 30.2% | 1.0% |
| SmolLM2 | 135M | 2T tokens | 42.1% | – | 68.4% | 51.3% | 31.5% | 1.4% |
| GPT-2 | 137M | ~40B tokens | 31.5% | – | – | 50.4% | 25.8% | 0.7% |
| Pythia | 160M | 300B tokens | 29.3% | 18.1% | 62.7% | 51.9% | – | – |
Models Jeeves Outperforms (with fewer parameters & less data)
vs Cerebras-GPT 111M (17% more params, similar data budget):
- Jeeves wins on ALL shared benchmarks: HellaSwag +6.7pp, ARC-Challenge +10.2pp, PIQA +5.4pp, WinoGrande +3.6pp
vs OPT-125M (32% more params, 90x more training data):
- Jeeves wins: HellaSwag +4.3pp, ARC-Challenge +3.9pp, PIQA +2.8pp, WinoGrande +0.8pp, GSM8K +1.5pp
vs GPT-Neo 125M (32% more params, 150x more training data):
- Jeeves wins: HellaSwag +3.2pp, WinoGrande +0.6pp, GSM8K +1.4pp
vs GPT-2 137M (44% more params, 20x more training data):
- Jeeves wins: HellaSwag +2.0pp, WinoGrande +2.0pp, GSM8K +1.0pp
vs Pythia 160M (68% more params, 150x more training data):
- Jeeves wins on ALL shared benchmarks: HellaSwag +4.2pp, ARC-Challenge +8.7pp, PIQA +2.1pp, WinoGrande +0.5pp
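The percentage-point deltas listed above follow directly from the zero-shot table. A quick arithmetic cross-check (scores in percent, taken from the table):

```python
# Cross-check the per-benchmark deltas against the zero-shot table above.
jeeves = {"hellaswag": 33.5, "arc_c": 26.8, "piqa": 64.8, "winogrande": 52.4}
cerebras = {"hellaswag": 26.8, "arc_c": 16.6, "piqa": 59.4, "winogrande": 48.8}
pythia = {"hellaswag": 29.3, "arc_c": 18.1, "piqa": 62.7, "winogrande": 51.9}

def deltas(a, b):
    # Percentage-point difference per shared benchmark, rounded to one decimal
    return {k: round(a[k] - b[k], 1) for k in a}

print(deltas(jeeves, cerebras))  # {'hellaswag': 6.7, 'arc_c': 10.2, 'piqa': 5.4, 'winogrande': 3.6}
print(deltas(jeeves, pythia))    # {'hellaswag': 4.2, 'arc_c': 8.7, 'piqa': 2.1, 'winogrande': 0.5}
```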
Models That Beat Jeeves
SmolLM-135M and SmolLM2-135M outperform Jeeves on HellaSwag, PIQA, and MMLU, but were trained on 600B and 2T tokens respectively (300–1000x more data) using 64 H100 GPUs. Jeeves was trained on ~2B tokens.
Training Efficiency
| Model | Params | Training Tokens | HellaSwag per B tokens |
|---|---|---|---|
| Jeeves | 95M | ~2B | 16.75 |
| OPT-125M | 125M | 180B | 0.16 |
| GPT-Neo 125M | 125M | 300B | 0.10 |
| SmolLM2-135M | 135M | 2,000B | 0.02 |
| Pythia 160M | 160M | 300B | 0.10 |
Jeeves achieves 100β800x better benchmark-per-token efficiency than comparable models, demonstrating that architecture innovation (looped transformers + value residual learning) can dramatically reduce the data and compute needed to reach competitive performance.
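The "HellaSwag per B tokens" column is simply the zero-shot HellaSwag score divided by training tokens in billions, which can be reproduced from the tables above:

```python
# Reproduce the efficiency column: HellaSwag score (%) / training tokens (B).
models = [
    ("Jeeves", 33.5, 2),
    ("OPT-125M", 29.2, 180),
    ("GPT-Neo 125M", 30.3, 300),
    ("SmolLM2-135M", 42.1, 2000),
    ("Pythia 160M", 29.3, 300),
]
efficiency = {name: round(score / tokens_b, 2) for name, score, tokens_b in models}
print(efficiency)  # Jeeves: 16.75, OPT-125M: 0.16, GPT-Neo 125M: 0.1, ...
```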
Architecture
Jeeves uses a Looped Transformer: a single middle block is run multiple times with input injection, giving an effective depth much larger than the unique parameter count.
```
Input → [Early Layers 0-10] → [Loop Block 11 × 6 iters] → [Late Layers 12-21] → Output
                                     ↑          |
                                     +----------+  (input injection)
```
Each loop iteration reuses the same weights, so the model gets 27 effective layers of processing with only 22 unique layer parameter sets.
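The control flow described above can be sketched schematically in pure Python. This is a toy illustration of the loop structure, not the actual implementation; the layer functions and the additive form of input injection here are stand-ins:

```python
# Schematic sketch of the looped forward pass: layers 0-10 run once, the shared
# block at index 11 runs 6 times with the pre-loop hidden state injected back,
# then layers 12-21 run once. Layer bodies are toy stand-ins.
def looped_forward(x, layers, loop_idx=11, loop_iters=6):
    calls = 0
    for layer in layers[:loop_idx]:           # early layers 0-10
        x = layer(x); calls += 1
    loop_input = x                            # saved for input injection
    for _ in range(loop_iters):               # same weights reused each iteration
        x = layers[loop_idx](x + loop_input); calls += 1
    for layer in layers[loop_idx + 1:]:       # late layers 12-21
        x = layer(x); calls += 1
    return x, calls

# 22 unique "layers" yield 27 effective layer applications: 11 + 6 + 10.
layers = [lambda v: v + 1 for _ in range(22)]
_, effective_depth = looped_forward(0, layers)
print(effective_depth)  # 27
```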
| Component | Value |
|---|---|
| Parameters | 96.3M (unique) |
| Effective depth | 27 layers (via looping) |
| Unique layers | 22 |
| Loop config | block[11] × 6 iterations |
| Value residual | ✓ |
| Hidden dim | 576 |
| FFN dim | 1,536 |
| Attention heads | 9 (Q) / 3 (KV) |
| Vocab size | 32,000 |
| Max seq length | 1,024 |
Key Innovations
- Looped Transformer (arXiv 2311.12424) – weight sharing via block looping for parameter efficiency
- Value Residual Learning (arXiv 2410.17897) – first-layer value residuals prevent representation collapse in deep/looped networks
- Input Injection – adds the pre-loop hidden state back during each loop iteration for training stability
- Grouped Query Attention – 9 query heads with 3 key-value heads for efficient inference
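The grouped-query attention layout follows directly from the head counts: with 9 query heads and 3 key-value heads, each KV head is shared by a group of 3 query heads. A minimal sketch of that mapping:

```python
# Grouped-query attention head layout: each KV head serves a contiguous group
# of query heads (group size = n_q_heads / n_kv_heads = 3 here).
n_q_heads, n_kv_heads = 9, 3
group_size = n_q_heads // n_kv_heads
mapping = {q: q // group_size for q in range(n_q_heads)}
print(mapping)  # {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1, 6: 2, 7: 2, 8: 2}
```

Sharing KV heads this way shrinks the KV cache by 3x relative to full multi-head attention, which is the main inference-efficiency win.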
Training Pipeline
| Stage | Data | Details |
|---|---|---|
| Pre-training | ~2B tokens | FineWeb-Edu, Cosmopedia, Python-Edu, OpenWebMath, StarCoder |
| Chat SFT | ChatML conversations | Instruction tuning for conversational ability |
| Tool SFT | Function-calling data | JSON tool calls with `<\|tool_call\|>` and `<\|tool_result\|>` markers |
Special Tokens
| Token | ID | Purpose |
|---|---|---|
| `<pad>` | 0 | Padding |
| `<s>` | 1 | Beginning of sequence |
| `</s>` | 2 | End of sequence |
| `<\|im_start\|>` | – | ChatML turn start |
| `<\|im_end\|>` | – | ChatML turn end |
| `<\|tool_call\|>` | – | Tool call marker |
| `<\|tool_result\|>` | – | Tool result marker |
Limitations
- 96M parameters β this is a small research model, not a production system
- SmolLM/SmolLM2 (135M) achieve higher absolute scores with 300–1000x more training data
- May hallucinate facts, especially for complex math or rare knowledge
- Repetition in longer outputs is common at this scale
- Best suited for simple Q&A, short-form generation, and research into efficient architectures
License
Apache 2.0
Citation
```bibtex
@misc{jeeves2026,
  title={Jeeves: Efficient Language Modeling with Looped Transformers and Value Residual Learning},
  author={Anurich},
  year={2026},
  url={https://huggingface.co/Anurich/Jeeves-Small-95M}
}
```