# Jeeves-Small-75M

A compact 75M-parameter language model built on Looped Transformer and Value Residual Learning architectures, with native support for tool calling / function calling.

Jeeves is designed to punch above its weight class by reusing a small set of transformer layers iteratively (looping), giving it an effective depth far beyond what its parameter count suggests.
## Quick Start

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Anurich/Jeeves-Small-75M", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("Anurich/Jeeves-Small-75M", trust_remote_code=True)

inputs = tokenizer("Hello, how are you?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
**Note:** `trust_remote_code=True` is required because the model uses custom architecture code.
## Tool Calling (Function Calling)

Jeeves supports structured tool/function calling out of the box. Below is an example:
```python
tools = [
    {
        "name": "get_weather",
        "description": "Get the current weather for a given location.",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string", "description": "City name"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
            },
            "required": ["location"]
        }
    }
]

messages = [
    {"role": "user", "content": "What's the weather like in London?"}
]

# Format the prompt with the tool definitions using the chat template
prompt = tokenizer.apply_chat_template(
    messages,
    tools=tools,
    tokenize=False,
    add_generation_prompt=True
)

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
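When the model decides to call a tool, the generated text contains a structured call that your application must parse and execute. Below is a minimal sketch of that parsing step, assuming the model emits a JSON object of the form `{"name": ..., "arguments": {...}}` (the exact output format depends on the chat template, so treat this as illustrative, not as the guaranteed wire format):

```python
import json

def extract_tool_call(text):
    """Scan generated text for a JSON object with "name" and "arguments" keys.

    Returns (name, arguments) if a tool call is found, otherwise None.
    """
    decoder = json.JSONDecoder()
    idx = text.find("{")
    while idx != -1:
        try:
            obj, _ = decoder.raw_decode(text, idx)
        except json.JSONDecodeError:
            idx = text.find("{", idx + 1)
            continue
        if isinstance(obj, dict) and "name" in obj and "arguments" in obj:
            return obj["name"], obj["arguments"]
        idx = text.find("{", idx + 1)
    return None

# Hypothetical model output containing a tool call:
generated = 'Let me check. {"name": "get_weather", "arguments": {"location": "London", "unit": "celsius"}}'
print(extract_tool_call(generated))
# ('get_weather', {'location': 'London', 'unit': 'celsius'})
```

In a full agent loop, you would execute the matched function, append its result to `messages` as a tool-response turn, and generate again.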
## Architecture

| Component | Value |
|---|---|
| Parameters | 74.9M |
| Unique layers | 8 |
| Effective depth | 15 (7 standard layers + block 4 applied 8 times) |
| Loop | block[4] × 8 |
| Value residual | Yes |
| Hidden dim | 768 |
| FFN dim | 2,048 |
| Attention heads | 12 (Q) / 4 (KV), GQA |
| Vocab size | 32,000 |
| Max seq length | 512 |
| Training steps | 1,100 |
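The 12/4 head split in the table means grouped-query attention (GQA): every 3 query heads share one key/value head, cutting the KV cache to a third of standard multi-head attention. A toy NumPy illustration of the head sharing (dimensions mirror the table; this is a sketch, not the model's actual implementation):

```python
import numpy as np

rng = np.random.default_rng(1)
seq, d_model, n_q, n_kv = 4, 768, 12, 4
d_head = d_model // n_q                 # 64

q = rng.standard_normal((n_q, seq, d_head))    # 12 query heads
k = rng.standard_normal((n_kv, seq, d_head))   # only 4 KV heads are stored
v = rng.standard_normal((n_kv, seq, d_head))

group = n_q // n_kv                     # 3 query heads per KV head
k_full = np.repeat(k, group, axis=0)    # broadcast the 4 KV heads to 12
v_full = np.repeat(v, group, axis=0)

scores = q @ k_full.transpose(0, 2, 1) / np.sqrt(d_head)
attn = np.exp(scores - scores.max(-1, keepdims=True))
attn /= attn.sum(-1, keepdims=True)
out = attn @ v_full
print(out.shape)  # (12, 4, 64)
```

Only `k` and `v` (4 heads each) need to be cached during generation, which is where the memory saving comes from.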
## Key Innovations

- **Looped Transformer** (arXiv:2311.12424): A single transformer block is applied repeatedly in a loop, dramatically increasing effective depth while keeping the parameter count small. This lets Jeeves refine its representations iteratively rather than in a single pass.
- **Value Residual Learning** (arXiv:2410.17897): Residual connections applied at the value projection level alleviate attention concentration in deep/looped networks, improving gradient flow and stability.
- **Input Injection**: The original input is re-injected at each loop iteration to prevent representational drift across loops, a critical stabilization technique for looped architectures.
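A minimal sketch of how these three ideas fit together in a forward pass (toy NumPy code; the single-matrix "block", loop count, and mixing coefficient `lam` are illustrative assumptions, not the actual Jeeves implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16                      # toy hidden dim (Jeeves uses 768)

# Toy "transformer block": a single self-attention-style mixing step.
W_q, W_k, W_v, W_o = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(4))

def block(h, v_first=None, lam=0.5):
    q, k, v = h @ W_q, h @ W_k, h @ W_v
    if v_first is not None:
        # Value residual: mix current values with the first pass's values.
        v = lam * v_first + (1 - lam) * v
    scores = q @ k.T / np.sqrt(d)
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)
    return (attn @ v) @ W_o, v

x = rng.standard_normal((8, d))   # embedded input, 8 tokens

# First pass establishes v_first for the value residual.
out, v_first = block(x)
h = x + out

# Looped block: reuse the SAME weights, re-injecting the input each iteration.
for _ in range(8):
    out, _ = block(h, v_first=v_first)
    h = h + out + x               # input injection stabilizes the loop

print(h.shape)  # (8, 16)
```

The key point is that the loop adds depth (8 extra applications of `block`) without adding any new parameters, while `+ x` (input injection) and the `v_first` mix (value residual) keep the iterated representation anchored.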
## Benchmark Results

Evaluated using the EleutherAI lm-evaluation-harness.
| Benchmark | Accuracy | Correct | Total |
|---|---|---|---|
| HellaSwag | 30.9% | 3,100 | 10,042 |
| ARC-Easy | 47.1% | 1,118 | 2,376 |
| ARC-Challenge | 24.9% | 292 | 1,172 |
| ARC (Average) | 36.0% | – | – |
| PIQA | 63.9% | 1,174 | 1,838 |
| WinoGrande | 52.4% | 664 | 1,267 |
| MMLU | 25.2% | 3,536 | 14,042 |
| TruthfulQA | 24.8% | 203 | 817 |
| GSM8K | 1.4% | 18 | 1,319 |
| IFEval | 40.0% | 4 | 10 |
### Notes on Results
- PIQA (63.9%) and WinoGrande (52.4%) are the strongest results, indicating reasonable physical commonsense and pronoun-resolution reasoning for the model's size.
- MMLU (25.2%) is close to random (25% for 4-way MCQ), which is expected given the model's size and early training stage (1,100 steps). More training is needed for knowledge-heavy tasks.
- GSM8K (1.4%) reflects a known limitation: multi-step mathematical reasoning is very demanding and typically requires much larger models or specialized fine-tuning.
- IFEval (40.0%) is promising for a 75M model, though the sample is tiny (4 of 10 prompts); it reflects the tool-calling and instruction-following training signal.
## Limitations
- Short context (512 tokens): Jeeves currently supports a maximum of 512 tokens. Long documents, multi-turn conversations, and complex tool chains may be truncated.
- Early training stage: At 1,100 training steps, this is an early checkpoint. Knowledge-heavy and math benchmarks (MMLU, GSM8K) will improve significantly with more training.
- Not suitable for factual retrieval: Like all small language models, Jeeves may hallucinate facts. It is best used with grounding via tool calls or RAG pipelines.
- English-centric: Trained primarily on English data. Performance on other languages is not guaranteed.
## Intended Use
Jeeves is designed for:
- On-device / edge inference where a small footprint is critical
- Tool-augmented agents that rely on function calling rather than parametric knowledge
- Research into efficient architectures (looped transformers, value residual)
- Fine-tuning on domain-specific tasks where a small, fast base model is preferred
## Citation

If you use Jeeves in your work, please also cite the papers that inspired its architecture:

```bibtex
@article{looped_transformer_2023,
  title={Looped Transformers are Better at Learning Learning Algorithms},
  author={...},
  journal={arXiv preprint arXiv:2311.12424},
  year={2023}
}

@article{value_residual_2024,
  title={Value Residual Learning For Alleviating Attention Concentration In Transformers},
  author={...},
  journal={arXiv preprint arXiv:2410.17897},
  year={2024}
}
```
## License

Apache 2.0. See LICENSE for details.