# Jeeves-Small-75M

A compact 75M-parameter language model built on Looped Transformer and Value Residual Learning architectures, with native support for tool calling / function calling.

Jeeves is designed to punch above its weight class by reusing a small set of transformer layers iteratively (looping), giving it an effective depth far beyond what its parameter count suggests.
## Quick Start

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Anurich/Jeeves-Small-75M", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("Anurich/Jeeves-Small-75M", trust_remote_code=True)

inputs = tokenizer("Hello, how are you?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
**Note:** `trust_remote_code=True` is required because the model uses custom architecture code.
## Tool Calling (Function Calling)

Jeeves supports structured tool/function calling out of the box. Below is an example:
```python
tools = [
    {
        "name": "get_weather",
        "description": "Get the current weather for a given location.",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string", "description": "City name"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
            },
            "required": ["location"]
        }
    }
]

messages = [
    {"role": "user", "content": "What's the weather like in London?"}
]

# Format the prompt with the tool definitions using the chat template
prompt = tokenizer.apply_chat_template(
    messages,
    tools=tools,
    tokenize=False,
    add_generation_prompt=True
)

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
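When the model decides to call a tool, the generated text contains a structured call that your application must parse and execute. Below is a minimal sketch of that parsing step, assuming the model emits a JSON object of the form `{"name": ..., "arguments": {...}}` (the exact output format depends on the chat template, so treat this as illustrative, not as the guaranteed wire format):

```python
import json

def extract_tool_call(text):
    """Scan generated text for a JSON object with "name" and "arguments" keys.

    Returns (name, arguments) if a tool call is found, otherwise None.
    """
    decoder = json.JSONDecoder()
    idx = text.find("{")
    while idx != -1:
        try:
            obj, _ = decoder.raw_decode(text, idx)
        except json.JSONDecodeError:
            idx = text.find("{", idx + 1)
            continue
        if isinstance(obj, dict) and "name" in obj and "arguments" in obj:
            return obj["name"], obj["arguments"]
        idx = text.find("{", idx + 1)
    return None

# Hypothetical model output containing a tool call:
generated = 'Let me check. {"name": "get_weather", "arguments": {"location": "London", "unit": "celsius"}}'
print(extract_tool_call(generated))
# ('get_weather', {'location': 'London', 'unit': 'celsius'})
```

In a full agent loop, you would execute the matched function, append its result to `messages` as a tool-response turn, and generate again.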
## Architecture

| Component | Value |
|---|---|
| Parameters | 74.9M |
| Unique layers | 8 |
| Effective depth | 15 (7 standard layers + block 4 applied 8 times) |
| Loop | block[4] × 8 |
| Value residual | Yes |
| Hidden dim | 768 |
| FFN dim | 2,048 |
| Attention heads | 12 (Q) / 4 (KV), GQA |
| Vocab size | 32,000 |
| Max seq length | 512 |
| Training steps | 1,100 |
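The 12/4 head split in the table means grouped-query attention (GQA): every 3 query heads share one key/value head, cutting the KV cache to a third of standard multi-head attention. A toy NumPy illustration of the head sharing (dimensions mirror the table; this is a sketch, not the model's actual implementation):

```python
import numpy as np

rng = np.random.default_rng(1)
seq, d_model, n_q, n_kv = 4, 768, 12, 4
d_head = d_model // n_q                 # 64

q = rng.standard_normal((n_q, seq, d_head))    # 12 query heads
k = rng.standard_normal((n_kv, seq, d_head))   # only 4 KV heads are stored
v = rng.standard_normal((n_kv, seq, d_head))

group = n_q // n_kv                     # 3 query heads per KV head
k_full = np.repeat(k, group, axis=0)    # broadcast the 4 KV heads to 12
v_full = np.repeat(v, group, axis=0)

scores = q @ k_full.transpose(0, 2, 1) / np.sqrt(d_head)
attn = np.exp(scores - scores.max(-1, keepdims=True))
attn /= attn.sum(-1, keepdims=True)
out = attn @ v_full
print(out.shape)  # (12, 4, 64)
```

Only `k` and `v` (4 heads each) need to be cached during generation, which is where the memory saving comes from.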
## Key Innovations

- **Looped Transformer** (arXiv:2311.12424): A single transformer block is applied repeatedly in a loop, dramatically increasing effective depth while keeping the parameter count small. This lets Jeeves refine its representations iteratively rather than in a single pass.
- **Value Residual Learning** (arXiv:2410.17897): Residual connections applied at the value projection level alleviate attention concentration in deep/looped networks, improving gradient flow and stability.
- **Input Injection**: The original input is re-injected at each loop iteration to prevent representational drift across loops, a critical stabilization technique for looped architectures.
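A minimal sketch of how these three ideas fit together in a forward pass (toy NumPy code; the single-matrix "block", loop count, and mixing coefficient `lam` are illustrative assumptions, not the actual Jeeves implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16                      # toy hidden dim (Jeeves uses 768)

# Toy "transformer block": a single self-attention-style mixing step.
W_q, W_k, W_v, W_o = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(4))

def block(h, v_first=None, lam=0.5):
    q, k, v = h @ W_q, h @ W_k, h @ W_v
    if v_first is not None:
        # Value residual: mix current values with the first pass's values.
        v = lam * v_first + (1 - lam) * v
    scores = q @ k.T / np.sqrt(d)
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)
    return (attn @ v) @ W_o, v

x = rng.standard_normal((8, d))   # embedded input, 8 tokens

# First pass establishes v_first for the value residual.
out, v_first = block(x)
h = x + out

# Looped block: reuse the SAME weights, re-injecting the input each iteration.
for _ in range(8):
    out, _ = block(h, v_first=v_first)
    h = h + out + x               # input injection stabilizes the loop

print(h.shape)  # (8, 16)
```

The key point is that the loop adds depth (8 extra applications of `block`) without adding any new parameters, while `+ x` (input injection) and the `v_first` mix (value residual) keep the iterated representation anchored.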
## Benchmark Results

Evaluated using the EleutherAI lm-evaluation-harness.
| Benchmark | Accuracy | Correct | Total |
|---|---|---|---|
| HellaSwag | 30.9% | 3,100 | 10,042 |
| ARC-Easy | 47.1% | 1,118 | 2,376 |
| ARC-Challenge | 24.9% | 292 | 1,172 |
| ARC (Average) | 36.0% | – | – |
| PIQA | 63.9% | 1,174 | 1,838 |
| WinoGrande | 52.4% | 664 | 1,267 |
| MMLU | 25.2% | 3,536 | 14,042 |
| TruthfulQA | 24.8% | 203 | 817 |
| GSM8K | 1.4% | 18 | 1,319 |
| IFEval | 40.0% | 4 | 10 |
### Notes on Results
- PIQA (63.9%) and WinoGrande (52.4%) are the strongest results, indicating reasonable physical commonsense and pronoun-resolution reasoning for the model's size.
- MMLU (25.2%) is close to random (25% for 4-way MCQ), which is expected given the model's size and early training stage (1,100 steps). More training is needed for knowledge-heavy tasks.
- GSM8K (1.4%) reflects a known limitation: multi-step mathematical reasoning is very demanding and typically requires much larger models or specialized fine-tuning.
- IFEval (40.0%) is promising for a 75M model, though the sample is tiny (4 of 10 prompts); it reflects the tool-calling and instruction-following training signal.
## Limitations
- Short context (512 tokens): Jeeves currently supports a maximum of 512 tokens. Long documents, multi-turn conversations, and complex tool chains may be truncated.
- Early training stage: At 1,100 training steps, this is an early checkpoint. Knowledge-heavy and math benchmarks (MMLU, GSM8K) will improve significantly with more training.
- Not suitable for factual retrieval: Like all small language models, Jeeves may hallucinate facts. It is best used with grounding via tool calls or RAG pipelines.
- English-centric: Trained primarily on English data. Performance on other languages is not guaranteed.
## Intended Use
Jeeves is designed for:
- On-device / edge inference where a small footprint is critical
- Tool-augmented agents that rely on function calling rather than parametric knowledge
- Research into efficient architectures (looped transformers, value residual)
- Fine-tuning on domain-specific tasks where a small, fast base model is preferred
## Citation

If you use Jeeves in your work, please also cite the papers that inspired its architecture:

```bibtex
@article{looped_transformer_2023,
  title={Looped Transformers are Better at Learning Learning Algorithms},
  author={...},
  journal={arXiv preprint arXiv:2311.12424},
  year={2023}
}

@article{value_residual_2024,
  title={Value Residual Learning For Alleviating Attention Concentration In Transformers},
  author={...},
  journal={arXiv preprint arXiv:2410.17897},
  year={2024}
}
```
## License

Apache 2.0. See LICENSE for details.