Nesso-0.4B-Agentic
Nesso-0.4B-Agentic is a bilingual English/Italian Small Language Model (SLM) optimized for function calling, structured output generation, and agentic execution patterns. It is post-trained on top of Zagreus-0.4B-ita, a foundational model trained from scratch by the mii-llm community (Made in Italy – Large Language Model) on the Seeweb HPC infrastructure.
Designed for sovereign edge inference, Nesso-0.4B-Agentic targets deployment scenarios that require reliable tool use, structured JSON output, and multi-step agentic reasoning — all within a compact ~400M parameter footprint.
⚠️ This model is currently at the SFT (Supervised Fine-Tuning) stage. DPO (Direct Preference Optimization) training is planned and updated results will be published upon completion.
Model Details
| Property | Value |
|---|---|
| Architecture | Modified Llama-3.2 (fully dense) |
| Parameters | ~400M |
| Hidden size | 960 |
| Layers | 32 |
| Attention heads | 15 (KV heads: 5) |
| Context length | 4096 tokens |
| Tokenizer | Llama-3.2 (vocab_size: 128,256) |
| Precision | BF16 |
| Languages | English, Italian |
| Base model | mii-llm/zagreus-0.4B-ita |
| Post-training framework | Axolotl + FSDP |
| Chat template | ChatML |
Training Details
Base Model Pre-training
Nesso-0.4B-Agentic is built on Zagreus-0.4B-ita, which was pre-trained on approximately 1 trillion tokens using the following data mix:
| Dataset | Description |
|---|---|
| FineWeb (350BT sample) | ~350B tokens of English web text |
| FineWeb-2 (ita_Latn) | Italian web text |
| FinePDFs (ita_Latn) | Italian PDF documents |
| StarCoder Data | ~250B tokens of code |
Token distribution: ~400B English + ~400B Italian + ~200B Code
Infrastructure: 64× NVIDIA A100 GPUs (8 nodes × 8 GPUs) on Seeweb HPC
Framework: Nanotron (mii-llm fork)
Post-training (SFT)
Post-training was performed using Axolotl with FSDP across 4 nodes (32× A100 GPUs).
The instruction dataset is a proprietary bilingual (English/Italian) corpus curated by the mii-llm team, with dedicated focus on function calling, structured JSON output, tool orchestration, and agentic execution patterns. This dataset was built through years of iteration across domains including finance, cybersecurity, and multi-step agentic workflows, and is considered a strategic research asset not released as open source.
Key hyperparameters:
| Hyperparameter | Value |
|---|---|
| Optimizer | AdamW (fused) |
| Learning rate | 1e-3 |
| LR scheduler | Cosine (constant ratio: 0.8, min ratio: 0.3) |
| Epochs | 3 |
| Micro batch size | 1 |
| Gradient accumulation steps | 8 |
| Sequence length | 4096 |
| Max grad norm | 1.0 |
| Precision | BF16 + Flash Attention |
| FSDP strategy | FULL_SHARD |
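For reference, the hyperparameters above map onto an Axolotl configuration along these lines. This is a sketch, not the actual training file: the dataset entries are omitted and the exact scheduler keys for the constant/min ratios are assumptions.

```yaml
# Sketch of an Axolotl config matching the table above (illustrative only).
base_model: mii-llm/zagreus-0.4B-ita
chat_template: chatml

sequence_len: 4096
micro_batch_size: 1
gradient_accumulation_steps: 8
num_epochs: 3

optimizer: adamw_torch_fused
learning_rate: 1e-3
lr_scheduler: cosine
max_grad_norm: 1.0

bf16: true
flash_attention: true

fsdp:
  - full_shard
```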
Chat Template
This model uses the ChatML format:
```
<|im_start|>system
You are a helpful assistant with access to tools.<|im_end|>
<|im_start|>user
What is the weather in Rome today?<|im_end|>
<|im_start|>assistant
```
Special tokens:
- `eos_token`: `<|im_end|>`
- `pad_token`: `<|im_end|>`
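The template can be reproduced by hand when debugging prompts. A minimal sketch that mirrors what `tokenizer.apply_chat_template` emits for this model (prefer the tokenizer method in real code, since it also handles tool definitions):

```python
# Minimal ChatML formatter mirroring the template above (illustrative only;
# use tokenizer.apply_chat_template in production code).
def format_chatml(messages, add_generation_prompt=True):
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>" for m in messages]
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here
        parts.append("<|im_start|>assistant")
    return "\n".join(parts)

prompt = format_chatml([
    {"role": "system", "content": "You are a helpful assistant with access to tools."},
    {"role": "user", "content": "What is the weather in Rome today?"},
])
print(prompt)
```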
Usage
```python
import re

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mii-llm/nesso-0.4B-agentic"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)


def chat(messages, tools=None, max_tokens=256):
    prompt = tokenizer.apply_chat_template(
        messages,
        tools=tools,
        tokenize=False,
        add_generation_prompt=True,
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        max_new_tokens=max_tokens,
        do_sample=False,  # greedy decoding; temperature/top_p only apply when do_sample=True
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.eos_token_id,
    )
    text = tokenizer.decode(outputs[0], skip_special_tokens=False)
    # Extract the last assistant turn from the ChatML transcript
    blocks = re.findall(
        r"<\|im_start\|>assistant\s*(.*?)<\|im_end\|>",
        text,
        flags=re.S,
    )
    answer = blocks[-1].strip() if blocks else text.strip()
    print("\n=== RAW OUTPUT ===\n")
    print(text)
    print("\n=== PARSED ASSISTANT ===\n")
    print(answer)
    return answer


# System prompt (Italian): "You are an assistant that can use tools.
# When external information is needed, call a function.
# Use EXACTLY the expected <tool_call> format."
system_prompt = (
    "Sei un assistente che può usare strumenti.\n"
    "Quando servono informazioni esterne, chiama una funzione.\n"
    "Usa ESATTAMENTE il formato <tool_call> previsto."
)

# ----- TOOL DEFINITIONS -----
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Ritorna il meteo per una città",  # "Returns the weather for a city"
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string"},
                },
                "required": ["city"],
            },
        },
    }
]

# ----- MESSAGES -----
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": "Che tempo fa a Milano?"},  # "What's the weather in Milan?"
]

out = chat(messages, tools=tools)
```
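An agentic loop then needs to parse the model's tool call out of the generated text. A sketch, assuming the model emits a `<tool_call>` block containing a JSON object (the system prompt above references this tag, but the exact emission format depends on the chat template, so treat it as an assumption):

```python
import json
import re


def parse_tool_calls(text):
    """Extract JSON payloads from <tool_call>...</tool_call> blocks (assumed format)."""
    calls = []
    for block in re.findall(r"<tool_call>\s*(.*?)\s*</tool_call>", text, flags=re.S):
        try:
            calls.append(json.loads(block))
        except json.JSONDecodeError:
            pass  # skip malformed JSON rather than crashing the agent loop
    return calls


example = '<tool_call>{"name": "get_weather", "arguments": {"city": "Milano"}}</tool_call>'
calls = parse_tool_calls(example)
print(calls)  # [{'name': 'get_weather', 'arguments': {'city': 'Milano'}}]
```

Skipping malformed JSON instead of raising lets the loop fall back to re-prompting the model when a call fails validation.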
💡 Tip: For function calling and structured output tasks, we recommend using a lower temperature (0.1–0.3) to improve JSON validity and output consistency.
Evaluation
We used our fork of lm-evaluation-harness for multilingual evaluation.
Evaluation Commands
```bash
# Italian benchmarks
lm-eval --model hf --model_args pretrained=mii-llm/nesso-0.4B-agentic \
  --tasks m_mmlu_it --num_fewshot 5 --device cuda:0 --batch_size 1

lm-eval --model hf --model_args pretrained=mii-llm/nesso-0.4B-agentic \
  --tasks hellaswag_it,arc_it --device cuda:0 --batch_size 1

lm-eval --model hf --model_args pretrained=mii-llm/nesso-0.4B-agentic \
  --tasks ifeval-ita --device cuda:0 --batch_size 1

# English benchmarks
lm-eval --model hf --model_args pretrained=mii-llm/nesso-0.4B-agentic \
  --tasks mmlu --num_fewshot 5 --device cuda:0 --batch_size 1

lm-eval --model hf --model_args pretrained=mii-llm/nesso-0.4B-agentic \
  --tasks hellaswag,arc --device cuda:0 --batch_size 1

lm-eval --model hf --model_args pretrained=mii-llm/nesso-0.4B-agentic \
  --tasks ifeval --device cuda:0 --batch_size 1
```
Results
English Benchmarks
| Model | IFEval EN ↑ | ARC EN ↑ | HellaSwag EN ↑ | MMLU EN ↑ | Avg EN |
|---|---|---|---|---|---|
| Qwen/Qwen3-0.6B | 0.2758 | 0.3430 | 0.4742 | 0.4013 | 0.3736 |
| Nesso-0.4B-instruct | 0.3465 | 0.3003 | 0.4629 | 0.2871 | 0.3492 |
| Nesso-0.4B-agentic | 0.2962 | 0.2534 | 0.4062 | 0.2889 | 0.3112 |
| LiquidAI/LFM2-350M | 0.1595 | 0.2457 | 0.3092 | 0.3445 | 0.2647 |
Italian Benchmarks
| Model | IFEval IT ↑ | ARC IT ↑ | HellaSwag IT ↑ | MMLU IT ↑ | Avg IT |
|---|---|---|---|---|---|
| Qwen/Qwen3-0.6B | 0.3058 | 0.2729 | 0.3598 | 0.4025 | 0.3353 |
| Nesso-0.4B-instruct | 0.2962 | 0.2874 | 0.4076 | 0.2875 | 0.3197 |
| Nesso-0.4B-agentic | 0.2914 | 0.2541 | 0.3673 | 0.2730 | 0.2965 |
| LiquidAI/LFM2-350M | 0.1427 | 0.2464 | 0.2994 | 0.3132 | 0.2504 |
Overall
| Model | Avg EN | Avg IT | Overall |
|---|---|---|---|
| Qwen/Qwen3-0.6B | 0.3736 | 0.3353 | 0.3545 |
| Nesso-0.4B-instruct | 0.3492 | 0.3197 | 0.3345 |
| Nesso-0.4B-agentic | 0.3112 | 0.2965 | 0.3039 |
| LiquidAI/LFM2-350M | 0.2647 | 0.2504 | 0.2576 |
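The aggregate columns are simple means: each Avg is the mean of the four benchmark scores, and Overall is the mean of Avg EN and Avg IT. A quick check using the Qwen3-0.6B row (table values are rounded to four decimals, so tiny rounding differences are expected):

```python
# Reproduce the aggregate columns from the per-benchmark scores above.
qwen_en = [0.2758, 0.3430, 0.4742, 0.4013]  # IFEval, ARC, HellaSwag, MMLU (EN)
qwen_it = [0.3058, 0.2729, 0.3598, 0.4025]  # same benchmarks, Italian

avg_en = sum(qwen_en) / len(qwen_en)
avg_it = sum(qwen_it) / len(qwen_it)
overall = (avg_en + avg_it) / 2

print(f"Avg EN={avg_en:.4f}  Avg IT={avg_it:.4f}  Overall={overall:.4f}")
```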
Discussion
Nesso-0.4B-Agentic is trained with a specialization trade-off: its post-training data prioritizes structured output fidelity, tool calling accuracy, and agentic planning over general benchmark performance. As a result, scores on standard academic benchmarks (IFEval, MMLU, ARC) are lower than the instruct variant, which is expected behavior for a task-specialized model.
Nesso-0.4B-Agentic still outperforms LiquidAI/LFM2-350M across all benchmarks in both languages, confirming its quality as a competitive small model. Its real-world advantage over general-purpose models of similar size is best assessed on agentic and function-calling tasks rather than academic benchmarks.
Related Models
| Model | Description |
|---|---|
| Zagreus-0.4B-ita | Base pre-trained model (this model's foundation) |
| Nesso-0.4B-instruct | Optimized for conversational and instruction-following tasks |
| Open-Zagreus-0.4B | Fully open-source SFT variant |
Citation
If you use this model in your research, please cite:
```bibtex
@misc{nesso2025,
  title        = {The Joy and Pain of Training an LLM from Scratch:
                  A Technical Report on the Zagreus and Nesso Model Families},
  author       = {mii-llm community},
  year         = {2025},
  howpublished = {\url{https://github.com/mii-llm/zagreus-nesso-slm}},
}
```
Acknowledgements
- Antonio Baldassarra (CEO, Seeweb) and Marco Cristofanilli (Head of AI, Seeweb) for infrastructure sponsorship
- The Hugging Face team for Nanotron, datatrove, FineWeb, and FineWeb-2
- The mii-llm open-source community
License
Released under the Apache 2.0 license.
Made with ❤️ in Italy by mii-llm