Nesso-0.4B-Agentic

Nesso-0.4B-Agentic is a bilingual English/Italian Small Language Model (SLM) optimized for function calling, structured output generation, and agentic execution patterns. It is post-trained on top of Zagreus-0.4B-ita, a foundational model trained from scratch by the mii-llm community (Made in Italy – Large Language Model) on the Seeweb HPC infrastructure.

Designed for sovereign edge inference, Nesso-0.4B-Agentic targets deployment scenarios that require reliable tool use, structured JSON output, and multi-step agentic reasoning — all within a compact ~400M parameter footprint.

⚠️ This model is currently at the SFT (Supervised Fine-Tuning) stage. DPO (Direct Preference Optimization) training is planned and updated results will be published upon completion.


Model Details

| Property | Value |
|---|---|
| Architecture | Modified Llama-3.2 (fully dense) |
| Parameters | ~400M |
| Hidden size | 960 |
| Layers | 32 |
| Attention heads | 15 (KV heads: 5) |
| Context length | 4096 tokens |
| Tokenizer | Llama-3.2 (vocab_size: 128,256) |
| Precision | BF16 |
| Languages | English, Italian |
| Base model | mii-llm/zagreus-0.4B-ita |
| Post-training framework | Axolotl + FSDP |
| Chat template | ChatML |

Training Details

Base Model Pre-training

Nesso-0.4B-Agentic is built on Zagreus-0.4B-ita, which was pre-trained on approximately 1 trillion tokens using the following data mix:

| Dataset | Description |
|---|---|
| FineWeb (350BT sample) | ~350B tokens of English web text |
| FineWeb-2 (ita_Latn) | Italian web text |
| FinePDFs (ita_Latn) | Italian PDF documents |
| StarCoder Data | ~250B tokens of code |

Token distribution: ~400B English + ~400B Italian + ~200B Code
Infrastructure: 64× NVIDIA A100 GPUs (8 nodes × 8 GPUs) on Seeweb HPC
Framework: Nanotron (mii-llm fork)
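
As a sanity check, the stated language/code mix sums to the ~1 trillion token pre-training budget:

```python
# Approximate pre-training token budget, from the mix stated above
english_tokens = 400e9
italian_tokens = 400e9
code_tokens = 200e9

total = english_tokens + italian_tokens + code_tokens
print(f"{total:.0e}")  # 1e+12, i.e. ~1 trillion tokens
```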

Post-training (SFT)

Post-training was performed using Axolotl with FSDP across 4 nodes (32× A100 GPUs).

The instruction dataset is a proprietary bilingual (English/Italian) corpus curated by the mii-llm team, with a dedicated focus on function calling, structured JSON output, tool orchestration, and agentic execution patterns. Built through years of iteration across domains including finance, cybersecurity, and multi-step agentic workflows, it is considered a strategic research asset and is not released as open source.

Key hyperparameters:

| Hyperparameter | Value |
|---|---|
| Optimizer | AdamW (fused) |
| Learning rate | 1e-3 |
| LR scheduler | Cosine (constant ratio: 0.8, min ratio: 0.3) |
| Epochs | 3 |
| Micro batch size | 1 |
| Gradient accumulation steps | 8 |
| Sequence length | 4096 |
| Max grad norm | 1.0 |
| Precision | BF16 + Flash Attention |
| FSDP strategy | FULL_SHARD |
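
From these settings and the 32-GPU setup, the effective global batch per optimizer step follows directly (tokens per step is an upper bound, assuming every sequence is packed to the full 4096-token length):

```python
# Effective batch size for the SFT run; all values come from the
# hyperparameter table and the infrastructure notes above.
micro_batch_size = 1
grad_accum_steps = 8
num_gpus = 32  # 4 nodes x 8 A100s
seq_len = 4096

sequences_per_step = micro_batch_size * grad_accum_steps * num_gpus
tokens_per_step = sequences_per_step * seq_len

print(sequences_per_step)  # 256 sequences per optimizer step
print(tokens_per_step)     # 1048576, i.e. ~1M tokens per optimizer step
```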

Chat Template

This model uses the ChatML format:

```
<|im_start|>system
You are a helpful assistant with access to tools.<|im_end|>
<|im_start|>user
What is the weather in Rome today?<|im_end|>
<|im_start|>assistant
```

Special tokens:

  • pad_token: <|im_end|>
  • eos_token: <|im_end|>
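
The serialization above can be reproduced by hand; `to_chatml` below is a hypothetical helper that mirrors the format shown, useful for understanding what the model sees. In practice, prefer `tokenizer.apply_chat_template` (shown in Usage below), whose exact whitespace may differ.

```python
def to_chatml(messages: list) -> str:
    """Serialize messages into the ChatML format used by this model."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        for m in messages
    ]
    # add_generation_prompt equivalent: open an assistant turn
    # for the model to complete
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)

prompt = to_chatml([
    {"role": "system", "content": "You are a helpful assistant with access to tools."},
    {"role": "user", "content": "What is the weather in Rome today?"},
])
print(prompt)
```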

Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "mii-llm/nesso-0.4B-agentic"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
```


```python
import re

def chat(messages, tools=None, max_tokens=256):
    prompt = tokenizer.apply_chat_template(
        messages,
        tools=tools,
        tokenize=False,
        add_generation_prompt=True
    )

    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    outputs = model.generate(
        **inputs,
        max_new_tokens=max_tokens,
        do_sample=False,  # greedy decoding; temperature/top_p only apply when do_sample=True
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.eos_token_id,
    )

    text = tokenizer.decode(outputs[0], skip_special_tokens=False)

    # Extract the last assistant turn from the ChatML transcript
    blocks = re.findall(
        r"<\|im_start\|>assistant\s*(.*?)<\|im_end\|>",
        text,
        flags=re.S
    )

    answer = blocks[-1].strip() if blocks else text.strip()

    print("\n=== RAW OUTPUT ===\n")
    print(text)
    print("\n=== PARSED ASSISTANT ===\n")
    print(answer)

    return answer

# System prompt (Italian): "You are an assistant that can use tools.
# When external information is needed, call a function.
# Use EXACTLY the expected <tool_call> format."
system_prompt = (
    "Sei un assistente che può usare strumenti.\n"
    "Quando servono informazioni esterne, chiama una funzione.\n"
    "Usa ESATTAMENTE il formato <tool_call> previsto."
)

# ----- TOOL DEFINITIONS -----
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Ritorna il meteo per una città",  # "Returns the weather for a city"
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string"}
                },
                "required": ["city"]
            }
        }
    }
]

# ----- MESSAGES -----
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": "Che tempo fa a Milano?"}  # "What's the weather in Milan?"
]

out = chat(messages, tools=tools)
```
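
This card does not pin down the exact shape of the model's `<tool_call>` emissions. Assuming the common convention of a JSON object wrapped in `<tool_call>...</tool_call>` tags, a parser might look like the sketch below; `parse_tool_calls` is a hypothetical helper, and the sample string is illustrative rather than real model output.

```python
import json
import re

def parse_tool_calls(text: str) -> list:
    """Extract JSON tool calls wrapped in <tool_call>...</tool_call> tags."""
    calls = []
    for block in re.findall(r"<tool_call>\s*(.*?)\s*</tool_call>", text, flags=re.S):
        try:
            calls.append(json.loads(block))
        except json.JSONDecodeError:
            pass  # small models can emit malformed JSON; skip or retry
    return calls

# Illustrative assistant output (not captured from the model)
sample = '<tool_call>{"name": "get_weather", "arguments": {"city": "Milano"}}</tool_call>'
calls = parse_tool_calls(sample)
print(calls)  # [{'name': 'get_weather', 'arguments': {'city': 'Milano'}}]
```

After parsing, the caller executes the matching function and appends its result as a tool message before re-invoking the model.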

💡 Tip: For function calling and structured output tasks, we recommend using a lower temperature (0.1–0.3) to improve JSON validity and output consistency.
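
Concretely, the tip maps onto `generate` arguments like the following; the specific values are illustrative, not tuned settings:

```python
# Sampling settings for JSON / function-calling outputs: a low temperature
# keeps token choices close to greedy while still allowing the sampler to
# escape the repetition loops that pure greedy decoding can fall into.
structured_gen_kwargs = dict(
    max_new_tokens=256,
    do_sample=True,
    temperature=0.2,   # within the recommended 0.1-0.3 range
    top_p=0.9,
)

# outputs = model.generate(**inputs, **structured_gen_kwargs)
```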


Evaluation

We used our fork of lm-evaluation-harness for multilingual evaluation.

Evaluation Commands

```bash
# Italian benchmarks
lm-eval --model hf --model_args pretrained=mii-llm/nesso-0.4B-agentic \
  --tasks m_mmlu_it --num_fewshot 5 --device cuda:0 --batch_size 1

lm-eval --model hf --model_args pretrained=mii-llm/nesso-0.4B-agentic \
  --tasks hellaswag_it,arc_it --device cuda:0 --batch_size 1

lm-eval --model hf --model_args pretrained=mii-llm/nesso-0.4B-agentic \
  --tasks ifeval-ita --device cuda:0 --batch_size 1

# English benchmarks
lm-eval --model hf --model_args pretrained=mii-llm/nesso-0.4B-agentic \
  --tasks mmlu --num_fewshot 5 --device cuda:0 --batch_size 1

lm-eval --model hf --model_args pretrained=mii-llm/nesso-0.4B-agentic \
  --tasks hellaswag,arc --device cuda:0 --batch_size 1

lm-eval --model hf --model_args pretrained=mii-llm/nesso-0.4B-agentic \
  --tasks ifeval --device cuda:0 --batch_size 1
```

Results

English Benchmarks

| Model | IFEval EN ↑ | ARC EN ↑ | HellaSwag EN ↑ | MMLU EN ↑ | Avg EN |
|---|---|---|---|---|---|
| Qwen/Qwen3-0.6B | 0.2758 | 0.3430 | 0.4742 | 0.4013 | 0.3736 |
| Nesso-0.4B-instruct | 0.3465 | 0.3003 | 0.4629 | 0.2871 | 0.3492 |
| Nesso-0.4B-agentic | 0.2962 | 0.2534 | 0.4062 | 0.2889 | 0.3112 |
| LiquidAI/LFM2-350M | 0.1595 | 0.2457 | 0.3092 | 0.3445 | 0.2647 |

Italian Benchmarks

| Model | IFEval IT ↑ | ARC IT ↑ | HellaSwag IT ↑ | MMLU IT ↑ | Avg IT |
|---|---|---|---|---|---|
| Qwen/Qwen3-0.6B | 0.3058 | 0.2729 | 0.3598 | 0.4025 | 0.3353 |
| Nesso-0.4B-instruct | 0.2962 | 0.2874 | 0.4076 | 0.2875 | 0.3197 |
| Nesso-0.4B-agentic | 0.2914 | 0.2541 | 0.3673 | 0.2730 | 0.2965 |
| LiquidAI/LFM2-350M | 0.1427 | 0.2464 | 0.2994 | 0.3132 | 0.2504 |

Overall

| Model | Avg EN | Avg IT | Overall |
|---|---|---|---|
| Qwen/Qwen3-0.6B | 0.3736 | 0.3353 | 0.3545 |
| Nesso-0.4B-instruct | 0.3492 | 0.3197 | 0.3345 |
| Nesso-0.4B-agentic | 0.3112 | 0.2965 | 0.3039 |
| LiquidAI/LFM2-350M | 0.2647 | 0.2504 | 0.2576 |
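
Avg EN/IT are the means of each row's four benchmark scores, and Overall is the mean of the two language averages. A quick check for the Nesso-0.4B-agentic row:

```python
# Recompute the summary columns from the per-task scores above.
en_scores = [0.2962, 0.2534, 0.4062, 0.2889]  # IFEval, ARC, HellaSwag, MMLU (EN)
it_scores = [0.2914, 0.2541, 0.3673, 0.2730]  # IFEval, ARC, HellaSwag, MMLU (IT)

avg_en = sum(en_scores) / len(en_scores)
avg_it = sum(it_scores) / len(it_scores)
overall = (avg_en + avg_it) / 2

# ≈ 0.3112, 0.2965, 0.3039 — matches the tables to within rounding
print(avg_en, avg_it, overall)
```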

Discussion

Nesso-0.4B-Agentic is trained with a specialization trade-off: its post-training data prioritizes structured output fidelity, tool calling accuracy, and agentic planning over general benchmark performance. As a result, scores on standard academic benchmarks (IFEval, MMLU, ARC) are lower than the instruct variant, which is expected behavior for a task-specialized model.

Nesso-0.4B-Agentic still outperforms LiquidAI/LFM2-350M across all benchmarks in both languages, confirming its quality as a competitive small model. Its real-world advantage over general-purpose models of similar size is best assessed on agentic and function-calling tasks rather than academic benchmarks.


Related Models

| Model | Description |
|---|---|
| Zagreus-0.4B-ita | Base pre-trained model (this model's foundation) |
| Nesso-0.4B-instruct | Optimized for conversational and instruction-following tasks |
| Open-Zagreus-0.4B | Fully open-source SFT variant |

Citation

If you use this model in your research, please cite:

```bibtex
@misc{nesso2025,
  title        = {The Joy and Pain of Training an LLM from Scratch:
                  A Technical Report on the Zagreus and Nesso Model Families},
  author       = {mii-llm community},
  year         = {2025},
  howpublished = {\url{https://github.com/mii-llm/zagreus-nesso-slm}},
}
```

Acknowledgements

  • Antonio Baldassarra (CEO, Seeweb) and Marco Cristofanilli (Head of AI, Seeweb) for infrastructure sponsorship
  • The Hugging Face team for Nanotron, datatrove, FineWeb, and FineWeb-2
  • The mii-llm open-source community

License

Released under the Apache 2.0 license.

Made with ❤️ in Italy by mii-llm
