Nesso-0.4B-Agentic

Nesso-0.4B-Agentic is a bilingual English/Italian Small Language Model (SLM) optimized for function calling, structured output generation, and agentic execution patterns. It is post-trained on top of Zagreus-0.4B-ita, a foundational model trained from scratch by the mii-llm community (Made in Italy – Large Language Model) on the Seeweb HPC infrastructure.

Designed for sovereign edge inference, Nesso-0.4B-Agentic targets deployment scenarios that require reliable tool use, structured JSON output, and multi-step agentic reasoning — all within a compact ~400M parameter footprint.

⚠️ This model is currently at the SFT (Supervised Fine-Tuning) stage. DPO (Direct Preference Optimization) training is planned and updated results will be published upon completion.


Model Details

| Property | Value |
|---|---|
| Architecture | Modified Llama-3.2 (fully dense) |
| Parameters | ~400M |
| Hidden size | 960 |
| Layers | 32 |
| Attention heads | 15 (KV heads: 5) |
| Context length | 4096 tokens |
| Tokenizer | Llama-3.2 (vocab_size: 128,256) |
| Precision | BF16 |
| Languages | English, Italian |
| Base model | mii-llm/zagreus-0.4B-ita |
| Post-training framework | Axolotl + FSDP |
| Chat template | ChatML |

Training Details

Base Model Pre-training

Nesso-0.4B-Agentic is built on Zagreus-0.4B-ita, which was pre-trained on approximately 1 trillion tokens using the following data mix:

| Dataset | Description |
|---|---|
| FineWeb (350BT sample) | ~350B tokens of English web text |
| FineWeb-2 (ita_Latn) | Italian web text |
| FinePDFs (ita_Latn) | Italian PDF documents |
| StarCoder Data | ~250B tokens of code |

Token distribution: ~400B English + ~400B Italian + ~200B Code
Infrastructure: 64× NVIDIA A100 GPUs (8 nodes × 8 GPUs) on Seeweb HPC
Framework: Nanotron (mii-llm fork)
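
As a sanity check, the stated language/code mix sums to the ~1 trillion token pre-training budget:

```python
# Approximate pre-training token budget, from the mix stated above
english_tokens = 400e9
italian_tokens = 400e9
code_tokens = 200e9

total = english_tokens + italian_tokens + code_tokens
print(f"{total:.0e}")  # 1e+12, i.e. ~1 trillion tokens
```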

Post-training (SFT)

Post-training was performed using Axolotl with FSDP across 4 nodes (32× A100 GPUs).

The instruction dataset is a proprietary bilingual (English/Italian) corpus curated by the mii-llm team, with a dedicated focus on function calling, structured JSON output, tool orchestration, and agentic execution patterns. Built through years of iteration across domains including finance, cybersecurity, and multi-step agentic workflows, it is considered a strategic research asset and is not released as open source.

Key hyperparameters:

| Hyperparameter | Value |
|---|---|
| Optimizer | AdamW (fused) |
| Learning rate | 1e-3 |
| LR scheduler | Cosine (constant ratio: 0.8, min ratio: 0.3) |
| Epochs | 3 |
| Micro batch size | 1 |
| Gradient accumulation steps | 8 |
| Sequence length | 4096 |
| Max grad norm | 1.0 |
| Precision | BF16 + Flash Attention |
| FSDP strategy | FULL_SHARD |
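
From these settings and the 32-GPU setup, the effective global batch per optimizer step follows directly (tokens per step is an upper bound, assuming every sequence is packed to the full 4096-token length):

```python
# Effective batch size for the SFT run; all values come from the
# hyperparameter table and the infrastructure notes above.
micro_batch_size = 1
grad_accum_steps = 8
num_gpus = 32  # 4 nodes x 8 A100s
seq_len = 4096

sequences_per_step = micro_batch_size * grad_accum_steps * num_gpus
tokens_per_step = sequences_per_step * seq_len

print(sequences_per_step)  # 256 sequences per optimizer step
print(tokens_per_step)     # 1048576, i.e. ~1M tokens per optimizer step
```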

Chat Template

This model uses the ChatML format:

```
<|im_start|>system
You are a helpful assistant with access to tools.<|im_end|>
<|im_start|>user
What is the weather in Rome today?<|im_end|>
<|im_start|>assistant
```

Special tokens:

  • pad_token: <|im_end|>
  • eos_token: <|im_end|>
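
The serialization above can be reproduced by hand; `to_chatml` below is a hypothetical helper that mirrors the format shown, useful for understanding what the model sees. In practice, prefer `tokenizer.apply_chat_template` (shown in Usage below), whose exact whitespace may differ.

```python
def to_chatml(messages: list) -> str:
    """Serialize messages into the ChatML format used by this model."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        for m in messages
    ]
    # add_generation_prompt equivalent: open an assistant turn
    # for the model to complete
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)

prompt = to_chatml([
    {"role": "system", "content": "You are a helpful assistant with access to tools."},
    {"role": "user", "content": "What is the weather in Rome today?"},
])
print(prompt)
```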

Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "mii-llm/nesso-0.4B-agentic"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
```


```python
import re

def chat(messages, tools=None, max_tokens=256):
    prompt = tokenizer.apply_chat_template(
        messages,
        tools=tools,
        tokenize=False,
        add_generation_prompt=True
    )

    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    outputs = model.generate(
        **inputs,
        max_new_tokens=max_tokens,
        do_sample=False,  # greedy decoding; temperature/top_p only apply when do_sample=True
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.eos_token_id,
    )

    text = tokenizer.decode(outputs[0], skip_special_tokens=False)

    # Extract the last assistant turn from the ChatML transcript
    blocks = re.findall(
        r"<\|im_start\|>assistant\s*(.*?)<\|im_end\|>",
        text,
        flags=re.S
    )

    answer = blocks[-1].strip() if blocks else text.strip()

    print("\n=== RAW OUTPUT ===\n")
    print(text)
    print("\n=== PARSED ASSISTANT ===\n")
    print(answer)

    return answer

# System prompt (Italian): "You are an assistant that can use tools.
# When external information is needed, call a function.
# Use EXACTLY the expected <tool_call> format."
system_prompt = (
    "Sei un assistente che può usare strumenti.\n"
    "Quando servono informazioni esterne, chiama una funzione.\n"
    "Usa ESATTAMENTE il formato <tool_call> previsto."
)

# ----- TOOL DEFINITIONS -----
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Ritorna il meteo per una città",  # "Returns the weather for a city"
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string"}
                },
                "required": ["city"]
            }
        }
    }
]

# ----- MESSAGES -----
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": "Che tempo fa a Milano?"}  # "What's the weather in Milan?"
]

out = chat(messages, tools=tools)
```
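
This card does not pin down the exact shape of the model's `<tool_call>` emissions. Assuming the common convention of a JSON object wrapped in `<tool_call>...</tool_call>` tags, a parser might look like the sketch below; `parse_tool_calls` is a hypothetical helper, and the sample string is illustrative rather than real model output.

```python
import json
import re

def parse_tool_calls(text: str) -> list:
    """Extract JSON tool calls wrapped in <tool_call>...</tool_call> tags."""
    calls = []
    for block in re.findall(r"<tool_call>\s*(.*?)\s*</tool_call>", text, flags=re.S):
        try:
            calls.append(json.loads(block))
        except json.JSONDecodeError:
            pass  # small models can emit malformed JSON; skip or retry
    return calls

# Illustrative assistant output (not captured from the model)
sample = '<tool_call>{"name": "get_weather", "arguments": {"city": "Milano"}}</tool_call>'
calls = parse_tool_calls(sample)
print(calls)  # [{'name': 'get_weather', 'arguments': {'city': 'Milano'}}]
```

After parsing, the caller executes the matching function and appends its result as a tool message before re-invoking the model.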

💡 Tip: For function calling and structured output tasks, we recommend using a lower temperature (0.1–0.3) to improve JSON validity and output consistency.
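
Concretely, the tip maps onto `generate` arguments like the following; the specific values are illustrative, not tuned settings:

```python
# Sampling settings for JSON / function-calling outputs: a low temperature
# keeps token choices close to greedy while still allowing the sampler to
# escape the repetition loops that pure greedy decoding can fall into.
structured_gen_kwargs = dict(
    max_new_tokens=256,
    do_sample=True,
    temperature=0.2,   # within the recommended 0.1-0.3 range
    top_p=0.9,
)

# outputs = model.generate(**inputs, **structured_gen_kwargs)
```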


Evaluation

We used our fork of lm-evaluation-harness for multilingual evaluation.

Evaluation Commands

```bash
# Italian benchmarks
lm-eval --model hf --model_args pretrained=mii-llm/nesso-0.4B-agentic \
  --tasks m_mmlu_it --num_fewshot 5 --device cuda:0 --batch_size 1

lm-eval --model hf --model_args pretrained=mii-llm/nesso-0.4B-agentic \
  --tasks hellaswag_it,arc_it --device cuda:0 --batch_size 1

lm-eval --model hf --model_args pretrained=mii-llm/nesso-0.4B-agentic \
  --tasks ifeval-ita --device cuda:0 --batch_size 1

# English benchmarks
lm-eval --model hf --model_args pretrained=mii-llm/nesso-0.4B-agentic \
  --tasks mmlu --num_fewshot 5 --device cuda:0 --batch_size 1

lm-eval --model hf --model_args pretrained=mii-llm/nesso-0.4B-agentic \
  --tasks hellaswag,arc --device cuda:0 --batch_size 1

lm-eval --model hf --model_args pretrained=mii-llm/nesso-0.4B-agentic \
  --tasks ifeval --device cuda:0 --batch_size 1
```

Results

English Benchmarks

| Model | IFEval EN ↑ | ARC EN ↑ | HellaSwag EN ↑ | MMLU EN ↑ | Avg EN |
|---|---|---|---|---|---|
| Qwen/Qwen3-0.6B | 0.2758 | 0.3430 | 0.4742 | 0.4013 | 0.3736 |
| Nesso-0.4B-instruct | 0.3465 | 0.3003 | 0.4629 | 0.2871 | 0.3492 |
| Nesso-0.4B-agentic | 0.2962 | 0.2534 | 0.4062 | 0.2889 | 0.3112 |
| LiquidAI/LFM2-350M | 0.1595 | 0.2457 | 0.3092 | 0.3445 | 0.2647 |

Italian Benchmarks

| Model | IFEval IT ↑ | ARC IT ↑ | HellaSwag IT ↑ | MMLU IT ↑ | Avg IT |
|---|---|---|---|---|---|
| Qwen/Qwen3-0.6B | 0.3058 | 0.2729 | 0.3598 | 0.4025 | 0.3353 |
| Nesso-0.4B-instruct | 0.2962 | 0.2874 | 0.4076 | 0.2875 | 0.3197 |
| Nesso-0.4B-agentic | 0.2914 | 0.2541 | 0.3673 | 0.2730 | 0.2965 |
| LiquidAI/LFM2-350M | 0.1427 | 0.2464 | 0.2994 | 0.3132 | 0.2504 |

Overall

| Model | Avg EN | Avg IT | Overall |
|---|---|---|---|
| Qwen/Qwen3-0.6B | 0.3736 | 0.3353 | 0.3545 |
| Nesso-0.4B-instruct | 0.3492 | 0.3197 | 0.3345 |
| Nesso-0.4B-agentic | 0.3112 | 0.2965 | 0.3039 |
| LiquidAI/LFM2-350M | 0.2647 | 0.2504 | 0.2576 |
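
Avg EN/IT are the means of each row's four benchmark scores, and Overall is the mean of the two language averages. A quick check for the Nesso-0.4B-agentic row:

```python
# Recompute the summary columns from the per-task scores above.
en_scores = [0.2962, 0.2534, 0.4062, 0.2889]  # IFEval, ARC, HellaSwag, MMLU (EN)
it_scores = [0.2914, 0.2541, 0.3673, 0.2730]  # IFEval, ARC, HellaSwag, MMLU (IT)

avg_en = sum(en_scores) / len(en_scores)
avg_it = sum(it_scores) / len(it_scores)
overall = (avg_en + avg_it) / 2

# ≈ 0.3112, 0.2965, 0.3039 — matches the tables to within rounding
print(avg_en, avg_it, overall)
```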

Discussion

Nesso-0.4B-Agentic is trained with a specialization trade-off: its post-training data prioritizes structured output fidelity, tool calling accuracy, and agentic planning over general benchmark performance. As a result, scores on standard academic benchmarks (IFEval, MMLU, ARC) are lower than the instruct variant, which is expected behavior for a task-specialized model.

Nesso-0.4B-Agentic still outperforms LiquidAI/LFM2-350M across all benchmarks in both languages, confirming its quality as a competitive small model. Its real-world advantage over general-purpose models of similar size is best assessed on agentic and function-calling tasks rather than academic benchmarks.


Related Models

| Model | Description |
|---|---|
| Zagreus-0.4B-ita | Base pre-trained model (this model's foundation) |
| Nesso-0.4B-instruct | Optimized for conversational and instruction-following tasks |
| Open-Zagreus-0.4B | Fully open-source SFT variant |

Citation

If you use this model in your research, please cite:

```bibtex
@misc{nesso2025,
  title        = {The Joy and Pain of Training an LLM from Scratch:
                  A Technical Report on the Zagreus and Nesso Model Families},
  author       = {mii-llm community},
  year         = {2025},
  howpublished = {\url{https://github.com/mii-llm/zagreus-nesso-slm}},
}
```

Acknowledgements

  • Antonio Baldassarra (CEO, Seeweb) and Marco Cristofanilli (Head of AI, Seeweb) for infrastructure sponsorship
  • The Hugging Face team for Nanotron, datatrove, FineWeb, and FineWeb-2
  • The mii-llm open-source community

License

Released under the Apache 2.0 license.

Made with ❤️ in Italy by mii-llm
