Nesso-0.4B-Agentic-MLX

Nesso-0.4B-Agentic-MLX is the Apple Silicon-optimized version of Nesso-0.4B-Agentic, converted to the MLX format for high-performance inference on Apple M-series chips.

It is a bilingual English/Italian Small Language Model (SLM) optimized for function calling, structured output generation, and agentic execution patterns. It is post-trained on top of Zagreus-0.4B-ita, a foundation model trained from scratch by the mii-llm (Made in Italy – Large Language Model) community on the Seeweb HPC infrastructure.

Designed for sovereign edge inference, Nesso-0.4B-Agentic targets deployment scenarios that require reliable tool use, structured JSON output, and multi-step agentic reasoning — all within a compact ~400M parameter footprint.

⚠️ This model is currently at the SFT (Supervised Fine-Tuning) stage. DPO (Direct Preference Optimization) training is planned and updated results will be published upon completion.


Model Details

| Property | Value |
|---|---|
| Architecture | Modified Llama-3.2 (fully dense) |
| Parameters | ~400M |
| Hidden size | 960 |
| Layers | 32 |
| Attention heads | 15 (KV heads: 5) |
| Context length | 4096 tokens |
| Tokenizer | Llama-3.2 (vocab_size: 128,256) |
| Format | MLX |
| Languages | English, Italian |
| Base model | mii-llm/nesso-0.4B-agentic |
| Post-training framework | Axolotl + FSDP |
| Chat template | ChatML |
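
The architecture values above can be double-checked directly against the repository's config.json. The sketch below is illustrative only and assumes the MLX conversion keeps the standard Llama-style field names:

import json

from huggingface_hub import hf_hub_download

# Download config.json from the Hub and print the architecture fields
# referenced in the table above (field names are an assumption based on
# the standard Llama-style config layout).
config_path = hf_hub_download(
    repo_id="mlx-community/nesso-0.4B-agentic-mlx",
    filename="config.json",
)

with open(config_path) as f:
    config = json.load(f)

for key in (
    "hidden_size",
    "num_hidden_layers",
    "num_attention_heads",
    "num_key_value_heads",
    "max_position_embeddings",
    "vocab_size",
):
    print(f"{key}: {config.get(key)}")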

Chat Template

This model uses the ChatML format:


<|im_start|>system
You are a helpful assistant with access to tools.<|im_end|>
<|im_start|>user
What is the weather in Rome today?<|im_end|>
<|im_start|>assistant

Special tokens:

  • pad_token: <|im_end|>
  • eos_token: <|im_end|>
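
A prompt in this format can also be assembled by hand, which is useful for runtimes that do not expose the tokenizer's chat template. A minimal sketch (verify the result against tokenizer.apply_chat_template, shown in the Usage section below):

def build_chatml_prompt(messages):
    # Assemble a ChatML prompt string from a list of
    # {"role": ..., "content": ...} messages, ending with an open
    # assistant turn so the model continues from there.
    prompt = ""
    for message in messages:
        prompt += f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
    prompt += "<|im_start|>assistant\n"
    return prompt

messages = [
    {"role": "system", "content": "You are a helpful assistant with access to tools."},
    {"role": "user", "content": "What is the weather in Rome today?"},
]
print(build_chatml_prompt(messages))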

Usage

Installation

pip install mlx-lm

Inference via Python

from mlx_lm import load, generate

model_id = "mlx-community/nesso-0.4B-agentic-mlx" 

model, tokenizer = load(model_id)

# The system prompt below is in Italian; in English it reads:
# "You are an assistant that can use tools. When external information
# is needed, call a function. Use EXACTLY the expected <tool_call> format."
system_prompt = (
    "Sei un assistente che può usare strumenti.\n"
    "Quando servono informazioni esterne, chiama una funzione.\n"
    "Usa ESATTAMENTE il formato <tool_call> previsto."
)

messages = [
    {"role": "system", "content": system_prompt},
    # "Che tempo fa a Milano?" = "What's the weather like in Milan?"
    {"role": "user", "content": "Che tempo fa a Milano?"}
]

prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

response = generate(model, tokenizer, prompt=prompt, verbose=True, temp=0.3, max_tokens=256)
print(response)
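
The system prompt above asks the model to reply with a <tool_call> block when it needs external information. The exact payload layout is not documented here, so the parser below is only a sketch built on the response from the example above: it assumes a JSON object with name and arguments fields inside <tool_call>...</tool_call> tags, and should be adapted to whatever the model actually emits.

import json
import re

def extract_tool_call(text):
    # Look for a <tool_call>...</tool_call> block and parse its body as JSON.
    # The {"name": ..., "arguments": {...}} payload shape is an assumption,
    # not a documented contract of this model.
    match = re.search(r"<tool_call>\s*(.*?)\s*</tool_call>", text, re.DOTALL)
    if match is None:
        return None
    try:
        return json.loads(match.group(1))
    except json.JSONDecodeError:
        return None

tool_call = extract_tool_call(response)
if tool_call is not None:
    print("Function:", tool_call.get("name"))
    print("Arguments:", tool_call.get("arguments"))
else:
    print("No tool call found; raw response:", response)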

Inference via Terminal

# $'...' quoting lets the shell expand \n into real newlines, matching the ChatML template
python -m mlx_lm.generate --model mlx-community/nesso-0.4B-agentic-mlx \
  --prompt $'<|im_start|>system\nSei un assistente che può usare strumenti.<|im_end|>\n<|im_start|>user\nChe tempo fa a Milano?<|im_end|>\n<|im_start|>assistant\n' \
  --temp 0.3 --max-tokens 256

💡 Tip: For function calling and structured output tasks, we recommend using a lower temperature (0.1–0.3) to improve JSON validity and output consistency.


Training Details

Base Model Pre-training

Nesso-0.4B-Agentic is built on Zagreus-0.4B-ita, which was pre-trained on approximately 1 trillion tokens using the following data mix:

| Dataset | Description |
|---|---|
| FineWeb (350BT sample) | ~350B tokens of English web text |
| FineWeb-2 (ita_Latn) | Italian web text |
| FinePDFs (ita_Latn) | Italian PDF documents |
| StarCoder Data | ~250B tokens of code |

Token distribution: ~400B English + ~400B Italian + ~200B Code

Infrastructure: 64× NVIDIA A100 GPUs (8 nodes × 8 GPUs) on Seeweb HPC

Framework: Nanotron (mii-llm fork)

Post-training (SFT)

Post-training was performed using Axolotl with FSDP across 4 nodes (32× A100 GPUs).

The instruction dataset is a proprietary bilingual (English/Italian) corpus curated by the mii-llm team, with a dedicated focus on function calling, structured JSON output, tool orchestration, and agentic execution patterns. It was built through years of iteration across domains including finance, cybersecurity, and multi-step agentic workflows, and is considered a strategic research asset; it is not released as open source.

Key hyperparameters:

| Hyperparameter | Value |
|---|---|
| Optimizer | AdamW (fused) |
| Learning rate | 1e-3 |
| LR scheduler | Cosine (constant ratio: 0.8, min ratio: 0.3) |
| Epochs | 3 |
| Micro batch size | 1 |
| Gradient accumulation steps | 8 |
| Sequence length | 4096 |
| Max grad norm | 1.0 |
| Precision | BF16 + Flash Attention |
| FSDP strategy | FULL_SHARD |
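
For reference, a quick back-of-the-envelope calculation of the effective batch size implied by these settings, assuming one data-parallel rank per GPU across the 32 A100s mentioned above (FULL_SHARD shards parameters, not the batch dimension):

# Effective batch size implied by the SFT hyperparameters above.
# The one-rank-per-GPU layout is an assumption; the exact parallelism
# configuration is not documented here.
micro_batch_size = 1
gradient_accumulation_steps = 8
num_gpus = 32            # 4 nodes x 8 A100s
sequence_length = 4096

sequences_per_step = micro_batch_size * gradient_accumulation_steps * num_gpus
tokens_per_step = sequences_per_step * sequence_length  # upper bound, if every sequence is packed to 4096 tokens

print(f"Sequences per optimizer step: {sequences_per_step}")   # 256
print(f"Tokens per optimizer step: ~{tokens_per_step:,}")      # ~1,048,576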

Evaluation

We used our fork of lm-evaluation-harness for multilingual evaluation.

Evaluation Commands

# Italian benchmarks
lm-eval --model hf --model_args pretrained=mii-llm/nesso-0.4B-agentic \
  --tasks m_mmlu_it --num_fewshot 5 --device cuda:0 --batch_size 1

lm-eval --model hf --model_args pretrained=mii-llm/nesso-0.4B-agentic \
  --tasks hellaswag_it,arc_it --device cuda:0 --batch_size 1

lm-eval --model hf --model_args pretrained=mii-llm/nesso-0.4B-agentic \
  --tasks ifeval-ita --device cuda:0 --batch_size 1

# English benchmarks
lm-eval --model hf --model_args pretrained=mii-llm/nesso-0.4B-agentic \
  --tasks mmlu --num_fewshot 5 --device cuda:0 --batch_size 1

lm-eval --model hf --model_args pretrained=mii-llm/nesso-0.4B-agentic \
  --tasks hellaswag,arc --device cuda:0 --batch_size 1

lm-eval --model hf --model_args pretrained=mii-llm/nesso-0.4B-agentic \
  --tasks ifeval --device cuda:0 --batch_size 1
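
The same evaluations can also be driven from Python. A minimal sketch using the harness's simple_evaluate entry point, mirroring the first Italian command above and assuming the mii-llm fork keeps the upstream API and the Italian task names:

import lm_eval

# Run the 5-shot Italian MMLU benchmark programmatically; this assumes the
# fork exposes the same simple_evaluate() interface as upstream
# lm-evaluation-harness and registers the m_mmlu_it task.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=mii-llm/nesso-0.4B-agentic",
    tasks=["m_mmlu_it"],
    num_fewshot=5,
    batch_size=1,
    device="cuda:0",
)

for task, metrics in results["results"].items():
    print(task, metrics)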

Results

English Benchmarks

| Model | IFEval EN ↑ | ARC EN ↑ | HellaSwag EN ↑ | MMLU EN ↑ | Avg EN |
|---|---|---|---|---|---|
| Qwen/Qwen3-0.6B | 0.2758 | 0.3430 | 0.4742 | 0.4013 | 0.3736 |
| Nesso-0.4B-instruct | 0.3465 | 0.3003 | 0.4629 | 0.2871 | 0.3492 |
| Nesso-0.4B-agentic | 0.2962 | 0.2534 | 0.4062 | 0.2889 | 0.3112 |
| LiquidAI/LFM2-350M | 0.1595 | 0.2457 | 0.3092 | 0.3445 | 0.2647 |

Italian Benchmarks

| Model | IFEval IT ↑ | ARC IT ↑ | HellaSwag IT ↑ | MMLU IT ↑ | Avg IT |
|---|---|---|---|---|---|
| Qwen/Qwen3-0.6B | 0.3058 | 0.2729 | 0.3598 | 0.4025 | 0.3353 |
| Nesso-0.4B-instruct | 0.2962 | 0.2874 | 0.4076 | 0.2875 | 0.3197 |
| Nesso-0.4B-agentic | 0.2914 | 0.2541 | 0.3673 | 0.2730 | 0.2965 |
| LiquidAI/LFM2-350M | 0.1427 | 0.2464 | 0.2994 | 0.3132 | 0.2504 |

Overall

| Model | Avg EN | Avg IT | Overall |
|---|---|---|---|
| Qwen/Qwen3-0.6B | 0.3736 | 0.3353 | 0.3545 |
| Nesso-0.4B-instruct | 0.3492 | 0.3197 | 0.3345 |
| Nesso-0.4B-agentic | 0.3112 | 0.2965 | 0.3039 |
| LiquidAI/LFM2-350M | 0.2647 | 0.2504 | 0.2576 |

Discussion

Nesso-0.4B-Agentic is trained with a specialization trade-off: its post-training data prioritizes structured output fidelity, tool calling accuracy, and agentic planning over general benchmark performance. As a result, scores on standard academic benchmarks (IFEval, MMLU, ARC) are lower than the instruct variant, which is expected behavior for a task-specialized model.

Nesso-0.4B-Agentic still outperforms LiquidAI/LFM2-350M across all benchmarks in both languages, confirming its quality as a competitive small model. Its real-world advantage over general-purpose models of similar size is best assessed on agentic and function-calling tasks rather than academic benchmarks.


Related Models

| Model | Description |
|---|---|
| Zagreus-0.4B-ita | Base pre-trained model (this model's foundation) |
| Nesso-0.4B-instruct | Optimized for conversational and instruction-following tasks |
| Open-Zagreus-0.4B | Fully open-source SFT variant |

Citation

If you use this model in your research, please cite:

@misc{nesso2025,
  title        = {The Joy and Pain of Training an LLM from Scratch:
                  A Technical Report on the Zagreus and Nesso Model Families},
  author       = {mii-llm community},
  year         = {2025},
  howpublished = {\url{https://github.com/mii-llm/zagreus-nesso-slm}},
}

Acknowledgements

  • Antonio Baldassarra (CEO, Seeweb) and Marco Cristofanilli (Head of AI, Seeweb) for infrastructure sponsorship
  • The Hugging Face team for Nanotron, datatrove, FineWeb, and FineWeb-2
  • The mii-llm open-source community

License

Released under the Apache 2.0 license.

Made with ❤️ in Italy by mii-llm

