Open-Zagreus-0.4B

Open-Zagreus-0.4B is a fully open-source bilingual English/Italian Small Language Model (SLM) — open data, open weights, open recipe. It is post-trained on top of Zagreus-0.4B-ita using the publicly available OpenItalianData dataset published by Michele Montebovi, making the entire pipeline — from pre-training data to final weights — fully reproducible by anyone.

This model is released by the mii-llm community (Made in Italy – Large Language Model) as a contribution to the open-source Italian NLP ecosystem, demonstrating that it is possible to build competitive English/Italian language models using exclusively open resources.

Fully open: all training data, model weights, and training recipes are publicly available and reproducible.


Model Details

| Property | Value |
|---|---|
| Architecture | Modified Llama-3.2 (fully dense) |
| Parameters | ~400M |
| Hidden size | 960 |
| Layers | 32 |
| Attention heads | 15 (KV heads: 5) |
| Context length | 4096 tokens |
| Tokenizer | Llama-3.2 (vocab_size: 128,256) |
| Precision | BF16 |
| Languages | English, Italian |
| Base model | mii-llm/zagreus-0.4B-ita |
| SFT dataset | DeepMount00/OpenItalianData |
| Post-training framework | Axolotl + FSDP |
| Chat template | ChatML |
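
A quick sanity check on the attention geometry these numbers imply (derived values, not stated explicitly in the card):

```python
# Attention geometry derived from the model details above.
hidden_size = 960
num_heads = 15
num_kv_heads = 5

head_dim = hidden_size // num_heads     # 960 / 15 = 64 dims per head
gqa_groups = num_heads // num_kv_heads  # 3 query heads share each KV head (GQA)

print(head_dim, gqa_groups)  # 64 3
```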

Training Details

Base Model Pre-training

Open-Zagreus-0.4B is built on Zagreus-0.4B-ita, pre-trained on approximately 1 trillion tokens:

| Dataset | Description |
|---|---|
| FineWeb (350BT sample) | ~350B tokens of English web text |
| FineWeb-2 (ita_Latn) | Italian web text |
| FinePDFs (ita_Latn) | Italian PDF documents |
| StarCoder Data | ~250B tokens of code |

Token distribution: ~400B English + ~400B Italian + ~200B Code
Infrastructure: 64× NVIDIA A100 GPUs (8 nodes × 8 GPUs) on Seeweb HPC
Pre-training framework: Nanotron (mii-llm fork)

Post-training (SFT)

Post-training was performed using Axolotl with FSDP across 4 nodes (32× A100 GPUs), using the fully public OpenItalianData dataset.

SFT dataset: DeepMount00/OpenItalianData
Dataset author: Michele Montebovi

Key hyperparameters:

| Hyperparameter | Value |
|---|---|
| Optimizer | AdamW (fused) |
| Learning rate | 1e-3 |
| LR scheduler | Cosine (constant ratio: 0.8, min ratio: 0.3) |
| Epochs | 3 |
| Micro batch size | 1 |
| Gradient accumulation steps | 8 |
| Sequence length | 4096 |
| Max grad norm | 1.0 |
| Precision | BF16 + Flash Attention |
| FSDP strategy | FULL_SHARD |
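
For context, these settings imply the following effective global batch, sketched below (the GPU count of 32 is taken from the post-training description above; actual token throughput also depends on how densely sample packing fills each sequence):

```python
# Effective batch size and tokens per optimizer step for the SFT run.
# The GPU count (32 = 4 nodes x 8 A100s) comes from the post-training setup above.
micro_batch_size = 1
grad_accum_steps = 8
num_gpus = 32
sequence_len = 4096

effective_batch = micro_batch_size * grad_accum_steps * num_gpus
tokens_per_step = effective_batch * sequence_len

print(effective_batch)  # 256 sequences per optimizer step
print(tokens_per_step)  # 1048576 tokens (~1M) per step at full packing
```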

Full Axolotl Configuration

base_model: giux78/zagreus-0.4B-ita
strict: false
output_dir: ./ale_outputs/opendata-zagreus-sft-final
seed: 42
chat_template_jinja: "{%- for message in messages -%}\n    {{- \"<|im_start|>\" + message.role + \"\\n\" + message.content + \"<|im_end|>\" + \"\\n\" -}}\n{%- endfor -%}\n{%- if add_generation_prompt -%}\n\t{{- \"<|im_start|>assistant\\n\" -}}\n{%- endif -%}"
datasets:
  - path: /training/openitaliandata
    type: chat_template
    field_messages: conversation
    roles_to_train: ["assistant"]
    train_on_eos: turn

dataset_prepared_path: ./ale_outputs/dataset_cache/opendata-zagreus-sft

sequence_len: 4096
sample_packing: true
eval_sample_packing: true
pad_to_sequence_len: true

cosine_constant_lr_ratio: 0.8
cosine_min_lr_ratio: 0.3

optimizer: adamw_torch_fused
lr_scheduler: constant
learning_rate: 1.0e-03

max_grad_norm: 1.0
micro_batch_size: 1
gradient_accumulation_steps: 8
num_epochs: 3

bf16: auto
flash_attention: true
gradient_checkpointing: true

logging_steps: 10
eval_strategy: steps
eval_steps: 300
save_strategy: steps
save_steps: 500
save_total_limit: 3
val_set_size: 10000

fsdp_config:
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
  fsdp_backward_prefetch_policy: BACKWARD_PRE
  fsdp_state_dict_type: FULL_STATE_DICT

special_tokens:
  pad_token: <|im_end|>
  eos_token: <|im_end|>

Chat Template

This model uses the ChatML format:

<|im_start|>system
Sei un assistente utile.<|im_end|>
<|im_start|>user
Ciao! Come posso imparare l'italiano?<|im_end|>
<|im_start|>assistant

Special tokens:

  • pad_token: <|im_end|>
  • eos_token: <|im_end|>
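
In practice, `tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)` produces this format for you. As a standalone illustration, here is a minimal Python sketch of the same ChatML rendering, mirroring the Jinja template in the Axolotl config above:

```python
def render_chatml(messages, add_generation_prompt=True):
    """Render messages in ChatML, mirroring the Jinja template in the config."""
    text = ""
    for m in messages:
        text += "<|im_start|>" + m["role"] + "\n" + m["content"] + "<|im_end|>\n"
    if add_generation_prompt:
        text += "<|im_start|>assistant\n"
    return text

prompt = render_chatml([
    {"role": "system", "content": "Sei un assistente utile."},
    {"role": "user", "content": "Ciao! Come posso imparare l'italiano?"},
])
print(prompt)
```

The output matches the example above exactly, ending with the open assistant turn that the model completes.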

Usage

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "mii-llm/open-zagreus-0.4B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

messages = [
    {"role": "system", "content": "Sei un assistente utile e preciso."},
    {"role": "user", "content": "Raccontami qualcosa di interessante sull'Italia."}
]

input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

output = model.generate(
    input_ids,
    max_new_tokens=512,
    temperature=0.7,
    do_sample=True
)

print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))

Evaluation

Standard Benchmarks

Evaluation Command

lm-eval --model hf --model_args pretrained=giux78/Open-Zagreus-0.4B \
  --tasks m_mmlu_it,arc_it,hellaswag_it --device cuda:0 --batch_size 1

Results

| Model | MMLU IT ↑ | ARC IT ↑ | HellaSwag IT ↑ | Average |
|---|---|---|---|---|
| Open-Zagreus-0.4B | 0.2530 | 0.3020 | 0.3608 | 0.3053 |

Evalita Benchmark

Evalita is a comprehensive Italian NLP evaluation suite covering a wide range of linguistic tasks. We evaluate Open-Zagreus-0.4B using the evalita-mp tasks and compare it directly against its base model (Zagreus-0.4B-ita) to measure the impact of SFT.

Evaluation Command

lm_eval --model hf \
  --model_args pretrained=giux78/Open-Zagreus-0.4B \
  --tasks evalita-mp \
  --device cuda:0 \
  --batch_size 1

Results: Open-Zagreus-0.4B vs. Zagreus-0.4B-ita (Base)

| Task | Metric | Zagreus-0.4B-ita (base) | Open-Zagreus-0.4B (SFT) | Δ |
|---|---|---|---|---|
| Overall | acc | 0.3226 | 0.3313 | +0.0087 |
| Admission Test | acc | 0.2137 | 0.2083 | -0.0054 |
| FAQ | acc | 0.2681 | 0.2672 | -0.0009 |
| Hate Speech Detection | f1 | 0.6056 | 0.4340 | -0.1716 |
| Lexical Substitution | f1 | 0.0000 | 0.0000 | = |
| NER | f1 | 0.1611 | 0.1357 | -0.0254 |
| Relation Extraction | f1 | 0.1244 | 0.0000 | -0.1244 |
| Sentiment Analysis | f1 | 0.3660 | 0.3712 | +0.0052 |
| Summarization (Fanpage) | rouge1 | 0.1947 | 0.2305 | +0.0358 |
| Text Entailment | acc | 0.5133 | 0.5492 | +0.0359 |
| Word in Context | f1 | 0.4697 | 0.4880 | +0.0183 |
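
The Δ column is simply the SFT score minus the base score; a quick recomputation for a few rows (values copied from the table above):

```python
# Recompute the delta column from the base and SFT scores above.
scores = {  # task: (base, sft)
    "Overall":                 (0.3226, 0.3313),
    "Summarization (Fanpage)": (0.1947, 0.2305),
    "Hate Speech Detection":   (0.6056, 0.4340),
}
deltas = {task: round(sft - base, 4) for task, (base, sft) in scores.items()}
print(deltas)
# {'Overall': 0.0087, 'Summarization (Fanpage)': 0.0358, 'Hate Speech Detection': -0.1716}
```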

Discussion

The SFT stage delivers a net +0.0087 overall improvement on Evalita. Gains are largest on generative and semantic tasks:

  • Summarization (+0.0358): the model produces more coherent and relevant summaries after instruction tuning
  • Text Entailment (+0.0359): improved language understanding and reasoning
  • Word in Context (+0.0183): better contextual semantic disambiguation
  • Sentiment Analysis (+0.0052): marginal improvement in affective understanding

Some structured classification tasks (Hate Speech Detection, Relation Extraction, NER) regress after SFT — a known phenomenon when general-purpose instruction tuning shifts the model away from the specific output format expected by these extractive tasks. This is expected behavior and not indicative of degraded general language quality.

Overall, these results confirm that a fully open-source pipeline — using only publicly available data and tools — is sufficient to produce a competitive Italian SLM.


Reproducibility

This is the only model in the Nesso/Zagreus family where every component is fully open and reproducible:

| Component | Resource |
|---|---|
| Pre-training data | FineWeb, FineWeb-2, FinePDFs, StarCoder (all public) |
| Pre-training framework | mii-llm/nanotron |
| SFT data | DeepMount00/OpenItalianData |
| SFT framework | Axolotl |
| Evaluation | lm-evaluation-harness |
| Model weights | This repository |
| Training config | See Axolotl configuration above |

Related Models

| Model | Description |
|---|---|
| Zagreus-0.4B-ita | Base pre-trained model (this model's foundation) |
| Nesso-0.4B-instruct | Proprietary SFT — optimized for instruction following |
| Nesso-0.4B-agentic | Proprietary SFT — optimized for function calling and agentic tasks |

Citation

If you use this model in your research, please cite:

@misc{nesso2025,
  title        = {The Joy and Pain of Training an LLM from Scratch:
                  A Technical Report on the Zagreus and Nesso Model Families},
  author       = {mii-llm community},
  year         = {2025},
  howpublished = {\url{https://github.com/mii-llm/zagreus-nesso-slm}},
}

Acknowledgements

  • Antonio Baldassarra (CEO, Seeweb) and Marco Cristofanilli (Head of AI, Seeweb) for infrastructure sponsorship
  • Michele Montebovi for publishing the OpenItalianData SFT dataset that makes this model fully reproducible
  • The Hugging Face team for Nanotron, datatrove, FineWeb, and FineWeb-2
  • The mii-llm open-source community

License

Released under the Apache 2.0 license.

Made with ❤️ in Italy by mii-llm
