mt5-it2mnt-v3

Italian ↔ Montatese Neural Translator (mT5)

This repository contains mt5-it2mnt-v3, the third iteration of a neural machine translation model based on mT5, fine-tuned to translate between Italian and Montatese, a local Piedmontese dialect spoken in Montà (CN, Italy).

The project aims to preserve and digitally support a minority dialect through modern NLP techniques, combining curated linguistic data with transformer-based multilingual models.


🔀 Supported Translation Directions

The model supports explicit bilingual translation, controlled via a prefix:

Direction              Prefix
Italian → Montatese    >>it-mt<<
Montatese → Italian    >>mt-it<<

⚠️ The direction prefix is mandatory: without it, the model cannot determine the translation direction.


🧠 Model Details

  • Base model: google/mt5-small
  • Architecture: Encoder–Decoder (Seq2Seq)
  • Training type: Fine-tuning
  • Version: v3
  • Size: ~0.2B parameters (F32, Safetensors)
  • Framework: Hugging Face Transformers
  • Tokenizer: SentencePiece (shared mT5 tokenizer, extended during training)

📚 Dataset

The training dataset is a custom bilingual corpus built manually and incrementally, containing:

  • Italian sentences
  • Montatese translations (with correct local orthography and diacritics)
  • Explicit direction labels

Key characteristics:

  • Curated and cleaned across multiple iterations (v1 → v3)
  • Consistent column schema (source_text, target_text, direction)
  • Focus on real spoken language, not normalized Italian

The dataset is not fully public due to its artisanal and evolving nature.
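Since the corpus evolves incrementally, a lightweight validation pass over the fixed column schema (source_text, target_text, direction) can catch malformed rows before training. This is a minimal sketch, not the project's actual tooling: the sample rows below are hypothetical (the real corpus is not public), and the `<montatese>` placeholder stands in for a real translation.

```python
import csv
from io import StringIO

REQUIRED_COLUMNS = {"source_text", "target_text", "direction"}
VALID_DIRECTIONS = {"it-mt", "mt-it"}

def validate_rows(reader):
    """Yield (row_number, error) for every malformed row in a DictReader."""
    for i, row in enumerate(reader, start=1):
        missing = REQUIRED_COLUMNS - row.keys()
        if missing:
            yield i, f"missing columns: {sorted(missing)}"
        elif row["direction"] not in VALID_DIRECTIONS:
            yield i, f"unknown direction: {row['direction']!r}"
        elif not row["source_text"].strip() or not row["target_text"].strip():
            yield i, "empty source or target text"

# Hypothetical CSV fragment for illustration only.
sample = StringIO(
    "source_text,target_text,direction\n"
    "domani andiamo a mangiare insieme,<montatese>,it-mt\n"
    "ciao,,xx-yy\n"
)
errors = list(validate_rows(csv.DictReader(sample)))
print(errors)  # only the second data row fails (bad direction label)
```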


πŸ‹οΈ Training Summary

  • Training performed locally on CPU
  • Resumed from an intermediate checkpoint after an interruption, with no data loss
  • Final training completed successfully (1 epoch)
  • Loss and evaluation metrics remained stable

Indicative metrics:

  • train_loss ≈ 0.95
  • eval_loss ≈ 1.15
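Resuming after an interruption relies on the Hugging Face Trainer convention of saving `checkpoint-<step>` directories; passing `resume_from_checkpoint=True` to `trainer.train()` picks up the most recent one. The selection logic can be sketched in plain Python (the run directory and step numbers below are hypothetical):

```python
import os
import re
import tempfile

def latest_checkpoint(output_dir):
    """Return the path of the highest-step checkpoint-<step> directory, or None."""
    pattern = re.compile(r"^checkpoint-(\d+)$")
    best_step, best_path = -1, None
    for name in os.listdir(output_dir):
        m = pattern.match(name)
        if m and os.path.isdir(os.path.join(output_dir, name)):
            step = int(m.group(1))
            if step > best_step:
                best_step, best_path = step, os.path.join(output_dir, name)
    return best_path

# Hypothetical run directory containing two saved checkpoints.
with tempfile.TemporaryDirectory() as run_dir:
    for step in (500, 1000):
        os.makedirs(os.path.join(run_dir, f"checkpoint-{step}"))
    resumed = os.path.basename(latest_checkpoint(run_dir))
print(resumed)  # checkpoint-1000
```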

🚀 Usage Example

Python (Transformers)

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

MODEL_ID = "traduttoremontatese/mt5-it2mnt-v3"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_ID)

# The direction prefix (>>it-mt<< or >>mt-it<<) is mandatory.
text = ">>it-mt<< domani andiamo a mangiare insieme"

inputs = tokenizer(text, return_tensors="pt", truncation=True)
outputs = model.generate(**inputs, max_new_tokens=128)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
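For the reverse direction, the input only needs the >>mt-it<< prefix instead. Since a missing or misspelled prefix silently produces wrong output, a small helper can keep prefix handling in one place. This is a sketch, not part of the released model: `with_prefix` and `translate` are hypothetical helpers, and the model download at the bottom is gated behind a flag so the prefix logic can be used on its own.

```python
PREFIXES = {"it-mt": ">>it-mt<<", "mt-it": ">>mt-it<<"}

def with_prefix(text, direction):
    """Prepend the mandatory direction prefix to a raw sentence."""
    if direction not in PREFIXES:
        raise ValueError(f"direction must be one of {sorted(PREFIXES)}")
    return f"{PREFIXES[direction]} {text.strip()}"

def translate(text, direction, model, tokenizer):
    """Translate one sentence in the given direction (same pattern as above)."""
    inputs = tokenizer(with_prefix(text, direction), return_tensors="pt", truncation=True)
    outputs = model.generate(**inputs, max_new_tokens=128)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

RUN_MODEL = False  # set to True to download the model and actually translate
if RUN_MODEL:
    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
    MODEL_ID = "traduttoremontatese/mt5-it2mnt-v3"
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_ID)
    print(translate("domani andiamo a mangiare insieme", "it-mt", model, tokenizer))
```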