mt5-it2mnt-v3
Italian ↔ Montatese Neural Translator (mT5)
This repository contains mt5-it2mnt-v3, the third iteration of a neural machine translation model based on mT5, fine-tuned to translate between Italian and Montatese, a local Piedmontese dialect spoken in Montà (CN, Italy).
The project aims to preserve and digitally support a minority dialect through modern NLP techniques, combining curated linguistic data with transformer-based multilingual models.
Supported Translation Directions
The model supports explicit bilingual translation, controlled via a prefix:
| Direction | Prefix |
|---|---|
| Italian → Montatese | `>>it-mt<<` |
| Montatese → Italian | `>>mt-it<<` |
⚠️ The prefix is mandatory for correct translations.
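The prefix is simply prepended to the source text. A minimal helper (hypothetical, not part of this repository) makes the convention explicit and guards against a missing or misspelled prefix:

```python
# Hypothetical helper: the model only requires that the direction
# prefix appear verbatim at the start of the input string.
PREFIXES = {
    ("it", "mt"): ">>it-mt<<",
    ("mt", "it"): ">>mt-it<<",
}

def tag(text: str, src: str, tgt: str) -> str:
    """Prepend the mandatory direction prefix expected by mt5-it2mnt-v3."""
    try:
        prefix = PREFIXES[(src, tgt)]
    except KeyError:
        raise ValueError(f"unsupported direction: {src}->{tgt}")
    return f"{prefix} {text}"

print(tag("domani andiamo a mangiare insieme", "it", "mt"))
# >>it-mt<< domani andiamo a mangiare insieme
```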
Model Details
- Base model: `google/mt5-small`
- Architecture: Encoder-Decoder (Seq2Seq)
- Training type: Fine-tuning
- Version: v3
- Framework: Hugging Face Transformers
- Tokenizer: SentencePiece (shared mT5 tokenizer, extended during training)
Dataset
The training dataset is a custom bilingual corpus built manually and incrementally, containing:
- Italian sentences
- Montatese translations (with correct local orthography and diacritics)
- Explicit direction labels
Key characteristics:
- Curated and cleaned across multiple iterations (v1 → v3)
- Consistent column schema (`source_text`, `target_text`, `direction`)
- Focus on real spoken language, not normalized Italian
The dataset is not fully public due to its artisanal and evolving nature.
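Assuming the corpus is stored as a CSV with the three columns above (the sample row and loading code below are illustrative, since the real data is not public), a quick schema check might look like:

```python
import csv
import io

# Illustrative sample in the corpus schema; real rows are not public.
sample = io.StringIO(
    "source_text,target_text,direction\n"
    "domani andiamo a mangiare insieme,<montatese translation>,it-mt\n"
)

REQUIRED = {"source_text", "target_text", "direction"}

reader = csv.DictReader(sample)
# Verify the column schema before training.
assert REQUIRED.issubset(reader.fieldnames), "missing columns"
# Keep only rows where all three fields are non-empty.
rows = [r for r in reader if all(r[c].strip() for c in REQUIRED)]
print(len(rows), "valid rows")
```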
Training Summary
- Training performed locally on CPU
- Resumed from an intermediate checkpoint after an interruption
- Final training completed successfully (epoch = 1)
- No data loss during resume
- Stable loss and evaluation metrics
Indicative metrics:
- `train_loss` ≈ 0.95
- `eval_loss` ≈ 1.15
Usage Example
Python (Transformers)
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

MODEL_ID = "traduttoremontatese/mt5-it2mnt-v3"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_ID)

# The direction prefix is mandatory (here: Italian -> Montatese).
text = ">>it-mt<< domani andiamo a mangiare insieme"
inputs = tokenizer(text, return_tensors="pt", truncation=True)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```