mt5-it2mnt-v3
Italian ↔ Montatese Neural Translator (mT5)
This repository contains mt5-it2mnt-v3, the third iteration of a neural machine translation model based on mT5, fine-tuned to translate between Italian and Montatese, a local Piedmontese dialect spoken in Montà (CN, Italy).
The project aims to preserve and digitally support a minority dialect through modern NLP techniques, combining curated linguistic data with transformer-based multilingual models.
Supported Translation Directions
The model supports explicit bilingual translation, controlled via a prefix:
| Direction | Prefix |
|---|---|
| Italian → Montatese | `>>it-mt<<` |
| Montatese → Italian | `>>mt-it<<` |
⚠️ The prefix is mandatory for correct translations.
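The prefix is simply prepended to the source text. A minimal helper (hypothetical, not part of this repository) makes the convention explicit and guards against a missing or misspelled prefix:

```python
# Hypothetical helper: the model only requires that the direction
# prefix appear verbatim at the start of the input string.
PREFIXES = {
    ("it", "mt"): ">>it-mt<<",
    ("mt", "it"): ">>mt-it<<",
}

def tag(text: str, src: str, tgt: str) -> str:
    """Prepend the mandatory direction prefix expected by mt5-it2mnt-v3."""
    try:
        prefix = PREFIXES[(src, tgt)]
    except KeyError:
        raise ValueError(f"unsupported direction: {src}->{tgt}")
    return f"{prefix} {text}"

print(tag("domani andiamo a mangiare insieme", "it", "mt"))
# >>it-mt<< domani andiamo a mangiare insieme
```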
Model Details
- Base model: `google/mt5-small`
- Architecture: Encoder-Decoder (Seq2Seq)
- Training type: Fine-tuning
- Version: v3
- Framework: Hugging Face Transformers
- Tokenizer: SentencePiece (shared mT5 tokenizer, extended during training)
Dataset
The training dataset is a custom bilingual corpus built manually and incrementally, containing:
- Italian sentences
- Montatese translations (with correct local orthography and diacritics)
- Explicit direction labels
Key characteristics:
- Curated and cleaned across multiple iterations (v1 → v3)
- Consistent column schema (`source_text`, `target_text`, `direction`)
- Focus on real spoken language, not normalized Italian
The dataset is not fully public due to its artisanal and evolving nature.
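Assuming the corpus is stored as a CSV with the three columns above (the sample row and loading code below are illustrative, since the real data is not public), a quick schema check might look like:

```python
import csv
import io

# Illustrative sample in the corpus schema; real rows are not public.
sample = io.StringIO(
    "source_text,target_text,direction\n"
    "domani andiamo a mangiare insieme,<montatese translation>,it-mt\n"
)

REQUIRED = {"source_text", "target_text", "direction"}

reader = csv.DictReader(sample)
# Verify the column schema before training.
assert REQUIRED.issubset(reader.fieldnames), "missing columns"
# Keep only rows where all three fields are non-empty.
rows = [r for r in reader if all(r[c].strip() for c in REQUIRED)]
print(len(rows), "valid rows")
```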
Training Summary
- Training performed locally on CPU
- Resumed from an intermediate checkpoint after an interruption
- Final training completed successfully (epoch = 1)
- No data loss during resume
- Stable loss and evaluation metrics
Indicative metrics:
- `train_loss` ≈ 0.95
- `eval_loss` ≈ 1.15
Usage Example
Python (Transformers)
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

MODEL_ID = "traduttoremontatese/mt5-it2mnt-v3"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_ID)

# The direction prefix is mandatory (here: Italian -> Montatese).
text = ">>it-mt<< domani andiamo a mangiare insieme"
inputs = tokenizer(text, return_tensors="pt", truncation=True)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```