t5-ladino-espanol
This model translates from modern Spanish into Judeo-Spanish (Ladino), the historical language of the Sephardic Jewish community. It is a fine-tuned version of google-t5/t5-small trained on the collectivat/una-fraza-al-diya dataset, a multilingual corpus designed to support the documentation and preservation of Ladino, an endangered language.
It achieves the following results on the evaluation set:
- Loss: 3.3840
- BLEU: 0.0
- Generated Length: 5.0 tokens
Model description
This model is based on the T5 architecture and was fine-tuned for a sequence-to-sequence translation task. The goal is to generate translations from Spanish into Ladino, using a small parallel corpus of aligned phrases.
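The model can be queried like any other T5 checkpoint. The sketch below assumes the Hub id `rebego/t5-ladino-espanol` and a `"translate Spanish to Ladino:"` task prefix; the exact prefix used during fine-tuning is an assumption, so adjust it if your inputs produce poor output.

```python
# Inference sketch for this model. The task prefix below is an assumption,
# not confirmed by the model card.

def build_prompt(spanish_text: str) -> str:
    """Prepend the assumed T5 task prefix to a Spanish sentence."""
    return f"translate Spanish to Ladino: {spanish_text}"

def translate(spanish_text: str, model_id: str = "rebego/t5-ladino-espanol") -> str:
    """Download the checkpoint and generate a Ladino translation."""
    # Local import: transformers is an optional dependency of this sketch.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
    inputs = tokenizer(build_prompt(spanish_text), return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=50)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `translate("Buenos días")` fetches the checkpoint from the Hub and returns its decoded output; given the current BLEU score, expect short and often unreliable translations.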
Intended uses & limitations
The model is intended for:
- Educational or cultural projects related to the Judeo-Spanish language.
- Language preservation and revitalization efforts.
- Demonstration of machine translation capabilities for low-resource and endangered languages.
Limitations:
- The model was trained on a very small dataset (only 307 sentence pairs); the evaluation BLEU of 0.0 indicates its translations are not yet reliable.
- It may produce short or incomplete translations.
- Orthographic variation is expected, as Ladino does not have a standardized modern spelling.
Training and evaluation data
The training data comes from the dataset collectivat/una-fraza-al-diya, which contains 307 aligned phrases in Ladino, Spanish, Turkish, and English. The dataset was developed by the Sephardic Center of Istanbul as part of a cultural preservation initiative. Only the Spanish-Ladino pairs were used for training this model.
The dataset was split into:
- Training set: 245 examples (80%)
- Validation set: 31 examples (10%)
- Test set: 31 examples (10%)
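The 80/10/10 split above can be reproduced with plain Python; the placeholder pairs, the seed, and the exact splitting utility the authors used are assumptions here.

```python
# Sketch of an 80/10/10 split over the 307 sentence pairs.
# The pair contents and seed 42 are placeholders/assumptions.
import random

pairs = [{"es": f"frase {i}", "lad": f"fraza {i}"} for i in range(307)]
rng = random.Random(42)
rng.shuffle(pairs)

n_train = int(0.8 * len(pairs))        # 245 examples for training
n_val = (len(pairs) - n_train) // 2    # 31 examples for validation
train_set = pairs[:n_train]
val_set = pairs[n_train:n_train + n_val]
test_set = pairs[n_train + n_val:]     # remaining 31 examples for testing

print(len(train_set), len(val_set), len(test_set))  # 245 31 31
```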
Training procedure
The model was fine-tuned using the Seq2SeqTrainer class from Hugging Face's transformers library.
Training hyperparameters
The following hyperparameters were used:
- learning_rate: 5.6e-05
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- optimizer: AdamW (betas=(0.9, 0.999), epsilon=1e-08)
- lr_scheduler_type: linear
- num_epochs: 2
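The hyperparameters above map onto a `Seq2SeqTrainingArguments` configuration roughly as follows; the output directory is a placeholder, and the evaluation strategy and `predict_with_generate` flag are assumptions inferred from the per-epoch results table.

```python
# Sketch of the fine-tuning configuration; hyperparameter values are taken
# from the list above, everything else is an assumption.
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="t5-ladino-espanol",  # placeholder
    learning_rate=5.6e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    seed=42,
    lr_scheduler_type="linear",
    num_train_epochs=2,
    eval_strategy="epoch",        # results are reported once per epoch
    predict_with_generate=True,   # required to compute BLEU and generated length
)
```

The AdamW optimizer with betas=(0.9, 0.999) and epsilon=1e-08 is the transformers default, so it needs no explicit argument.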
Training results
| Training Loss | Epoch | Step | Validation Loss | BLEU | Gen Len |
|---|---|---|---|---|---|
| No log | 1.0 | 10 | 3.5388 | 0.0 | 5.0 |
| No log | 2.0 | 20 | 3.3840 | 0.0 | 5.0 |
Framework versions
- Transformers: 4.49.0
- PyTorch: 2.6.0+cu124
- Datasets: 3.4.1
- Tokenizers: 0.21.1