t5-ladino-espanol
This model translates from modern Spanish into Judeo-Spanish (Ladino), the historical language of the Sephardic Jewish community. It is a fine-tuned version of google-t5/t5-small trained on the collectivat/una-fraza-al-diya dataset, a multilingual corpus designed to support the documentation and preservation of Ladino, an endangered language.
It achieves the following results on the evaluation set:
- Loss: 3.3840
- BLEU: 0.0
- Generated Length: 5.0 tokens
Model description
This model is based on the T5 architecture and was fine-tuned for a sequence-to-sequence translation task. The goal is to generate translations from Spanish into Ladino, using a small parallel corpus of aligned phrases.
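The model can be queried like any other T5 checkpoint. The sketch below assumes the Hub id `rebego/t5-ladino-espanol` and a `"translate Spanish to Ladino:"` task prefix; the exact prefix used during fine-tuning is an assumption, so adjust it if your inputs produce poor output.

```python
# Inference sketch for this model. The task prefix below is an assumption,
# not confirmed by the model card.

def build_prompt(spanish_text: str) -> str:
    """Prepend the assumed T5 task prefix to a Spanish sentence."""
    return f"translate Spanish to Ladino: {spanish_text}"

def translate(spanish_text: str, model_id: str = "rebego/t5-ladino-espanol") -> str:
    """Download the checkpoint and generate a Ladino translation."""
    # Local import: transformers is an optional dependency of this sketch.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
    inputs = tokenizer(build_prompt(spanish_text), return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=50)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `translate("Buenos días")` fetches the checkpoint from the Hub and returns its decoded output; given the current BLEU score, expect short and often unreliable translations.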
Intended uses & limitations
The model is intended for:
- Educational or cultural projects related to the Judeo-Spanish language.
- Language preservation and revitalization efforts.
- Demonstration of machine translation capabilities for low-resource and endangered languages.
Limitations:
- The model was trained on a very small dataset (only 307 sentence pairs); the evaluation BLEU of 0.0 indicates its translations are not yet reliable.
- It may produce short or incomplete translations.
- Orthographic variation is expected, as Ladino does not have a standardized modern spelling.
Training and evaluation data
The training data comes from the dataset collectivat/una-fraza-al-diya, which contains 307 aligned phrases in Ladino, Spanish, Turkish, and English. The dataset was developed by the Sephardic Center of Istanbul as part of a cultural preservation initiative. Only the Spanish-Ladino pairs were used for training this model.
The dataset was split into:
- Training set: 245 examples (80%)
- Validation set: 31 examples (10%)
- Test set: 31 examples (10%)
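The 80/10/10 split above can be reproduced with plain Python; the placeholder pairs, the seed, and the exact splitting utility the authors used are assumptions here.

```python
# Sketch of an 80/10/10 split over the 307 sentence pairs.
# The pair contents and seed 42 are placeholders/assumptions.
import random

pairs = [{"es": f"frase {i}", "lad": f"fraza {i}"} for i in range(307)]
rng = random.Random(42)
rng.shuffle(pairs)

n_train = int(0.8 * len(pairs))        # 245 examples for training
n_val = (len(pairs) - n_train) // 2    # 31 examples for validation
train_set = pairs[:n_train]
val_set = pairs[n_train:n_train + n_val]
test_set = pairs[n_train + n_val:]     # remaining 31 examples for testing

print(len(train_set), len(val_set), len(test_set))  # 245 31 31
```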
Training procedure
The model was fine-tuned using the Seq2SeqTrainer class from Hugging Face's transformers library.
Training hyperparameters
The following hyperparameters were used:
- learning_rate: 5.6e-05
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- optimizer: AdamW (betas=(0.9, 0.999), epsilon=1e-08)
- lr_scheduler_type: linear
- num_epochs: 2
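The hyperparameters above map onto a `Seq2SeqTrainingArguments` configuration roughly as follows; the output directory is a placeholder, and the evaluation strategy and `predict_with_generate` flag are assumptions inferred from the per-epoch results table.

```python
# Sketch of the fine-tuning configuration; hyperparameter values are taken
# from the list above, everything else is an assumption.
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="t5-ladino-espanol",  # placeholder
    learning_rate=5.6e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    seed=42,
    lr_scheduler_type="linear",
    num_train_epochs=2,
    eval_strategy="epoch",        # results are reported once per epoch
    predict_with_generate=True,   # required to compute BLEU and generated length
)
```

The AdamW optimizer with betas=(0.9, 0.999) and epsilon=1e-08 is the transformers default, so it needs no explicit argument.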
Training results
| Training Loss | Epoch | Step | Validation Loss | BLEU | Gen Len |
|---|---|---|---|---|---|
| No log | 1.0 | 10 | 3.5388 | 0.0 | 5.0 |
| No log | 2.0 | 20 | 3.3840 | 0.0 | 5.0 |
Framework versions
- Transformers: 4.49.0
- PyTorch: 2.6.0+cu124
- Datasets: 3.4.1
- Tokenizers: 0.21.1