t5-ladino-espanol

This model translates from modern Spanish into Judeo-Spanish (Ladino), an endangered language historically spoken by Sephardic Jewish communities.

It is a fine-tuned version of google-t5/t5-small trained on the collectivat/una-fraza-al-diya dataset, a multilingual corpus designed to support the documentation and preservation of Ladino.
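
A minimal inference sketch using the transformers library. The model id is taken from this card; whether a task prefix was used during fine-tuning is not documented here, so the prefix below is an assumption:

```python
# Minimal inference sketch. The task prefix is an assumption — adjust it
# to match whatever preprocessing was used during fine-tuning.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_id = "rebego/t5-ladino-espanol"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

text = "translate Spanish to Ladino: ¿Cómo estás?"  # prefix is an assumption
inputs = tokenizer(text, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```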

It achieves the following results on the evaluation set:

  • Loss: 3.3840
  • BLEU: 0.0
  • Generated Length: 5.0 tokens

The zero BLEU score and short average generation length are consistent with the very small training corpus (see Limitations below).

Model description

This model is based on the T5 architecture (T5-small, ~60.5M parameters) and was fine-tuned for a sequence-to-sequence translation task. The goal is to generate translations from Spanish into Ladino using a small parallel corpus of aligned phrases.

Intended uses & limitations

The model is intended for:

  • Educational or cultural projects related to the Judeo-Spanish language.
  • Language preservation and revitalization efforts.
  • Demonstration of machine translation capabilities for low-resource and endangered languages.

Limitations:

  • The model was trained on a very small dataset (only 307 sentence pairs).
  • It may produce short or incomplete translations.
  • Orthographic variation is expected, as Ladino does not have a standardized modern spelling.

Training and evaluation data

The training data comes from the dataset collectivat/una-fraza-al-diya, which contains 307 aligned phrases in Ladino, Spanish, Turkish, and English. The dataset was developed by the Sephardic Center of Istanbul as part of a cultural preservation initiative. Only the Spanish-Ladino pairs were used for training this model.

The dataset was split as follows (a reproduction sketch appears after the list):

  • Training set: 245 examples (80%)
  • Validation set: 31 examples (10%)
  • Test set: 31 examples (10%)
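
A minimal sketch of how an 80/10/10 split like this can be reproduced with the datasets library. The dataset id comes from this card; the single "train" split, the column layout, and the fixed seed are assumptions:

```python
# Data-preparation sketch; split proportions match the card (245/31/31 of 307).
from datasets import load_dataset

ds = load_dataset("collectivat/una-fraza-al-diya", split="train")  # split name assumed

# Carve off 20% for evaluation, then halve it into validation and test.
split = ds.train_test_split(test_size=0.2, seed=42)
eval_halves = split["test"].train_test_split(test_size=0.5, seed=42)
train_ds, val_ds, test_ds = split["train"], eval_halves["train"], eval_halves["test"]
print(len(train_ds), len(val_ds), len(test_ds))
```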

Training procedure

The model was fine-tuned using the Seq2SeqTrainer class from Hugging Face's transformers library.

Training hyperparameters

The following hyperparameters were used (a training-setup sketch follows the list):

  • learning_rate: 5.6e-05
  • train_batch_size: 8
  • eval_batch_size: 8
  • seed: 42
  • optimizer: AdamW (betas=(0.9, 0.999), epsilon=1e-08)
  • lr_scheduler_type: linear
  • num_epochs: 2
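
A minimal training-setup sketch that mirrors the hyperparameters above. The Spanish/Ladino column names ("es"/"lad") and the preprocessing details are assumptions; train_ds and val_ds are the splits from the earlier sketch. AdamW with the listed betas and epsilon is the Trainer default, so it needs no explicit configuration:

```python
# Training-setup sketch mirroring the hyperparameters listed above.
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

checkpoint = "google-t5/t5-small"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

def preprocess(batch):
    # Tokenize Spanish inputs and Ladino targets (column names assumed).
    model_inputs = tokenizer(batch["es"], truncation=True, max_length=128)
    labels = tokenizer(text_target=batch["lad"], truncation=True, max_length=128)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

train_tok = train_ds.map(preprocess, batched=True)
val_tok = val_ds.map(preprocess, batched=True)

args = Seq2SeqTrainingArguments(
    output_dir="t5-ladino-espanol",
    learning_rate=5.6e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=2,
    seed=42,
    lr_scheduler_type="linear",
    eval_strategy="epoch",
    predict_with_generate=True,  # generate during eval so BLEU / gen-len can be computed
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_tok,
    eval_dataset=val_tok,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```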

Training results

| Training Loss | Epoch | Step | Validation Loss | BLEU | Gen Len |
|:-------------:|:-----:|:----:|:---------------:|:----:|:-------:|
| No log        | 1.0   | 10   | 3.5388          | 0.0  | 5.0     |
| No log        | 2.0   | 20   | 3.3840          | 0.0  | 5.0     |
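
The BLEU and Gen Len columns imply that evaluation used generated predictions. A sketch of a compute_metrics function that would produce such columns, assuming the evaluate library's sacrebleu metric (the exact metric setup is not documented in this card):

```python
# Metric sketch; relies on the tokenizer from the training-setup sketch.
import numpy as np
import evaluate

bleu = evaluate.load("sacrebleu")

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    # Replace label padding (-100) with the pad token id before decoding.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    result = bleu.compute(predictions=decoded_preds,
                          references=[[label] for label in decoded_labels])
    gen_len = np.mean([np.count_nonzero(p != tokenizer.pad_token_id) for p in preds])
    return {"bleu": result["score"], "gen_len": gen_len}
```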

Framework versions

  • Transformers: 4.49.0
  • PyTorch: 2.6.0+cu124
  • Datasets: 3.4.1
  • Tokenizers: 0.21.1
