Darija-Translator

A nano-scale Transformer for Tunisian Darija-to-English translation, trained from scratch on an RTX 3050 Laptop GPU (4 GB VRAM).

Model Details

  • Architecture: Encoder-Decoder Transformer
  • Parameters: ~15.6M
  • Tokenizer: SentencePiece BPE (16,000 tokens)
  • Training data: 35,977 cleaned Moroccan Darija pairs
  • Fine-tuning data: 120 hand-crafted Tunisian pairs
  • Final loss: 2.6264
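
The card does not state the model's exact hyperparameters, but a ~15.6M-parameter encoder-decoder with a 16,000-token vocabulary is consistent with, for example, d_model = 256, 4 encoder and 4 decoder layers, a 1024-unit feed-forward dimension, and untied output embeddings. A rough parameter-count sketch under those assumed values (all of them illustrative, not confirmed by this card):

```python
# Hypothetical hyperparameters -- chosen to illustrate the ~15.6M figure,
# not confirmed by the model card.
VOCAB = 16_000      # SentencePiece BPE vocabulary size (from the card)
D_MODEL = 256       # assumed model width
FFN = 1_024         # assumed feed-forward dimension
ENC_LAYERS = 4      # assumed encoder depth
DEC_LAYERS = 4      # assumed decoder depth

def attention_params(d: int) -> int:
    # Q, K, V, and output projection matrices, each d x d
    return 4 * d * d

def ffn_params(d: int, ffn: int) -> int:
    # two linear layers: d -> ffn and ffn -> d
    return d * ffn + ffn * d

embedding = VOCAB * D_MODEL
encoder = ENC_LAYERS * (attention_params(D_MODEL) + ffn_params(D_MODEL, FFN))
# decoder layers have self-attention plus cross-attention
decoder = DEC_LAYERS * (2 * attention_params(D_MODEL) + ffn_params(D_MODEL, FFN))
output_proj = D_MODEL * VOCAB  # untied output embedding

total = embedding + encoder + decoder + output_proj
print(f"~{total / 1e6:.1f}M parameters (excluding biases and LayerNorm)")
```

This lands at roughly 15.5M weight-matrix parameters; biases and LayerNorm gains/offsets account for the remaining difference from the reported ~15.6M.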

Limitations

This model was trained primarily on Moroccan Darija data; Tunisian-specific vocabulary coverage comes from only 120 hand-crafted pairs. Phase 2 field collection will expand the Tunisian foundation.

Author

Dhia Azizi (@dhiadev-tn)

  • GitHub: https://github.com/Dhiadev-tn/darija-translator
  • Dataset: https://huggingface.co/datasets/Dhiadev-tn/tunisian-darija-english

