# Darija-Translator

A Nano-Transformer for Tunisian Darija-to-English translation, built from scratch on an RTX 3050 Laptop GPU (4 GB VRAM).
## Model Details
- Architecture: Encoder-Decoder Transformer
- Parameters: ~15.6M
- Tokenizer: SentencePiece BPE (16,000 tokens)
- Training data: 35,977 cleaned Moroccan Darija–English pairs
- Fine-tuning data: 120 hand-crafted Tunisian pairs
- Final loss: 2.6264
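The ~15.6M parameter count is consistent with a small encoder-decoder Transformer over a 16,000-token vocabulary. As a sanity check, here is a sketch that tallies parameters for one plausible configuration; the hyperparameters (d_model=256, 6+6 layers, FFN width 1024, tied embeddings) are assumptions for illustration, not the model's actual settings:

```python
# Sketch: rough parameter count for a small encoder-decoder Transformer.
# All hyperparameters below are assumed for illustration -- the README
# does not state the actual configuration.

def transformer_params(vocab=16_000, d_model=256, n_enc=6, n_dec=6, d_ffn=1024):
    """Approximate parameter count, assuming tied input/output embeddings."""
    linear = lambda d_in, d_out: d_in * d_out + d_out   # weight matrix + bias
    attn = 4 * linear(d_model, d_model)                 # Q, K, V, output proj
    ffn = linear(d_model, d_ffn) + linear(d_ffn, d_model)
    ln = 2 * d_model                                    # LayerNorm scale + shift
    enc_layer = attn + ffn + 2 * ln
    dec_layer = 2 * attn + ffn + 3 * ln                 # self- plus cross-attention
    embedding = vocab * d_model                         # shared with output layer
    return embedding + n_enc * enc_layer + n_dec * dec_layer

print(f"{transformer_params() / 1e6:.1f}M parameters")  # prints "15.2M parameters"
```

Under these assumptions the count lands near 15M, in the same ballpark as the stated ~15.6M; the gap would be covered by details such as positional embeddings or an untied output head.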
## Limitations
This model was trained primarily on Moroccan Darija data; Tunisian-specific vocabulary comes from only 120 hand-crafted pairs, so Tunisian coverage is narrow. Phase 2 field collection will expand the Tunisian foundation.
## Author

- Dhia Azizi (@dhiadev-tn)
- GitHub: https://github.com/Dhiadev-tn/darija-translator
- Dataset: https://huggingface.co/datasets/Dhiadev-tn/tunisian-darija-english