|
|
--- |
|
|
library_name: transformers |
|
|
language: |
|
|
- en |
|
|
metrics: |
|
|
- bleu |
|
|
pipeline_tag: translation |
|
|
--- |
|
|
|
|
|
# Model Card for English-to-Darija Translation (mBART Fine-tuned Model)
|
|
|
|
|
|
|
|
## Model Details |
|
|
|
|
|
### Model Description |
|
|
|
|
|
This model is a fine-tuned version of facebook/mbart-large-50-many-to-many-mmt, tailored for translating English text into Moroccan Darija written in Arabic script. It was trained on a custom dataset of English-Darija sentence pairs and is intended to capture the characteristic features of the Moroccan dialect.
|
|
|
|
|
|
|
- **Developed by:** Aicha Lahnouki |
|
|
- **Finetuned from model:** facebook/mbart-large-50-many-to-many-mmt |
|
|
- **Model type:** Sequence-to-Sequence Translation (mBART architecture) |
|
|
- **Language(s) (NLP):** English (`en_XX`), Moroccan Darija in Arabic script (via mBART's `ar_AR` language code)
|
|
|
|
|
|
|
|
## Uses |
|
|
|
|
|
### Direct Use |
|
|
|
|
|
This model is intended for translating English sentences into Moroccan Darija in Arabic script. |
|
|
It can be used in applications such as translation services, language learning tools, or chatbots. |
|
|
|
|
|
|
|
|
## Bias, Risks, and Limitations |
|
|
|
|
|
This model was trained on roughly 50% of the dataset provided by DODa (Darija Open Dataset), about 45,000 sentence pairs, and tested on a sample of 100 sentences. Due to the reduced training data, the model may not capture the full linguistic diversity of English-to-Darija translation. The limited test size also may not fully represent the model's performance across all possible inputs, so biases or inaccuracies are possible on unseen or diverse data.
|
|
|
|
|
|
|
|
## How to Get Started with the Model |
|
|
|
|
|
You can start using the model for English-to-Darija translation with the following code: |
|
|
|
|
|
```python |
|
|
from transformers import pipeline |
|
|
|
|
|
# Initialize the translation pipeline |
|
|
pipe = pipeline("translation", model="alpha2002/eng_alpha_darija", tokenizer="alpha2002/eng_alpha_darija") |
|
|
|
|
|
# Translate English to Darija |
|
|
input_text = "Hello, how are you?" |
|
|
translation = pipe(input_text, src_lang="en_XX", tgt_lang="ar_AR") |
|
|
|
|
|
print("Translation:", translation[0]['translation_text']) |
|
|
``` |
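If you prefer to call the model directly rather than through the pipeline, the standard mBART-50 generation pattern applies: set the source language on the tokenizer and force the target-language token at the start of generation. A minimal sketch, assuming the repository ships the standard MBart50 tokenizer:

```python
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

model = MBartForConditionalGeneration.from_pretrained("alpha2002/eng_alpha_darija")
tokenizer = MBart50TokenizerFast.from_pretrained("alpha2002/eng_alpha_darija")

# English is the source language; force the decoder to start with the
# Darija/Arabic target token (ar_AR) so generation happens in the right language.
tokenizer.src_lang = "en_XX"
inputs = tokenizer("Hello, how are you?", return_tensors="pt")
generated = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.lang_code_to_id["ar_AR"],
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])
```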
|
|
|
|
|
## Training Details |
|
|
|
|
|
### Training Data |
|
|
|
|
|
The model was trained on a custom dataset containing parallel English and Darija sentences. |
|
|
The dataset was preprocessed to include language tokens specific to mBART's requirements. |
|
|
|
|
|
### Training Procedure |
|
|
|
|
|
|
|
|
#### Preprocessing
|
|
|
|
|
The English source text was tokenized with the `en_XX` language code, and the Darija target text with the `ar_AR` code, following mBART-50's convention of prefixing each sequence with its language token.
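As an illustration, preprocessing with the mBART-50 tokenizer typically looks like the sketch below. The exact training script is not published; the sentence pair is a hypothetical example, and `text_target` tokenizes the labels with the target-language code:

```python
from transformers import MBart50TokenizerFast

tokenizer = MBart50TokenizerFast.from_pretrained(
    "facebook/mbart-large-50-many-to-many-mmt",
    src_lang="en_XX",  # English source
    tgt_lang="ar_AR",  # Darija target, via mBART's Arabic code
)

english = "Hello, how are you?"
darija = "سلام، لاباس؟"  # hypothetical reference translation

# The source is prefixed with en_XX and the labels with ar_AR.
batch = tokenizer(english, text_target=darija, return_tensors="pt")
print(batch.keys())  # input_ids, attention_mask, labels
```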
|
|
|
|
|
#### Training Hyperparameters |
|
|
|
|
|
- **Training regime:** FP16 mixed precision was used during training to improve throughput.
- **Setup:** Training was run on Google Colab on a subset of the data, with gradient accumulation to simulate a larger effective batch size.
|
|
|
|
|
|
|
|
#### Speeds, Sizes, Times
|
|
|
|
|
The model was trained for 2 epochs with a batch size of 4, using the Seq2SeqTrainer from the Hugging Face Transformers library. |
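Putting the reported settings together, a minimal training setup might look like the following sketch. The gradient-accumulation value and the one-example toy dataset are assumptions for illustration; the card states only that accumulation was used and that training ran on a subset of DODa:

```python
from datasets import Dataset
from transformers import (
    DataCollatorForSeq2Seq,
    MBart50TokenizerFast,
    MBartForConditionalGeneration,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

base = "facebook/mbart-large-50-many-to-many-mmt"
tokenizer = MBart50TokenizerFast.from_pretrained(base, src_lang="en_XX", tgt_lang="ar_AR")
model = MBartForConditionalGeneration.from_pretrained(base)

# Toy stand-in for the DODa English-Darija sentence pairs.
pairs = Dataset.from_dict({
    "en": ["Hello, how are you?"],
    "darija": ["سلام، لاباس؟"],  # hypothetical reference
})

def preprocess(batch):
    return tokenizer(batch["en"], text_target=batch["darija"], truncation=True)

train_dataset = pairs.map(preprocess, batched=True, remove_columns=["en", "darija"])

args = Seq2SeqTrainingArguments(
    output_dir="eng_alpha_darija",
    num_train_epochs=2,              # as reported above
    per_device_train_batch_size=4,   # as reported above
    gradient_accumulation_steps=4,   # assumed value; only "gradient accumulation" is stated
    fp16=True,                       # mixed-precision training, as reported
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    tokenizer=tokenizer,
)
trainer.train()
```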
|
|
|
|
|
## Evaluation |
|
|
|
|
|
|
|
|
### Testing Data, Factors & Metrics |
|
|
|
|
|
#### Testing Data |
|
|
|
|
|
The model was evaluated on a small held-out test set of 100 sentences.
|
|
|
|
|
|
|
|
#### Metrics |
|
|
|
|
|
BLEU score was used to measure translation accuracy. |
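For reference, BLEU is commonly computed with sacreBLEU through the 🤗 evaluate library. A minimal sketch with hypothetical predictions and references (the actual evaluation script is not published):

```python
import evaluate

bleu = evaluate.load("sacrebleu")

# Hypothetical outputs and gold translations; the real test set had 100 sentences.
predictions = ["سلام، لاباس؟"]
references = [["سلام، كيف داير؟"]]  # one list of references per prediction

result = bleu.compute(predictions=predictions, references=references)
print(f"BLEU: {result['score']:.1f}")
```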
|
|
|
|
|
### Results |
|
|
|
|
|
The model achieved a BLEU score of 11.6 on the test set, a modest result that is reasonable given the complexity of translating between languages with different scripts and linguistic structures.
|
|
|
|
|
|
|
|
|
|
|
## Environmental Impact |
|
|
|
|
|
- **Hardware Type:** Google Colab GPU (NVIDIA Tesla K80) |
|
|
- **Hours used:** Approximately 2 hours for training and 1 hour for testing.
|
|
|
|
|
|
|
|
|
|
|
## Citation
|
|
|
|
|
**BibTeX:** |
|
|
|
|
|
```bibtex
@misc{lahnouki2024eng_alpha_darija,
  author = {Aicha Lahnouki},
  title  = {English-to-Darija Translation Model},
  year   = {2024},
  url    = {https://huggingface.co/alpha2002/eng_alpha_darija},
}
```
|
|
|
|
|
## Model Card Authors
|
|
|
|
|
Aicha Lahnouki
|
|
## Model Card Contact |
|
|
|
|
|
Email: aichalahnouki@gmail.com