|
|
--- |
|
|
library_name: transformers |
|
|
language: |
|
|
- en |
|
|
metrics: |
|
|
- bleu |
|
|
pipeline_tag: translation |
|
|
--- |
|
|
|
|
|
# Model Card for English-to-Darija Translation (mBART Fine-tuned Model)
|
|
|
|
|
|
|
|
## Model Details |
|
|
|
|
|
### Model Description |
|
|
|
|
|
This model is a fine-tuned version of facebook/mbart-large-50-many-to-many-mmt, tailored for translating English text into Moroccan Darija written in Arabic script. It was trained on a custom dataset of English-Darija sentence pairs and is intended to capture the characteristic features of the Moroccan dialect.
|
|
|
|
|
|
|
- **Developed by:** Aicha Lahnouki |
|
|
- **Finetuned from model:** facebook/mbart-large-50-many-to-many-mmt |
|
|
- **Model type:** Sequence-to-Sequence Translation (mBART architecture) |
|
|
- **Language(s) (NLP):** English (`en_XX`), Moroccan Darija in Arabic script (via mBART's `ar_AR` language code)
|
|
|
|
|
|
|
|
## Uses |
|
|
|
|
|
### Direct Use |
|
|
|
|
|
This model is intended for translating English sentences into Moroccan Darija in Arabic script. |
|
|
It can be used in applications such as translation services, language learning tools, or chatbots. |
|
|
|
|
|
|
|
|
## Bias, Risks, and Limitations |
|
|
|
|
|
This model was trained on roughly 50% of the dataset provided by DODa (Darija Open Dataset), about 45,000 sentence pairs, and tested on a sample of 100 sentences. Due to the reduced training data, the model may not capture the full linguistic diversity of English-to-Darija translation. The limited test size also may not fully represent the model's performance across all possible inputs, so biases or inaccuracies are possible on unseen or diverse data.
|
|
|
|
|
|
|
|
## How to Get Started with the Model |
|
|
|
|
|
You can start using the model for English-to-Darija translation with the following code: |
|
|
|
|
|
```python |
|
|
from transformers import pipeline |
|
|
|
|
|
# Initialize the translation pipeline |
|
|
pipe = pipeline("translation", model="alpha2002/eng_alpha_darija", tokenizer="alpha2002/eng_alpha_darija") |
|
|
|
|
|
# Translate English to Darija |
|
|
input_text = "Hello, how are you?" |
|
|
translation = pipe(input_text, src_lang="en_XX", tgt_lang="ar_AR") |
|
|
|
|
|
print("Translation:", translation[0]['translation_text']) |
|
|
``` |
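If you prefer to call the model directly rather than through the pipeline, the standard mBART-50 generation pattern applies: set the source language on the tokenizer and force the target-language token at the start of generation. A minimal sketch, assuming the repository ships the standard MBart50 tokenizer:

```python
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

model = MBartForConditionalGeneration.from_pretrained("alpha2002/eng_alpha_darija")
tokenizer = MBart50TokenizerFast.from_pretrained("alpha2002/eng_alpha_darija")

# English is the source language; force the decoder to start with the
# Darija/Arabic target token (ar_AR) so generation happens in the right language.
tokenizer.src_lang = "en_XX"
inputs = tokenizer("Hello, how are you?", return_tensors="pt")
generated = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.lang_code_to_id["ar_AR"],
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])
```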
|
|
|
|
|
## Training Details |
|
|
|
|
|
### Training Data |
|
|
|
|
|
The model was trained on a custom dataset containing parallel English and Darija sentences. |
|
|
The dataset was preprocessed to include language tokens specific to mBART's requirements. |
|
|
|
|
|
### Training Procedure |
|
|
|
|
|
|
|
|
#### Preprocessing
|
|
|
|
|
The English source text was tokenized with the `en_XX` language code, and the Darija target text with the `ar_AR` code, following mBART-50's convention of prefixing each sequence with its language token.
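As an illustration, preprocessing with the mBART-50 tokenizer typically looks like the sketch below. The exact training script is not published; the sentence pair is a hypothetical example, and `text_target` tokenizes the labels with the target-language code:

```python
from transformers import MBart50TokenizerFast

tokenizer = MBart50TokenizerFast.from_pretrained(
    "facebook/mbart-large-50-many-to-many-mmt",
    src_lang="en_XX",  # English source
    tgt_lang="ar_AR",  # Darija target, via mBART's Arabic code
)

english = "Hello, how are you?"
darija = "سلام، لاباس؟"  # hypothetical reference translation

# The source is prefixed with en_XX and the labels with ar_AR.
batch = tokenizer(english, text_target=darija, return_tensors="pt")
print(batch.keys())  # input_ids, attention_mask, labels
```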
|
|
|
|
|
#### Training Hyperparameters |
|
|
|
|
|
- **Training regime:** FP16 mixed precision was used during training to improve throughput.
- **Setup:** Training was run on Google Colab on a subset of the data, with gradient accumulation to simulate a larger effective batch size.
|
|
|
|
|
|
|
|
#### Speeds, Sizes, Times
|
|
|
|
|
The model was trained for 2 epochs with a batch size of 4, using the Seq2SeqTrainer from the Hugging Face Transformers library. |
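Putting the reported settings together, a minimal training setup might look like the following sketch. The gradient-accumulation value and the one-example toy dataset are assumptions for illustration; the card states only that accumulation was used and that training ran on a subset of DODa:

```python
from datasets import Dataset
from transformers import (
    DataCollatorForSeq2Seq,
    MBart50TokenizerFast,
    MBartForConditionalGeneration,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

base = "facebook/mbart-large-50-many-to-many-mmt"
tokenizer = MBart50TokenizerFast.from_pretrained(base, src_lang="en_XX", tgt_lang="ar_AR")
model = MBartForConditionalGeneration.from_pretrained(base)

# Toy stand-in for the DODa English-Darija sentence pairs.
pairs = Dataset.from_dict({
    "en": ["Hello, how are you?"],
    "darija": ["سلام، لاباس؟"],  # hypothetical reference
})

def preprocess(batch):
    return tokenizer(batch["en"], text_target=batch["darija"], truncation=True)

train_dataset = pairs.map(preprocess, batched=True, remove_columns=["en", "darija"])

args = Seq2SeqTrainingArguments(
    output_dir="eng_alpha_darija",
    num_train_epochs=2,              # as reported above
    per_device_train_batch_size=4,   # as reported above
    gradient_accumulation_steps=4,   # assumed value; only "gradient accumulation" is stated
    fp16=True,                       # mixed-precision training, as reported
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    tokenizer=tokenizer,
)
trainer.train()
```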
|
|
|
|
|
## Evaluation |
|
|
|
|
|
|
|
|
### Testing Data, Factors & Metrics |
|
|
|
|
|
#### Testing Data |
|
|
|
|
|
The model was evaluated on a small held-out test set of 100 sentences.
|
|
|
|
|
|
|
|
#### Metrics |
|
|
|
|
|
BLEU score was used to measure translation accuracy. |
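For reference, BLEU is commonly computed with sacreBLEU through the 🤗 evaluate library. A minimal sketch with hypothetical predictions and references (the actual evaluation script is not published):

```python
import evaluate

bleu = evaluate.load("sacrebleu")

# Hypothetical outputs and gold translations; the real test set had 100 sentences.
predictions = ["سلام، لاباس؟"]
references = [["سلام، كيف داير؟"]]  # one list of references per prediction

result = bleu.compute(predictions=predictions, references=references)
print(f"BLEU: {result['score']:.1f}")
```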
|
|
|
|
|
### Results |
|
|
|
|
|
The model achieved a BLEU score of 11.6 on the test set, a modest result that is reasonable given the complexity of translating between languages with different scripts and linguistic structures.
|
|
|
|
|
|
|
|
|
|
|
## Environmental Impact |
|
|
|
|
|
- **Hardware Type:** Google Colab GPU (NVIDIA Tesla K80) |
|
|
- **Hours used:** Approximately 2 hours for training and 1 hour for testing.
|
|
|
|
|
|
|
|
|
|
|
## Citation
|
|
|
|
|
**BibTeX:** |
|
|
|
|
|
```bibtex
@misc{lahnouki2024eng_alpha_darija,
  author = {Aicha Lahnouki},
  title  = {English-to-Darija Translation Model},
  year   = {2024},
  url    = {https://huggingface.co/alpha2002/eng_alpha_darija},
}
```
|
|
|
|
|
## Model Card Authors
|
|
|
|
|
Aicha Lahnouki
|
|
## Model Card Contact |
|
|
|
|
|
Email: aichalahnouki@gmail.com