Model Card for M2M100-Algerian-Dialect-to-MSA
Model Details
Model Description
This model is a fine-tuned version of Facebook's M2M100 (418M parameters), adapted for translating Algerian dialect (ARQ) into Modern Standard Arabic (ARB). Fine-tuning used a parallel dataset of 137,000 sentence pairs to improve the model's translation accuracy for this specific language pair.
- Model type: Multilingual Machine Translation (Transformer, Encoder–Decoder)
- Language(s) (NLP): Algerian Dialect (ARQ) → Modern Standard Arabic (ARB)
- Finetuned from model: facebook/m2m100_418M
Uses
Direct Use
This model can be used for:
• Translating Algerian dialect (ARQ) text into Modern Standard Arabic (ARB).
• Normalizing dialectal Arabic text for downstream NLP applications.
• Improving Arabic language understanding systems that need to handle Algerian dialect.
Downstream Use
This model could be used in language translation applications, chatbots, or other NLP systems that require Algerian dialect processing.
Bias, Risks, and Limitations
• Bias: The model might reflect biases present in the training data, particularly linguistic or cultural biases.
• Risks: Incorrect or misleading translations may occur, especially with highly ambiguous or slang terms.
• Limitations: It is specific to Algerian dialect (ARQ) and Modern Standard Arabic (ARB) and may not generalize to other dialects, languages, or specialized domains.
Recommendations
Users should review translations before relying on them in sensitive contexts, since the model may mistranslate ambiguous or slang expressions, and should not expect it to generalize beyond the ARQ → ARB direction it was trained on.
How to Get Started with the Model
Use the code below to get started with the model.
```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import torch

# Model repository
repo_name = "Aicha-zkr/M2M100-Algerian-Dialect-to-MSA"

# Load model and tokenizer
model = AutoModelForSeq2SeqLM.from_pretrained(repo_name)
tokenizer = AutoTokenizer.from_pretrained(repo_name)

# Device setup
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
model.eval()

# M2M100 uses ISO 639-1 language codes; "ar" is the Arabic code.
tokenizer.src_lang = "ar"
arabic_lang_id = tokenizer.get_lang_id("ar")  # target: Modern Standard Arabic

# Example Algerian dialect sentences
source_sentences = [
    "كي العادة راني نخدم بزاف وما لقيتش وقت نرتاح",
    "اليوم الجو حار، ما قدرتش نخرج",
    "واش راك داير؟ نحتاجو نتلاقاو",
    "أنا عندي مشكلة في الانترنت، ما يشتغلش",
    "راني محتار بين هاد الخيارين",
    "واش رأيك في هاد الفيلم؟ كان مليح",
    "شحال من مرة قلتلك ما تديرهاش؟",
    "أحتاج نروح عند الطبيب بكري",
    "إلى كانت الخدمة صعيبة، خليها",
    "خليت الدار وراحت الرحلة كلها كانت ممتازة",
]

# Translate each sentence to Modern Standard Arabic
for source_sentence in source_sentences:
    # Tokenize
    encoded_input = tokenizer(
        source_sentence,
        return_tensors="pt",
        padding=True,
        truncation=True,
        max_length=128,
    ).to(device)

    # Generate translation, forcing the MSA target language token
    with torch.no_grad():
        generated_tokens = model.generate(
            **encoded_input,
            forced_bos_token_id=arabic_lang_id,
        )

    # Decode output
    translated_sentence = tokenizer.batch_decode(
        generated_tokens, skip_special_tokens=True
    )[0]

    print(f"Original: {source_sentence}")
    print(f"Translation: {translated_sentence}\n")
```
Training Details
Training Data
The model was fine-tuned on a parallel dataset of 137,000 sentence pairs of Algerian dialect (ARQ) and Modern Standard Arabic (ARB), which allowed it to specialize in translating this specific dialect.
• 137k ARQ → ARB sentence pairs.
• 14k high-quality, human-labeled pairs.
• GPT-4o-generated translations (manually verified).
• Additional manually translated samples.
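As a rough illustration of how such parallel data is typically structured before tokenization, the sketch below wraps sentence pairs in the nested `translation` dict format commonly used for translation datasets. The field names `arq`/`arb` and the helper `to_translation_records` are illustrative assumptions, not details from this card.

```python
def to_translation_records(pairs):
    """Wrap (dialect, msa) sentence pairs in the nested dict format
    commonly used for translation datasets."""
    return [{"translation": {"arq": src, "arb": tgt}} for src, tgt in pairs]

# Hypothetical example pair (Algerian dialect → MSA)
pairs = [
    ("واش راك داير؟", "كيف حالك؟"),
]
records = to_translation_records(pairs)
print(records[0]["translation"]["arb"])
```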
Training Hyperparameters
• Max sequence length: 128 tokens.
• Batch size: 16.
• Learning rate: 5e-5 (linear decay).
• Epochs: 2.
• Precision: mixed precision (FP16).
• Training time: ~5.5 hours on Kaggle P100 GPU.
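For reference, the hyperparameters above might map onto the `transformers` `Seq2SeqTrainingArguments` API roughly as sketched below. This is an assumption about the setup, not the authors' actual training script; `output_dir` is a hypothetical path, and the max sequence length (128) is applied at tokenization time rather than here.

```python
from transformers import Seq2SeqTrainingArguments

# Sketch only: values mirror the hyperparameters listed above.
training_args = Seq2SeqTrainingArguments(
    output_dir="m2m100-arq-arb",     # hypothetical output path
    per_device_train_batch_size=16,  # batch size: 16
    learning_rate=5e-5,              # learning rate: 5e-5
    lr_scheduler_type="linear",      # linear decay
    num_train_epochs=2,              # epochs: 2
    fp16=True,                       # mixed FP16 precision
)
```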
Model Card Contact
Email: aicha.zenakhri@ensia.edu.dz