---
language: ar
license: mit
library_name: transformers
pipeline_tag: text-classification
datasets:
- custom
tags:
- arabic
- text-classification
- iraqi-dialect
- msa
- message-classification
- xlm-roberta
widget:
- text: "السلام عليكم ورحمة الله وبركاته"
  example_title: "Arabic Greeting"
- text: "شلونك اليوم؟"
  example_title: "Iraqi Question"
- text: "عندي مشكلة بالانترنت"
  example_title: "Iraqi Complaint"
- text: "أحب القراءة كثيراً"
  example_title: "General Statement"
model-index:
- name: Arabic_MI_Classifier
  results:
  - task:
      type: text-classification
      name: Text Classification
    dataset:
      type: custom
      name: Arabic Messages Dataset
    metrics:
    - type: accuracy
      value: 0.95
      name: Accuracy
---

# Arabic Message Classification Model

This model fine-tunes XLM-RoBERTa for Arabic message classification, supporting both Modern Standard Arabic (MSA) and the Iraqi dialect.

## Model Description

- **Developed by:** Ahmed Majid
- **Model type:** XLM-RoBERTa for Sequence Classification
- **Language(s):** Arabic (MSA and Iraqi dialect)
- **License:** MIT
- **Finetuned from model:** morit/arabic_xlm_xnli

## Intended Uses

### Direct Use

This model can be used for:

- Classifying Arabic messages in customer service systems
- Organizing Arabic text messages by intent
- Building chatbots for Arabic-speaking users
- Content moderation for Arabic forums and social media

### Downstream Use

The model can be further fine-tuned for:

- Other Arabic dialects
- Domain-specific message classification
- Multi-label classification tasks

## How to Get Started with the Model

Use the code below to get started with the model.
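If the dependencies are not yet installed, a typical setup looks like the following (this is an assumption about your environment; the card does not pin exact versions):

```shell
pip install transformers torch
```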
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

model_name = "ahmedmajid92/Arabic_MI_Classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)

result = classifier("السلام عليكم")
print(result)
```

## Training Details

### Training Data

The model was trained on a custom dataset of 5,000 Arabic messages:

- 50% Modern Standard Arabic (MSA)
- 50% Iraqi dialect
- 4 classes: greeting, question, complaint, general
- Balanced distribution: 1,250 examples per class

### Training Procedure

#### Preprocessing

- Tokenization using the XLM-RoBERTa tokenizer
- Maximum sequence length: 128 tokens
- Padding and truncation applied

#### Training Hyperparameters

- **Training regime:** fp32
- **Epochs:** 20
- **Batch size:** 8 (training), 16 (evaluation)
- **Learning rate:** Default AdamW
- **Warmup steps:** Not specified
- **Weight decay:** Default
- **Optimizer:** AdamW

#### Speeds, Sizes, Times

- **Model size:** ~280M parameters
- **Training time:** Approximately 2-3 hours on GPU
- **Inference time:** ~50 ms per message on GPU

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

- 10% of the custom dataset (500 examples)
- Balanced across all 4 classes
- Mix of MSA and Iraqi dialect

#### Factors

- **Dialects:** MSA vs. Iraqi dialect
- **Message length:** Short to medium messages
- **Domain:** General conversational messages

#### Metrics

- **Accuracy:** Primary metric
- **Per-class performance:** Evaluated for each label

### Results

The model reaches 95% accuracy on the held-out test set, with consistent performance across all four classes:

- Greeting detection (both MSA and Iraqi)
- Question identification
- Complaint classification
- General statement recognition

## Environmental Impact

Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute).

## Technical Specifications

### Model Architecture and Objective

- **Architecture:** XLM-RoBERTa with a classification head
- **Objective:** Multi-class text classification
- **Base model:** morit/arabic_xlm_xnli
- **Classification head:** 4 output classes

### Compute Infrastructure

#### Hardware

- **GPU:** NVIDIA GPU (recommended)
- **Memory:** 8 GB+ GPU memory recommended

#### Software

- **Framework:** PyTorch
- **Libraries:** Transformers, Datasets
- **Python version:** 3.8+

## Bias, Risks, and Limitations

### Bias

- The model may exhibit biases present in the training data
- Performance may vary between different Arabic dialects
- Regional variations in the Iraqi dialect may not be fully captured

### Risks

- Misclassification of ambiguous messages
- Potential cultural bias in greeting/complaint detection
- Limited generalization across formal/informal register variations

### Limitations

- Only supports 4 predefined classes
- Optimized specifically for MSA and the Iraqi dialect
- Maximum input length of 128 tokens
- May not generalize well to other Arabic dialects

## Additional Information

### Author

Ahmed Majid

### Licensing Information

This model is released under the MIT License.

### Citation Information

```bibtex
@misc{arabic-mi-classifier,
  title={Arabic Message Classification Model},
  author={Ahmed Majid},
  year={2025},
  howpublished={Hugging Face Model Hub},
  url={https://huggingface.co/ahmedmajid92/Arabic_MI_Classifier}
}
```

### Acknowledgments

- Based on the XLM-RoBERTa model by Facebook AI
- Fine-tuned from morit/arabic_xlm_xnli
- Trained on a custom Arabic message dataset
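
## Appendix: Working with Raw Logits

The quick-start pipeline earlier in this card returns only the top label. When calling the model directly, its raw logits can be converted to per-class probabilities with a softmax. The sketch below uses hypothetical logits and an assumed label order, since the card does not publish the model's id2label mapping:

```python
import torch

# Assumed label order -- hypothetical; the id2label mapping is not
# published in this card.
labels = ["greeting", "question", "complaint", "general"]

# Hypothetical logits for a single message over the 4 classes.
logits = torch.tensor([[3.2, 0.1, -1.0, 0.4]])

# Softmax converts the logits into probabilities that sum to 1.
probs = torch.softmax(logits, dim=-1)
pred = labels[int(probs.argmax(dim=-1))]
print(pred)  # "greeting" is the top class for these example logits
```

With the real model, the same post-processing applies to `model(**tokenizer(text, return_tensors="pt")).logits`.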