---
language: ar
license: mit
library_name: transformers
pipeline_tag: text-classification
datasets:
- custom
tags:
- arabic
- text-classification
- iraqi-dialect
- msa
- message-classification
- xlm-roberta
- fine-tuned
widget:
- text: "السلام عليكم ورحمة الله وبركاته"
  example_title: "Arabic Greeting (MSA)"
- text: "هلو شلونك اليوم؟"
  example_title: "Iraqi Greeting + Question"
- text: "متى يبدأ الاجتماع؟"
  example_title: "Question (MSA)"
- text: "عندي مشكلة بالانترنت"
  example_title: "Complaint (Iraqi)"
- text: "أحب القراءة والكتابة"
  example_title: "General Statement (MSA)"
- text: "الكهرباء نفطت"
  example_title: "Complaint (Iraqi)"
model-index:
- name: Arabic_MI_Classifier
  results:
  - task:
      type: text-classification
      name: Text Classification
    dataset:
      type: custom
      name: Arabic Messages Dataset (MSA + Iraqi)
    metrics:
    - type: accuracy
      value: 0.95
      name: Accuracy
base_model: morit/arabic_xlm_xnli
---

# Arabic Message Classification Model

## Model Description

This is a fine-tuned XLM-RoBERTa model for Arabic message classification, designed to classify messages written in either Modern Standard Arabic (MSA) or Iraqi dialect. The model is based on `morit/arabic_xlm_xnli` and has been fine-tuned on a custom dataset of 5,000 Arabic messages.

## Model Details

- **Base Model**: `morit/arabic_xlm_xnli`
- **Architecture**: XLMRobertaForSequenceClassification
- **Language**: Arabic (MSA and Iraqi dialect)
- **Task**: Text Classification
- **Number of Labels**: 4
- **Model Size**: ~280M parameters

## Labels

The model classifies messages into four categories:

| Label ID | Label Name | Description | Examples |
|----------|------------|-------------|----------|
| 0 | greeting | Greetings and salutations | "السلام عليكم", "هلو", "مرحبا" |
| 1 | question | Questions and inquiries | "كيف حالك؟", "شلونك؟", "متى الاجتماع؟" |
| 2 | complaint | Complaints and problems | "عندي مشكلة", "الانترنت معطل", "الجهاز لا يعمل" |
| 3 | general | General statements | "أحب القراءة", "أعمل مهندساً", "أسافر كثيراً" |

## Training Data

The model was trained on a custom dataset containing:

- **5,000 Arabic messages** (50% MSA, 50% Iraqi dialect)
- **Balanced distribution**: 1,250 examples per class
- **Train/Test Split**: 90%/10%

## Training Details

- **Training Epochs**: 20
- **Batch Size**: 8 (training), 16 (evaluation)
- **Learning Rate**: Default value for the AdamW optimizer
- **Maximum Sequence Length**: 128 tokens
- **Evaluation Strategy**: Every 500 steps

## Usage

### Using Transformers Pipeline

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

# Load the model and tokenizer
model_name = "ahmedmajid92/Arabic_MI_Classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Create a classification pipeline
classifier = pipeline(
    "text-classification",
    model=model,
    tokenizer=tokenizer
)

# Classify a message
text = "السلام عليكم ورحمة الله"
result = classifier(text)
print(f"Label: {result[0]['label']}, Score: {result[0]['score']:.4f}")
```

### Using the Model Directly

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
model_name = "ahmedmajid92/Arabic_MI_Classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Tokenize input (truncated to the 128-token training length)
text = "شلونك اليوم؟"
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=128)

# Get predictions without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
    predicted_class_id = predictions.argmax().item()
    confidence = predictions.max().item()

# Map to label names
id2label = {0: "greeting", 1: "question", 2: "complaint", 3: "general"}
predicted_label = id2label[predicted_class_id]

print(f"Text: {text}")
print(f"Predicted Label: {predicted_label}")
print(f"Confidence: {confidence:.4f}")
```

### Gradio Web Interface

```python
import gradio as gr
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

# Load model
model_name = "ahmedmajid92/Arabic_MI_Classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Create classifier
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)

def classify_text(text):
    # Return the top label and its score for a single input string
    result = classifier(text)[0]
    return result["label"], float(result["score"])

# Create Gradio interface
iface = gr.Interface(
    fn=classify_text,
    inputs=gr.Textbox(lines=2, placeholder="اكتب جملتك هنا…", label="Input Text"),
    outputs=[
        gr.Textbox(label="Predicted Label"),
        gr.Number(label="Confidence")
    ],
    title="Arabic Message Classifier",
    description="Classify Arabic messages into: greeting, question, complaint, or general."
)

iface.launch()
```

## Model Performance

The model reaches an accuracy of 0.95 on the held-out test set (the figure reported in this card's metadata). It is particularly effective at:

- Distinguishing between greetings and general statements
- Identifying questions in both MSA and Iraqi dialect
- Classifying complaints and technical issues
- Handling mixed dialectal variations

## Supported Dialects

- **Modern Standard Arabic (MSA)**: Formal Arabic text
- **Iraqi Dialect**: Colloquial Iraqi Arabic expressions and vocabulary

## Limitations

- The model is trained only on MSA and Iraqi dialect; performance may vary with other Arabic dialects
- Limited to 4 predefined categories
- Performance depends on how closely the input text resembles the training data
- Maximum input length is 128 tokens (the sequence length used during training)

## Ethical Considerations

This model is intended for text classification purposes and should be used responsibly. Users should be aware that:

- The model may reflect biases present in the training data
- Performance may vary across Arabic dialects not represented in training
- The model should not be used for sensitive applications without proper validation

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{arabic-mi-classifier,
  title={Arabic Message Classification Model},
  author={Ahmed Majid},
  year={2025},
  howpublished={Hugging Face Model Hub},
  url={https://huggingface.co/ahmedmajid92/Arabic_MI_Classifier}
}
```

## Model Card

For more detailed information about the model's intended use, training data, and ethical considerations, please refer to the model card.

## Contact

For questions or issues, please contact ahmed1991madrid@gmail.com or create an issue in the model repository.

## License

This model is released under the MIT License, same as the base model `morit/arabic_xlm_xnli`.
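
As a supplement to the "Using the Model Directly" example, and given the note under Ethical Considerations about validating the model before sensitive use, here is a minimal sketch of how raw logits could be turned into a label with a confidence fallback. It uses only the standard library so it can be read without loading the model. The `decode_prediction` helper, the `"uncertain"` fallback label, and the 0.70 threshold are illustrative assumptions, not part of the released model; only the label mapping comes from the Labels table.

```python
import math

# Label mapping as listed in the Labels table of this card.
ID2LABEL = {0: "greeting", 1: "question", 2: "complaint", 3: "general"}

# Hypothetical cut-off (an assumption, not from the model card); tune on validation data.
CONFIDENCE_THRESHOLD = 0.70


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def decode_prediction(logits, threshold=CONFIDENCE_THRESHOLD):
    """Return (label, confidence); the label is 'uncertain' below the threshold."""
    probs = softmax(logits)
    best_id = max(range(len(probs)), key=probs.__getitem__)
    confidence = probs[best_id]
    label = ID2LABEL[best_id] if confidence >= threshold else "uncertain"
    return label, confidence


# Made-up logits for illustration only, not real model outputs.
print(decode_prediction([4.2, 0.3, -1.0, 0.1]))  # one class clearly dominates
print(decode_prediction([0.2, 0.1, 0.0, 0.1]))   # near-uniform scores fall back to 'uncertain'
```

In practice the list passed to `decode_prediction` would be `outputs.logits[0].tolist()` from the direct-usage example; routing low-confidence messages to a human or a default handler is one possible design, not a recommendation from the model author.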