	---
	language: ar
	license: mit
	library_name: transformers
	pipeline_tag: text-classification
	datasets:
	- custom
	tags:
	- arabic
	- text-classification
	- iraqi-dialect
	- msa
	- message-classification
	- xlm-roberta
	- fine-tuned
	widget:
	- text: "السلام عليكم ورحمة الله وبركاته"
	  example_title: "Arabic Greeting (MSA)"
	- text: "هلو شلونك اليوم؟"
	  example_title: "Iraqi Greeting + Question"
	- text: "متى يبدأ الاجتماع؟"
	  example_title: "Question (MSA)"
	- text: "عندي مشكلة بالانترنت"
	  example_title: "Complaint (Iraqi)"
	- text: "أحب القراءة والكتابة"
	  example_title: "General Statement (MSA)"
	- text: "الكهرباء نفطت"
	  example_title: "Complaint (Iraqi)"
	model-index:
	- name: Arabic_MI_Classifier
	  results:
	  - task:
	      type: text-classification
	      name: Text Classification
	    dataset:
	      type: custom
	      name: Arabic Messages Dataset (MSA + Iraqi)
	    metrics:
	    - type: accuracy
	      value: 0.95
	      name: Accuracy
	base_model: morit/arabic_xlm_xnli
	---

	# Arabic Message Classification Model

	## Model Description

	This model is a fine-tuned XLM-RoBERTa classifier for Arabic messages written in Modern Standard Arabic (MSA) or Iraqi dialect. It is based on `morit/arabic_xlm_xnli` and was fine-tuned on a custom dataset of 5,000 Arabic messages.

	## Model Details

	- Base Model: `morit/arabic_xlm_xnli`
	- Architecture: XLMRobertaForSequenceClassification
	- Language: Arabic (MSA and Iraqi dialect)
	- Task: Text Classification
	- Number of Labels: 4
	- Model Size: ~280M parameters

	## Labels

	The model classifies messages into four categories:

	| Label ID | Label Name | Description | Examples |
	|----------|------------|-------------|----------|
	| 0 | greeting | Greetings and salutations | "السلام عليكم", "هلو", "مرحبا" |
	| 1 | question | Questions and inquiries | "كيف حالك؟", "شلونك؟", "متى الاجتماع؟" |
	| 2 | complaint | Complaints and problems | "عندي مشكلة", "الانترنت معطل", "الجهاز لا يعمل" |
	| 3 | general | General statements | "أحب القراءة", "أعمل مهندساً", "أسافر كثيراً" |
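
The label table can be expressed as a pair of lookup dictionaries. This is a minimal sketch: the names `ID2LABEL`, `LABEL2ID`, and `label_name` are illustrative and are not part of the published model.

```python
# Label mapping copied from the table in this card.
ID2LABEL = {0: "greeting", 1: "question", 2: "complaint", 3: "general"}

# Reverse mapping, handy when encoding string labels for training or evaluation.
LABEL2ID = {name: idx for idx, name in ID2LABEL.items()}


def label_name(class_id: int) -> str:
    """Return the label name for a class ID, raising on unknown IDs."""
    if class_id not in ID2LABEL:
        raise ValueError(f"Unknown class ID: {class_id}")
    return ID2LABEL[class_id]


print(label_name(2))  # complaint
```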

	## Training Data

	The model was trained on a custom dataset containing:
	- 5,000 Arabic messages (50% MSA, 50% Iraqi dialect)
	- Balanced distribution: 1,250 examples per class
	- Train/Test Split: 90%/10%
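
The figures in this list imply the following split sizes. This is a quick arithmetic check, assuming the 90%/10% split is applied to all 5,000 messages; the card does not say whether the split was stratified by class or dialect.

```python
TOTAL_MESSAGES = 5_000
NUM_CLASSES = 4
TRAIN_PERCENT = 90  # from the stated 90%/10% train/test split

per_class = TOTAL_MESSAGES // NUM_CLASSES            # examples per class
per_dialect = TOTAL_MESSAGES // 2                    # 50% MSA, 50% Iraqi
train_size = TOTAL_MESSAGES * TRAIN_PERCENT // 100   # training examples
test_size = TOTAL_MESSAGES - train_size              # held-out examples

print(per_class, per_dialect, train_size, test_size)  # 1250 2500 4500 500
```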

	## Training Details

	- Training Epochs: 20
	- Batch Size: 8 (training), 16 (evaluation)
	- Optimizer: AdamW with the default learning rate
	- Maximum Sequence Length: 128 tokens
	- Evaluation Strategy: Every 500 steps
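
To get a feel for what these settings mean in practice, the step counts can be estimated with simple arithmetic. This sketch assumes 4,500 training examples (90% of 5,000), a single device, no gradient accumulation, and that the final partial batch of each epoch is kept; none of these details are confirmed by the card.

```python
import math

TRAIN_EXAMPLES = 4_500   # assumption: 90% of the 5,000 messages
TRAIN_BATCH_SIZE = 8
EPOCHS = 20
EVAL_EVERY_STEPS = 500

# One optimizer step per batch; the last batch of an epoch may be smaller.
steps_per_epoch = math.ceil(TRAIN_EXAMPLES / TRAIN_BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS
num_evaluations = total_steps // EVAL_EVERY_STEPS

print(steps_per_epoch, total_steps, num_evaluations)  # 563 11260 22
```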

	## Usage

	### Using Transformers Pipeline

	```python
	from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

	# Load the model and tokenizer
	model_name = "ahmedmajid92/Arabic_MI_Classifier"
	tokenizer = AutoTokenizer.from_pretrained(model_name)
	model = AutoModelForSequenceClassification.from_pretrained(model_name)

	# Create a classification pipeline
	classifier = pipeline(
	    "text-classification",
	    model=model,
	    tokenizer=tokenizer,
	)

	# Classify a message
	text = "السلام عليكم ورحمة الله"
	result = classifier(text)
	print(f"Label: {result[0]['label']}, Score: {result[0]['score']:.4f}")
	```
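
Depending on how the model's configuration was saved, the pipeline may return generic names such as `LABEL_0` rather than `greeting`; this card does not state which. A small helper, sketched under that assumption, normalizes either form using the label table from this card. The name `normalize_label` is hypothetical.

```python
# Label mapping copied from the Labels section of this card.
ID2LABEL = {0: "greeting", 1: "question", 2: "complaint", 3: "general"}


def normalize_label(raw_label: str) -> str:
    """Map a generic 'LABEL_<id>' name to its class name; pass other names through."""
    prefix = "LABEL_"
    suffix = raw_label[len(prefix):]
    if raw_label.startswith(prefix) and suffix.isdigit():
        return ID2LABEL[int(suffix)]
    return raw_label


print(normalize_label("LABEL_1"))   # question
print(normalize_label("greeting"))  # greeting
```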

	### Using the Model Directly

	```python
	from transformers import AutoTokenizer, AutoModelForSequenceClassification
	import torch

	# Load model and tokenizer
	model_name = "ahmedmajid92/Arabic_MI_Classifier"
	tokenizer = AutoTokenizer.from_pretrained(model_name)
	model = AutoModelForSequenceClassification.from_pretrained(model_name)

	# Tokenize input
	text = "شلونك اليوم؟"
	inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=128)

	# Get predictions
	with torch.no_grad():
	    outputs = model(**inputs)
	    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
	    predicted_class_id = predictions.argmax().item()
	    confidence = predictions.max().item()

	# Map to label names
	id2label = {0: "greeting", 1: "question", 2: "complaint", 3: "general"}
	predicted_label = id2label[predicted_class_id]

	print(f"Text: {text}")
	print(f"Predicted Label: {predicted_label}")
	print(f"Confidence: {confidence:.4f}")
	```
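
The softmax step in the snippet above converts raw logits into probabilities that sum to 1, and the reported confidence is the largest of them. A minimal pure-Python sketch with made-up logits (the values are illustrative, not model output):

```python
import math


def softmax(logits):
    """Convert a list of raw scores into probabilities that sum to 1."""
    # Subtracting the maximum improves numerical stability without changing the result.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for the four classes: greeting, question, complaint, general.
example_logits = [0.2, 3.1, -0.5, 0.4]
probs = softmax(example_logits)

predicted_class_id = probs.index(max(probs))  # index of the largest probability
confidence = max(probs)
print(predicted_class_id, round(confidence, 4))
```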

	### Gradio Web Interface

	```python
	import gradio as gr
	from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

	# Load model
	model_name = "ahmedmajid92/Arabic_MI_Classifier"
	tokenizer = AutoTokenizer.from_pretrained(model_name)
	model = AutoModelForSequenceClassification.from_pretrained(model_name)

	# Create classifier
	classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)

	def classify_text(text):
	    result = classifier(text)[0]
	    return result["label"], float(result["score"])

	# Create Gradio interface
	iface = gr.Interface(
	    fn=classify_text,
	    inputs=gr.Textbox(lines=2, placeholder="اكتب جملتك هنا…", label="Input Text"),
	    outputs=[
	        gr.Textbox(label="Predicted Label"),
	        gr.Number(label="Confidence"),
	    ],
	    title="Arabic Message Classifier",
	    description="Classify Arabic messages into: greeting, question, complaint, or general.",
	)

	iface.launch()
	```

	## Model Performance

	The metadata for this model reports an accuracy of 0.95 on the custom Arabic messages dataset. The model is described as particularly effective at:
	- Distinguishing between greetings and general statements
	- Identifying questions in both MSA and Iraqi dialect
	- Classifying complaints and technical issues
	- Handling mixed dialectal variations

	## Supported Dialects

	- Modern Standard Arabic (MSA): Formal Arabic text
	- Iraqi Dialect: Colloquial Iraqi Arabic expressions and vocabulary

	## Limitations

	- The model is specifically trained on MSA and Iraqi dialect; performance may vary with other Arabic dialects
	- Limited to 4 predefined categories
	- Performance depends on the similarity of input text to training data patterns
	- The model was trained with a maximum sequence length of 128 tokens; longer inputs should be truncated to that length

	## Ethical Considerations

	This model is intended for text classification purposes and should be used responsibly. Users should be aware that:
	- The model may reflect biases present in the training data
	- Performance may vary across different Arabic dialects not represented in training
	- The model should not be used for sensitive applications without proper validation

	## Citation

	If you use this model in your research, please cite:

	```bibtex
	@misc{arabic-mi-classifier,
	title={Arabic Message Classification Model},
	author={Ahmed Majid},
	year={2025},
	howpublished={Hugging Face Model Hub},
	url={https://huggingface.co/ahmedmajid92/Arabic_MI_Classifier}
	}
	```

	## Model Card

	For more detailed information about the model's intended use, training data, and ethical considerations, please refer to the model card.

	## Contact

	For questions or issues, please contact ahmed1991madrid@gmail.com or create an issue in the model repository.

	## License

	This model is released under the MIT License, same as the base model `morit/arabic_xlm_xnli`.