	---
	license: apache-2.0
	language:
	- ar
	- arz
	- ary
	base_model:
	- UBC-NLP/MARBERTv2
	pipeline_tag: text-classification
	library_name: transformers
	datasets:
	- iabufarha/ar_sarcasm
	- Abdelrahman-Rezk/Arabic_Dialect_Identification
	- asas-ai/DART
	- arbml/ArSarcasm_v2
	- evageon/IADD
	---

	# ✍🏻 MARBERTv2 Arabic Written Dialect Classifier

	## Model Overview

	This model is a fine-tuned version of [`UBC-NLP/MARBERTv2`](https://huggingface.co/UBC-NLP/MARBERTv2) for Arabic written dialect classification. It identifies Modern Standard Arabic (MSA) and 4 regional Arabic dialects from raw text.

	This model is intended for use in tasks such as dialect identification, linguistic research, and dialect-aware natural language processing systems.

	---

	## 📌 Model Details

	This model is fine-tuned from MARBERTv2, a transformer-based language model optimized for Arabic, on a multi-dialect classification task. It distinguishes five written varieties of Arabic, namely four regional dialect groups and MSA:

	- MAGHREB (North African dialects)
	- LEV (Levantine dialects)
	- MSA (Modern Standard Arabic)
	- GLF (Gulf dialects)
	- EGY (Egyptian Arabic)

	It is intended for dialect identification in short Arabic text snippets from various sources including social media, forums, and informal writing.

	---

	## 📊 Labels (`id2label`)

	The model predicts one of five classes. The mapping from class index to label is:

	```json
	{
	  "0": "MAGHREB",
	  "1": "LEV",
	  "2": "MSA",
	  "3": "GLF",
	  "4": "EGY"
	}
	```

	The labels cover the following varieties:

	- `MAGHREB`: Maghreb dialect (Northwest Africa: Morocco, Algeria, Tunisia, etc.)
	- `LEV`: Levantine dialect (Lebanon, Syria, Jordan, Palestine)
	- `MSA`: Modern Standard Arabic
	- `GLF`: Gulf dialect (Saudi Arabia, UAE, Kuwait, etc.)
	- `EGY`: Egyptian dialect
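When evaluating predictions against gold labels, the inverse lookup is often needed as well. The sketch below copies the five labels from this section into a plain dictionary; `ID2LABEL`, `LABEL2ID`, and `decode` are illustrative names, not part of the model's API. Once the model is loaded, `model.config.id2label` should expose the same mapping.

```python
# Label mapping copied from this model card (keys are class indices).
ID2LABEL = {0: "MAGHREB", 1: "LEV", 2: "MSA", 3: "GLF", 4: "EGY"}

# Inverse lookup, useful when encoding gold labels for evaluation.
LABEL2ID = {label: idx for idx, label in ID2LABEL.items()}


def decode(class_ids):
    """Turn a sequence of predicted class indices into label strings."""
    return [ID2LABEL[i] for i in class_ids]


print(decode([4, 2, 0]))
```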

	---

	## 📚 Training Data

	The model was trained on roughly 850,000 Arabic sentences drawn from 9 publicly available datasets, covering a wide variety of written Arabic dialects. The per-dialect counts listed below sum to 825,922 sentences.

	### Distribution by Dialect

	| Dialect | Count |
	|---------|---------|
	| GLF | 253,553 |
	| LEV | 243,025 |
	| MAGHREB | 140,887 |
	| EGY | 105,226 |
	| MSA | 83,231 |
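The distribution is noticeably imbalanced: GLF has about three times as many sentences as MSA. As a rough illustration, the sketch below computes each dialect's share of the data and a "balanced" inverse-frequency weight per class. This card does not say whether any class weighting or resampling was used in training, so the weighting scheme here is an assumption shown only to quantify the imbalance.

```python
# Per-dialect sentence counts copied from the table above.
COUNTS = {
    "GLF": 253_553,
    "LEV": 243_025,
    "MAGHREB": 140_887,
    "EGY": 105_226,
    "MSA": 83_231,
}

total = sum(COUNTS.values())
num_classes = len(COUNTS)

# Share of the training data per dialect.
shares = {label: n / total for label, n in COUNTS.items()}

# Balanced inverse-frequency weights: total / (num_classes * count).
# A weight above 1 marks an under-represented class (assumed scheme).
weights = {label: total / (num_classes * n) for label, n in COUNTS.items()}

for label in COUNTS:
    print(f"{label:8s} share={shares[label]:.1%} weight={weights[label]:.2f}")
```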

	---

	## ⚙️ Training Details

	- Architecture: MARBERTv2 (BERT-based)
	- Task: Text Classification (Dialect Identification)
	- Objective: Multi-class classification with softmax over 5 dialect classes
	- Tokenizer: `UBC-NLP/MARBERTv2`
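To make the objective concrete, here is a minimal pure-Python sketch of a softmax over five class scores and the matching cross-entropy loss. Cross-entropy is assumed because it is the usual loss paired with a softmax classifier; this card does not state the loss explicitly, and the logits below are made up for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    norm = sum(exps)
    return [e / norm for e in exps]


def cross_entropy(logits, target):
    """Negative log-probability assigned to the correct class index."""
    return -math.log(softmax(logits)[target])


# Made-up logits for one sentence, ordered by class index 0..4.
example_logits = [0.2, -1.0, 0.5, 0.1, 3.0]
probs = softmax(example_logits)

print([round(p, 3) for p in probs])
print(round(cross_entropy(example_logits, target=4), 3))
```

The loss is small when the correct class already has the highest score and grows as probability mass moves to other classes.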

	---

	## 📂 Datasets Used

	Below is a detailed overview of the datasets used in training and/or considered during development:

	| Dataset | Brief Description | Annotation Strategy | Provided Labels | Current SOTA Performance |
	| :---: | :---: | :---: | :---: | :---: |
	| MADAR Subtask-1 (MADAR-6) | A collection of `parallel sentences (BTEC)` covering the dialects of `5 cities from the Arab World and MSA` in the travel domain `(10,000 sentences per city)` | Manual | 5 Arab Cities + MSA | 92.5% Accuracy |
	| MADAR Subtask-1 (MADAR-26) | A collection of `parallel sentences (BTEC)` covering the dialects of `25 cities from the Arab World and MSA` in the travel domain `(2,000 sentences per city)` | Manual | 25 Arab Cities + MSA | 67.32% F1-Score |
	| DART | `25K tweets` annotated via crowdsourcing and well balanced across five main groups of Arabic dialects | Manual | 5 Arab Regions | UNK |
	| ArSarcasm v1 | `10,547 tweets` from the `ASTD and SemEval datasets` for sarcasm detection, with dialect information added | Manual | 4 Arab Regions + MSA | UNK |
	| ArSarcasm v2 | Contains `15,548 tweets` and extends the original ArSarcasm dataset `(ArSarcasm v1 plus portions of the DAICT corpus and some new tweets)` | Manual | 4 Arab Regions + MSA | UNK |
	| IADD | `Five publicly available corpora` were identified, analyzed, and filtered to build IADD `(AOC, DART, PADIC, SHAMI and TSAC)` | Not specified | 5 Regions and 9 Countries | UNK |
	| QADI | `540k tweets` (30k per country on average) with a total of 8.8M words | Automatic | 18 Arab Countries | 60.6% |
	| AOC | The Arabic Online Commentary dataset is based on reader commentary from the online versions of three Arabic newspapers: `AlGhad from JOR, Al-Riyadh from KSA, and Al-Youm Al-Sabe’ from EGY` | Manual | 3 Arab Regions + MSA | UNK |
	| NADI-2020 | `25,957 tweets` from 100 Arab provinces and 21 Arab countries | Automatic | 100 Provinces and 21 Countries | 6.39% - 26.78% |
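The datasets above label text at different granularities (cities, countries, or regions), while the model predicts five coarse classes. This card does not describe how the labels were merged, so the following is only a hypothetical sketch of one way to do it. `COUNTRY_TO_REGION` and `harmonize` are illustrative names, and the mapping includes only the countries named in this card's label descriptions.

```python
# Hypothetical mapping from country-level labels to the model's five classes.
# Only countries named in this card's label descriptions are included.
COUNTRY_TO_REGION = {
    "Morocco": "MAGHREB", "Algeria": "MAGHREB", "Tunisia": "MAGHREB",
    "Lebanon": "LEV", "Syria": "LEV", "Jordan": "LEV", "Palestine": "LEV",
    "Saudi Arabia": "GLF", "UAE": "GLF", "Kuwait": "GLF",
    "Egypt": "EGY",
    "MSA": "MSA",
}


def harmonize(examples):
    """Map (text, country) pairs to (text, region) pairs.

    Returns the mapped examples and the number of examples dropped
    because their country has no entry in COUNTRY_TO_REGION.
    """
    kept, dropped = [], 0
    for text, country in examples:
        region = COUNTRY_TO_REGION.get(country)
        if region is None:
            dropped += 1
            continue
        kept.append((text, region))
    return kept, dropped


# Placeholder texts; real inputs would be sentences from a source dataset.
sample = [("text a", "Egypt"), ("text b", "Sudan"), ("text c", "Kuwait")]
print(harmonize(sample))
```

Dropping unmapped countries rather than guessing a region keeps the merged labels conservative; a real pipeline would need an explicit decision for every source label.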

	---

	## 💡 Usage

	```python
	from transformers import AutoTokenizer, AutoModelForSequenceClassification
	import torch

	model_name = "IbrahimAmin/marbertv2-arabic-written-dialect-classifier"
	tokenizer = AutoTokenizer.from_pretrained(model_name)
	model = AutoModelForSequenceClassification.from_pretrained(model_name)

	text = "الدنيا مش مستاهلة تجري كده، خد وقتك واستمتع بالحاجة البسيطة"
	inputs = tokenizer(text, return_tensors="pt")

	# Run inference without tracking gradients
	with torch.inference_mode():
	    logits = model(**inputs).logits

	# Pick the index of the highest-scoring class
	pred = torch.argmax(logits, dim=-1).item()

	print(f"Predicted Dialect: {model.config.id2label[pred]}")
	```
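For more than one sentence, or when a confidence score is useful, the same calls can be wrapped in a small batch helper. The sketch below is one possible approach, not part of the model's API: `pick_label`, `classify_batch`, the `"UNCERTAIN"` fallback, and the 0.5 threshold are all illustrative choices.

```python
MODEL_NAME = "IbrahimAmin/marbertv2-arabic-written-dialect-classifier"


def pick_label(probs, id2label, threshold=0.5):
    """Return (label, prob) for the top class.

    Falls back to the hypothetical label "UNCERTAIN" when the top
    probability is below the threshold.
    """
    best = max(range(len(probs)), key=probs.__getitem__)
    label = id2label[best] if probs[best] >= threshold else "UNCERTAIN"
    return label, probs[best]


def classify_batch(texts, model_name=MODEL_NAME, threshold=0.5):
    """Classify a list of strings in one padded batch."""
    # Imported here so pick_label stays usable without torch installed.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)

    # Pad to the longest text in the batch and truncate over-long inputs.
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.inference_mode():
        probs = torch.softmax(model(**inputs).logits, dim=-1)

    id2label = model.config.id2label
    return [pick_label(row, id2label, threshold) for row in probs.tolist()]
```

Calling `classify_batch(["...", "..."])` with a list of Arabic sentences returns one `(label, probability)` pair per input. For repeated calls, loading the tokenizer and model once outside the function would avoid reloading them each time.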

	---

	## ✨ Acknowledgements

	- MARBERTv2 team at UBC-NLP
	- Contributors of the Arabic dialect datasets used in training

	---

	## 📝 Citation

	If you use this model in your research or application, please cite:

	```bibtex
	@misc{ibrahimamin_marbertv2_arabic_written_dialect_classifier,
	author = {Ibrahim Amin},
	title = {MARBERTv2 Arabic Written Dialect Classifier},
	year = {2025},
	publisher = {Hugging Face},
	howpublished = {\url{https://huggingface.co/IbrahimAmin/marbertv2-arabic-written-dialect-classifier}},
	}
	```