---
license: apache-2.0
language:
- ar
- arz
- ary
base_model:
- UBC-NLP/MARBERTv2
pipeline_tag: text-classification
library_name: transformers
---
# Model Card: marbertv2-arabic-written-dialect-classifier

- **Model Name:** marbertv2-arabic-written-dialect-classifier
- **Base Model:** UBC-NLP/MARBERTv2
- **Languages:** Arabic (dialectal + MSA)
- **Author:** IbrahimAmin
## Model Description
This model is fine-tuned from MARBERTv2, a transformer-based language model optimized for Arabic, on a multi-dialect classification task. It distinguishes among five major written Arabic dialect regions:
- MAGHREB (North African dialects)
- LEV (Levantine dialects)
- MSA (Modern Standard Arabic)
- GLF (Gulf dialects)
- EGY (Egyptian Arabic)
It is intended for dialect identification in short Arabic text snippets from various sources including social media, forums, and informal writing.
## Labels
The model predicts one of the following five classes:

```json
"id2label": {
  "0": "MAGHREB",
  "1": "LEV",
  "2": "MSA",
  "3": "GLF",
  "4": "EGY"
}
```
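For quick lookups in either direction, the same mapping can be mirrored as plain Python dicts (integer keys, matching the `id2label` entry in the model's `config.json`):

```python
# Mirror of the model's id2label config entry, with integer keys.
ID2LABEL = {0: "MAGHREB", 1: "LEV", 2: "MSA", 3: "GLF", 4: "EGY"}

# Inverse mapping, useful when preparing labeled training data.
LABEL2ID = {label: idx for idx, label in ID2LABEL.items()}

print(ID2LABEL[4])      # EGY
print(LABEL2ID["MSA"])  # 2
```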
## Training Data
The model was trained on approximately 850,000 Arabic sentences from 9 publicly available datasets, covering a wide variety of written Arabic dialects.

**Distribution by dialect:**

| Dialect | Count |
|---|---|
| GLF | 253,553 |
| LEV | 243,025 |
| MAGHREB | 140,887 |
| EGY | 105,226 |
| MSA | 83,231 |
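The distribution above is imbalanced: GLF has roughly three times as many sentences as MSA. One common way to compensate is inverse-frequency class weighting (e.g. for a weighted cross-entropy loss). The sketch below derives such weights from the counts in the table; the weighting scheme itself is illustrative, not necessarily what was used to train this model:

```python
# Sentence counts per dialect, taken from the distribution table above.
counts = {
    "GLF": 253_553,
    "LEV": 243_025,
    "MAGHREB": 140_887,
    "EGY": 105_226,
    "MSA": 83_231,
}

total = sum(counts.values())
num_classes = len(counts)

# Inverse-frequency weights, scaled so a perfectly balanced
# dataset would yield a weight of 1.0 for every class.
weights = {d: total / (num_classes * n) for d, n in counts.items()}

for dialect, w in sorted(weights.items(), key=lambda kv: kv[1]):
    print(f"{dialect:8s} {w:.3f}")
```

Rarer classes (MSA, EGY) receive weights above 1.0, so their examples contribute more to the loss than the over-represented GLF and LEV classes.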
## Training Details

- **Architecture:** MARBERTv2 (BERT-based)
- **Task:** Text Classification (Dialect Identification)
- **Objective:** Multi-class classification with softmax over 5 dialect classes
- **Tokenizer:** UBC-NLP/MARBERTv2
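As a rough illustration of this setup, the fine-tuning could be reproduced with the Hugging Face `Trainer` API along the lines below. The hyperparameters are assumptions for illustration only (this card does not report them), and the download/training calls are left commented so the sketch stays self-contained:

```python
# Label maps matching the Labels section of this card.
id2label = {0: "MAGHREB", 1: "LEV", 2: "MSA", 3: "GLF", 4: "EGY"}
label2id = {v: k for k, v in id2label.items()}

# from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
#                           TrainingArguments, Trainer)
#
# tokenizer = AutoTokenizer.from_pretrained("UBC-NLP/MARBERTv2")
# model = AutoModelForSequenceClassification.from_pretrained(
#     "UBC-NLP/MARBERTv2",
#     num_labels=5,        # softmax over 5 dialect classes
#     id2label=id2label,
#     label2id=label2id,
# )
#
# args = TrainingArguments(
#     output_dir="marbertv2-dialect",
#     learning_rate=2e-5,                  # assumed, not reported
#     per_device_train_batch_size=32,      # assumed, not reported
#     num_train_epochs=3,                  # assumed, not reported
# )
#
# train_ds / eval_ds would be tokenized splits of the 9 source datasets:
# Trainer(model=model, args=args,
#         train_dataset=train_ds, eval_dataset=eval_ds).train()

print(label2id)
```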
## Datasets Used
Below is a detailed overview of the datasets used in training and/or considered during development:
| Dataset | Brief Description | Annotation Strategy | Provided Labels | Current SOTA Performance | Notes |
|---|---|---|---|---|---|
| MADAR Subtask-1 (MADAR-6) | A collection of parallel sentences (BTEC) covering the dialects of 5 cities from the Arab world and MSA in the travel domain (10,000 sentences per city) | Manual | 5 Arab cities + MSA | 92.5% Accuracy | ADDED |
| MADAR Subtask-1 (MADAR-26) | A collection of parallel sentences (BTEC) covering the dialects of 25 cities from the Arab world and MSA in the travel domain (2,000 sentences per city) | Manual | 25 Arab cities + MSA | 67.32% F1-Score | ADDED |
| DART | 25K tweets annotated via crowdsourcing, well-balanced over five main groups of Arabic dialects | Manual | 5 Arab regions | UNK | ADDED |
| ArSarcasm v1 | 10,547 tweets from the ASTD and SemEval datasets for sarcasm detection, with dialect information added in | Manual | 4 Arab regions + MSA | UNK | ADDED |
| ArSarcasm v2 | 15,548 tweets extending the original ArSarcasm dataset (ArSarcasm v1 plus portions of the DAICT corpus and some new tweets) | Manual | 4 Arab regions + MSA | UNK | ADDED |
| IADD | Built by identifying, analyzing, and filtering five publicly available corpora (AOC, DART, PADIC, SHAMI and TSAC) | ________ | 5 regions and 9 countries | UNK | ADDED |
| QADI | 540K tweets (30K per country on average) with a total of 8.8M words | Automatic | 18 Arab countries | 60.6% | ADDED |
| AOC | The Arabic Online Commentary dataset, based on reader commentary from the online versions of three Arabic newspapers: Al-Ghad (Jordan), Al-Riyadh (Saudi Arabia), and Al-Youm Al-Sabe' (Egypt) | Manual | 3 Arab regions + MSA | UNK | ADDED |
| NADI-2020 | 25,957 tweets from 100 Arab provinces and 21 Arab countries | Automatic | 100 provinces and 21 countries | 6.39% (province), 26.78% (country) | ADDED |
## Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "IbrahimAmin/marbertv2-arabic-written-dialect-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Example sentence in Egyptian Arabic.
text = "الدنيا مش مستاهلة تجري كده، خد وقتك واستمتع بالحاجة البسيطة"

inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
pred = torch.argmax(outputs.logits, dim=1).item()
print(f"Predicted Dialect: {model.config.id2label[pred]}")
```
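To turn the raw logits into per-dialect probabilities rather than a single argmax, apply a softmax over the 5 classes. The pure-Python sketch below mirrors what `torch.softmax(outputs.logits, dim=1)` computes; the logit values are made up for illustration:

```python
import math

ID2LABEL = {0: "MAGHREB", 1: "LEV", 2: "MSA", 3: "GLF", 4: "EGY"}

def softmax(logits):
    """Numerically stable softmax over one row of class logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

# Hypothetical logits for one input sentence (one value per class).
logits = [-1.2, 0.3, -0.5, 0.1, 4.0]
probs = softmax(logits)

pred = max(range(len(probs)), key=probs.__getitem__)
print(ID2LABEL[pred], round(probs[pred], 3))
```

Reporting the full probability vector is often more useful than the top label alone, since short snippets can be genuinely ambiguous between neighboring dialect regions.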
## Acknowledgements
- MARBERTv2 team at UBC-NLP
- Contributors of the Arabic dialect datasets used in training