---
license: apache-2.0
language:
  - ar
  - arz
  - ary
base_model:
  - UBC-NLP/MARBERTv2
pipeline_tag: text-classification
library_name: transformers
---

๐Ÿท๏ธ Model Card: marbertv2-arabic-written-dialect-classifier

Model Name: marbertv2-arabic-written-dialect-classifier
Base Model: UBC-NLP/MARBERTv2
Languages: Arabic (dialectal + MSA)
Author: IbrahimAmin


## 📌 Model Description

This model is fine-tuned from MARBERTv2, a transformer-based language model optimized for Arabic, on a multi-dialect classification task. It distinguishes among five major written Arabic dialect regions:

- **MAGHREB** (North African dialects)
- **LEV** (Levantine dialects)
- **MSA** (Modern Standard Arabic)
- **GLF** (Gulf dialects)
- **EGY** (Egyptian Arabic)

It is intended for dialect identification in short Arabic text snippets from various sources including social media, forums, and informal writing.


## 📊 Labels

The model predicts one of the following five classes:

"id2label": {
  "0": "MAGHREB",
  "1": "LEV",
  "2": "MSA",
  "3": "GLF",
  "4": "EGY"
}
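
Downstream code can resolve a predicted class index through this mapping. A minimal sketch using a plain Python dict that mirrors the config entry above (and its inverse, as stored under `label2id` in `transformers` configs):

```python
# id2label mapping copied from the model config above
id2label = {0: "MAGHREB", 1: "LEV", 2: "MSA", 3: "GLF", 4: "EGY"}

# Inverse mapping, useful when preparing labeled training data
label2id = {label: idx for idx, label in id2label.items()}

pred_idx = 4  # e.g. the argmax over the model's 5 logits
print(id2label[pred_idx])   # EGY
print(label2id["MSA"])      # 2
```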

## 📚 Training Data

The model was trained on 825,922 Arabic sentences drawn from 9 publicly available datasets, covering a wide variety of written Arabic dialects.

Distribution by dialect:

| Dialect | Count |
|---------|-------:|
| GLF | 253,553 |
| LEV | 243,025 |
| MAGHREB | 140,887 |
| EGY | 105,226 |
| MSA | 83,231 |
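
The classes above are imbalanced (GLF has roughly 3× as many sentences as MSA). One common mitigation when re-training on data like this is inverse-frequency class weighting; the sketch below illustrates the idea and is not something the model card documents as having been used:

```python
# Per-dialect sentence counts from the table above
counts = {"GLF": 253_553, "LEV": 243_025, "MAGHREB": 140_887,
          "EGY": 105_226, "MSA": 83_231}

total = sum(counts.values())  # 825,922 sentences overall
n_classes = len(counts)

# Inverse-frequency weights, normalized so a perfectly balanced
# dataset would assign every class a weight of 1.0
weights = {d: total / (n_classes * c) for d, c in counts.items()}

for dialect, w in sorted(weights.items(), key=lambda kv: kv[1]):
    print(f"{dialect:8s} weight = {w:.2f}")
```

Rarer classes (MSA, EGY) receive proportionally larger weights, so their training examples contribute more to the loss.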

## ⚙️ Training Details

- **Architecture:** MARBERTv2 (BERT-based)
- **Task:** Text classification (dialect identification)
- **Objective:** Multi-class classification with softmax over 5 dialect classes
- **Tokenizer:** UBC-NLP/MARBERTv2
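
The objective above corresponds to standard cross-entropy over 5-way logits. A minimal sketch of that loss computation in PyTorch; the batch size, random logits, and labels are illustrative stand-ins, not the author's actual training code:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

num_classes = 5  # MAGHREB, LEV, MSA, GLF, EGY
batch_logits = torch.randn(8, num_classes)        # stand-in for model(**batch).logits
gold_labels = torch.randint(0, num_classes, (8,))  # stand-in for dialect labels

# CrossEntropyLoss applies log-softmax internally, so raw logits go in directly
loss = nn.CrossEntropyLoss()(batch_logits, gold_labels)
print(f"loss = {loss.item():.4f}")
```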

## 📂 Datasets Used

Below is a detailed overview of the datasets used in training and/or considered during development:

| Dataset | Brief Description | Annotation Strategy | Provided Labels | Current SOTA Performance | Notes |
|---------|-------------------|---------------------|-----------------|--------------------------|-------|
| MADAR Subtask-1 (MADAR-6) | A collection of parallel sentences (BTEC) covering the dialects of 5 cities from the Arab world and MSA in the travel domain (10,000 sentences per city) | Manual | 5 Arab cities + MSA | 92.5% accuracy | ADDED |
| MADAR Subtask-1 (MADAR-26) | A collection of parallel sentences (BTEC) covering the dialects of 25 cities from the Arab world and MSA in the travel domain (2,000 sentences per city) | Manual | 25 Arab cities + MSA | 67.32% F1-score | ADDED |
| DART | 25K tweets annotated via crowdsourcing, well balanced over five main groups of Arabic dialects | Manual | 5 Arab regions | UNK | ADDED |
| ArSarcasm v1 | 10,547 tweets from the ASTD and SemEval datasets for sarcasm detection, with dialect information added in | Manual | 4 Arab regions + MSA | UNK | ADDED |
| ArSarcasm v2 | 15,548 tweets; an extension of the original ArSarcasm dataset (consists of ArSarcasm v1 along with portions of the DAICT corpus and some new tweets) | Manual | 4 Arab regions + MSA | UNK | ADDED |
| IADD | Five publicly available corpora were identified, analyzed and filtered to build IADD (AOC, DART, PADIC, SHAMI and TSAC) | ________ | 5 regions and 9 countries | UNK | ADDED |
| QADI | 540k tweets (30k per country on average) with a total of 8.8M words | Automatic | 18 Arab countries | 60.6% | ADDED |
| AOC | The Arabic Online Commentary dataset, based on reader commentary from the online versions of three Arabic newspapers: Al-Ghad from JOR, Al-Riyadh from KSA, and Al-Youm Al-Sabe' from EGY | Manual | 3 Arab regions + MSA | UNK | ADDED |
| NADI-2020 | 25,957 tweets from 100 Arab provinces and 21 Arab countries | Automatic | 100 provinces and 21 countries | 6.39%, 26.78% | ADDED |

## 💡 Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "IbrahimAmin/marbertv2-arabic-written-dialect-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

# Egyptian Arabic: "Life isn't worth rushing through like that; take your time and enjoy the simple things"
text = "الدنيا مش مستاهلة تجري كده، خد وقتك واستمتع بالحاجة البسيطة"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
pred = torch.argmax(outputs.logits, dim=-1).item()

print(f"Predicted Dialect: {model.config.id2label[pred]}")
```
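
To inspect the full distribution over dialects rather than just the top class, apply a softmax to the logits. In the sketch below, hand-picked dummy logits stand in for `outputs.logits` so it runs without downloading the model:

```python
import torch

id2label = {0: "MAGHREB", 1: "LEV", 2: "MSA", 3: "GLF", 4: "EGY"}

# Dummy logits standing in for outputs.logits from the snippet above
logits = torch.tensor([[-1.2, 0.3, -0.5, 0.1, 2.4]])
probs = torch.softmax(logits, dim=-1)[0]  # normalize into probabilities

# Print dialects from most to least likely
for idx, p in sorted(enumerate(probs.tolist()), key=lambda kv: -kv[1]):
    print(f"{id2label[idx]:8s} {p:.3f}")
```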

## ✨ Acknowledgements

- The MARBERTv2 team at UBC-NLP
- Contributors of the Arabic dialect datasets used in training