---
language:
- ar
license: cc-by-nc-4.0
tags:
- bert
- Arabic BERT
- Levantine Dialect
- Shami
- Syrian Arabic
- Lebanese Arabic
- Jordanian Arabic
- Palestinian Arabic
- Masked Language Model
- Arabic NLP
datasets:
- QCRI/arabic_pos_dialect
- guymorlan/levanti
base_model: aubmindlab/bert-base-arabertv02-twitter
pipeline_tag: fill-mask
---

# ShamiBERT 🇸🇾🇱🇧🇯🇴🇵🇸

**ShamiBERT** is a BERT-based language model specialized for **Levantine Arabic (الشامية)**, the dialect spoken in Syria, Lebanon, Jordan, and Palestine.
## Model Description

ShamiBERT was created by **continual pre-training** of `aubmindlab/bert-base-arabertv02-twitter` with a Masked Language Modeling (MLM) objective on Levantine Arabic text.
### Architecture
- **Base Model**: AraBERTv0.2-Twitter (`aubmindlab/bert-base-arabertv02-twitter`)
- **Architecture**: BERT-base (12 layers, 12 attention heads, hidden size 768)
- **Task**: Masked Language Modeling (MLM)
- **Training**: Continual pre-training on Levantine dialect data

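The card does not show the exact masking collator used, but BERT's standard MLM corruption recipe (which `DataCollatorForLanguageModeling` in `transformers` implements) can be sketched in plain Python. Of the positions selected for prediction, 80% become `[MASK]`, 10% become a random token, and 10% stay unchanged; all names below are illustrative:

```python
import random

def mask_tokens(token_ids, vocab_size, mask_id, mlm_probability=0.15, seed=0):
    """Return (corrupted_ids, labels) following BERT's MLM recipe:
    select mlm_probability of positions; of those, 80% -> [MASK],
    10% -> a random token, 10% -> left unchanged.
    Labels are -100 (ignored by the loss) at unselected positions."""
    rng = random.Random(seed)
    corrupted = list(token_ids)
    labels = [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() < mlm_probability:
            labels[i] = tok            # predict the original token here
            roll = rng.random()
            if roll < 0.8:
                corrupted[i] = mask_id
            elif roll < 0.9:
                corrupted[i] = rng.randrange(vocab_size)
            # else: keep the original token (model must still predict it)
    return corrupted, labels

ids = list(range(100, 200))  # dummy token ids for illustration
corrupted, labels = mask_tokens(ids, vocab_size=64000, mask_id=4)
```

With `mlm_probability=0.15`, roughly 15 of the 100 positions above end up selected for prediction, matching the MLM probability listed in the training details.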
### Why AraBERT-Twitter as the base?
1. Pre-trained on 77GB of Arabic text plus 60M Arabic tweets
2. Already handles dialectal Arabic and social-media text
3. Includes emojis in its vocabulary
4. Strong foundation for dialect-specific adaptation

## Training Data

ShamiBERT was trained on a combination of Levantine Arabic datasets:

| Dataset | Source | Description |
|---------|--------|-------------|
| QCRI Arabic POS (LEV) | HuggingFace | Levantine tweets with POS tags |
| Levanti | HuggingFace | Palestinian, Syrian, Lebanese, and Jordanian sentences |
| Curated Shami | Manual | Hand-curated Levantine expressions and phrases |

### Training Details
- **Epochs**: 5
- **Learning Rate**: 2e-05
- **Batch Size**: 128 (effective)
- **Max Sequence Length**: 128
- **MLM Probability**: 0.15
- **Optimizer**: AdamW (β1=0.9, β2=0.999, ε=1e-6)
- **Weight Decay**: 0.01
- **Warmup**: 10% of steps
- **Eval Perplexity**: 5.04

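A few of these numbers relate by simple arithmetic: perplexity is the exponential of the mean cross-entropy loss, an *effective* batch of 128 is typically reached via gradient accumulation, and "10% warmup" usually means linear warmup followed by linear decay. The per-device split and the decay shape below are assumptions for illustration, not details taken from the actual training run:

```python
import math

# Perplexity is exp(mean cross-entropy loss), so an eval perplexity
# of 5.04 corresponds to a mean eval MLM loss of ln(5.04) ~= 1.617 nats.
eval_loss = math.log(5.04)

# Effective batch = per_device_batch * grad_accumulation_steps * n_devices.
# The split below is an assumed example, not the actual training config.
per_device, accum_steps, n_devices = 32, 4, 1
effective_batch = per_device * accum_steps * n_devices

def lr_at(step, total_steps, peak_lr=2e-5, warmup_frac=0.10):
    """Linear warmup over the first 10% of steps, then linear decay
    to zero -- a common schedule for continual pre-training (the decay
    shape is assumed; the card states only '10% warmup')."""
    warmup = int(total_steps * warmup_frac)
    if step < warmup:
        return peak_lr * step / warmup
    return peak_lr * (total_steps - step) / (total_steps - warmup)

print(round(eval_loss, 3))   # ~1.617
print(effective_batch)       # 128
print(lr_at(50, 1000))       # halfway through warmup -> 1e-05
```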
## Usage

### Fill-Mask (تعبئة القناع)
```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="mabahboh/ShamiBERT")

# Shami example: "How are you [MASK], thank God"
results = fill_mask("كيفك [MASK] الحمدلله")
for r in results[:3]:
    print(f"{r['token_str']} ({r['score']:.4f})")
```

### Feature Extraction (for downstream tasks)
```python
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("mabahboh/ShamiBERT")
model = AutoModel.from_pretrained("mabahboh/ShamiBERT")

# "What's up, man" in Levantine Arabic
text = "شو أخبارك يا زلمة"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Use the [CLS] token embedding for classification
cls_embedding = outputs.last_hidden_state[:, 0, :]
```
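The `[CLS]` vector is one choice; for sentence-level features, mean pooling over non-padding tokens (weighted by the attention mask) is a common alternative. A framework-free sketch of the pooling arithmetic — in practice you would apply the same idea to `outputs.last_hidden_state` with torch, and the helper name here is illustrative:

```python
def mean_pool(hidden_states, attention_mask):
    """Average the token vectors, counting only positions where the
    attention mask is 1 (i.e., ignoring padding tokens)."""
    dim = len(hidden_states[0])
    totals = [0.0] * dim
    count = 0
    for vec, keep in zip(hidden_states, attention_mask):
        if keep:
            count += 1
            for j, v in enumerate(vec):
                totals[j] += v
    return [t / count for t in totals]

# Toy example: 3 tokens (the last one is padding), hidden size 2.
hidden = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
print(mean_pool(hidden, mask))  # -> [2.0, 3.0]; the padding row is ignored
```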

### Preprocessing (recommended)
```python
from arabert.preprocess import ArabertPreprocessor

# Apply the same preprocessing as the base model, keeping emojis
prep = ArabertPreprocessor(model_name="bert-base-arabertv02-twitter", keep_emojis=True)
text = prep.preprocess("كيفك يا خيي")  # "How are you, my brother"
```

## Intended Uses

ShamiBERT is designed for NLP tasks involving the Levantine Arabic dialect:
- **Sentiment Analysis** of Levantine social media
- **Text Classification** (topic, dialect sub-identification)
- **Named Entity Recognition** in Shami text
- **Feature Extraction** for downstream tasks
- **Dialect Identification** (Levantine vs. other Arabic dialects)

## Limitations

- Training data is limited compared to large-scale dialect models such as SaudiBERT (26.3GB)
- Performance may vary across sub-dialects (Syrian vs. Lebanese vs. Jordanian vs. Palestinian)
- The base AraBERT-Twitter model was pre-trained with max_length=64, which may affect quality on longer sequences
- Not suitable for MSA-heavy or non-Levantine dialect tasks

## Citation

```bibtex
@misc{shamibert2026,
  title={ShamiBERT: A BERT Model for Levantine Arabic Dialect},
  year={2026},
  note={Continual pre-training of AraBERT-Twitter on Levantine Arabic data}
}
```

## Acknowledgments

- **AraBERT** team (AUB MIND Lab) for the base model
- **ArSyra** team for Levantine dialect data
- **QCRI** for Arabic dialect resources
- **Unsloth** team for training optimizations