---
language:
- ar
license: cc-by-nc-4.0
tags:
- bert
- Arabic BERT
- Levantine Dialect
- Shami
- Syrian Arabic
- Lebanese Arabic
- Jordanian Arabic
- Palestinian Arabic
- Masked Language Model
- Arabic NLP
datasets:
- QCRI/arabic_pos_dialect
- guymorlan/levanti
base_model: aubmindlab/bert-base-arabertv02-twitter
pipeline_tag: fill-mask
---
# ShamiBERT 🇸🇾🇱🇧🇯🇴🇵🇸
ShamiBERT is a BERT-based language model specialized for Levantine Arabic (الشامية) — the dialect spoken in Syria, Lebanon, Jordan, and Palestine.
## Model Description
ShamiBERT was created by performing continual pre-training on aubmindlab/bert-base-arabertv02-twitter using Masked Language Modeling (MLM) on Levantine Arabic text data.
### Architecture
- Base Model: AraBERTv0.2-Twitter (aubmindlab/bert-base-arabertv02-twitter)
- Architecture: BERT-base (12 layers, 12 attention heads, 768 hidden size)
- Task: Masked Language Modeling (MLM)
- Training: Continual pre-training on Levantine dialect data
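The MLM objective used for continual pre-training can be illustrated with a small, dependency-free sketch of BERT-style masking: 15% of token positions become prediction targets, and of those, 80% are replaced with `[MASK]`, 10% with a random vocabulary token, and 10% are left unchanged. This is a toy illustration of the recipe (matching the MLM probability listed under Training Details), not the actual training code:

```python
import random

def mask_tokens(tokens, vocab, mlm_prob=0.15, mask_token="[MASK]", rng=None):
    """BERT-style masking: select mlm_prob of positions as prediction targets,
    then replace 80% with [MASK], 10% with a random token, 10% unchanged."""
    rng = rng or random.Random()
    inputs, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() < mlm_prob:
            labels[i] = tok                    # model must predict the original token
            roll = rng.random()
            if roll < 0.8:
                inputs[i] = mask_token         # 80%: replace with [MASK]
            elif roll < 0.9:
                inputs[i] = rng.choice(vocab)  # 10%: replace with a random token
            # else 10%: keep the original token (label still set)
    return inputs, labels

# Toy demo on a short Shami sentence, using the sentence itself as a stand-in vocab
tokens = "كيفك يا زلمة الحمدلله تمام شو الاخبار".split()
inputs, labels = mask_tokens(tokens, vocab=tokens, rng=random.Random(1))
print(inputs)
print(labels)
```

In practice this is what `transformers`' `DataCollatorForLanguageModeling(mlm_probability=0.15)` does at the token-ID level on each batch.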
### Why AraBERT-Twitter as base?
- Pre-trained on 77GB Arabic text + 60M Arabic tweets
- Already handles dialectal Arabic and social media text
- Supports emojis in vocabulary
- Strong foundation for dialect-specific adaptation
## Training Data
ShamiBERT was trained on a combination of Levantine Arabic datasets:
| Dataset | Source | Description |
|---|---|---|
| QCRI Arabic POS (LEV) | HuggingFace | Levantine tweets with POS tags |
| Levanti | HuggingFace | Palestinian/Syrian/Lebanese/Jordanian sentences |
| Curated Shami | Manual | Hand-curated Levantine expressions and phrases |
## Training Details
- Epochs: 5
- Learning Rate: 2e-05
- Batch Size: 128 (effective)
- Max Sequence Length: 128
- MLM Probability: 0.15
- Optimizer: AdamW (β1=0.9, β2=0.999, ε=1e-6)
- Weight Decay: 0.01
- Warmup: 10%
- Eval Perplexity: 5.04
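The 10% warmup with a peak learning rate of 2e-5 corresponds to the common linear warmup-then-linear-decay schedule (the exact scheduler used here is an assumption); a minimal sketch of the arithmetic:

```python
def lr_at_step(step, total_steps, peak_lr=2e-5, warmup_frac=0.10):
    """Linear warmup to peak_lr over the first warmup_frac of training steps,
    then linear decay to zero (as in transformers' get_linear_schedule_with_warmup)."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    return peak_lr * max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

total = 10_000
print(lr_at_step(500, total))     # halfway through warmup: 1e-05
print(lr_at_step(1_000, total))   # end of warmup: peak 2e-05
print(lr_at_step(10_000, total))  # end of training: 0.0
```

For reference, the reported eval perplexity of 5.04 is just the exponential of the mean cross-entropy loss, i.e. a held-out MLM loss of ln(5.04) ≈ 1.62.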
## Usage

### Fill-Mask (تعبئة القناع)

```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="mabahboh/ShamiBERT")

# Shami example: "How are you, [MASK], praise be to God"
results = fill_mask("كيفك [MASK] الحمدلله")
for r in results[:3]:
    print(f"{r['token_str']} ({r['score']:.4f})")
```
### Feature Extraction (for downstream tasks)

```python
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("mabahboh/ShamiBERT")
model = AutoModel.from_pretrained("mabahboh/ShamiBERT")

text = "شو أخبارك يا زلمة"  # "What's up, man?"
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)

# Use the [CLS] token embedding for classification
cls_embedding = outputs.last_hidden_state[:, 0, :]
```
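An alternative to the `[CLS]` vector is mean pooling over token embeddings, weighted by the attention mask so that padding positions are ignored. The arithmetic is shown here dependency-free for clarity (with batched torch tensors you would use masked sums instead):

```python
def masked_mean_pool(hidden_states, attention_mask):
    """Average the token vectors where attention_mask == 1, excluding padding.
    hidden_states: list of token vectors (lists of floats); attention_mask: 0/1 ints."""
    dim = len(hidden_states[0])
    sums, count = [0.0] * dim, 0
    for vec, keep in zip(hidden_states, attention_mask):
        if keep:
            count += 1
            for j, v in enumerate(vec):
                sums[j] += v
    return [s / max(1, count) for s in sums]

# Toy sequence: two real tokens and one padding position
hidden = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
print(masked_mean_pool(hidden, mask))  # -> [2.0, 3.0]
```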
### Preprocessing (recommended)

```python
from arabert.preprocess import ArabertPreprocessor

prep = ArabertPreprocessor(model_name="bert-base-arabertv02-twitter", keep_emojis=True)
text = prep.preprocess("كيفك يا خيي")  # "How are you, brother?"
```
## Intended Uses
ShamiBERT is designed for NLP tasks involving Levantine Arabic dialect:
- Sentiment Analysis of Levantine social media
- Text Classification (topic, dialect sub-identification)
- Named Entity Recognition in Shami text
- Feature Extraction for downstream tasks
- Dialect Identification (Levantine vs other Arabic dialects)
## Limitations

- Training data is limited compared to large-scale dialect models such as SaudiBERT (26.3 GB)
- Performance may vary across sub-dialects (Syrian vs. Lebanese vs. Jordanian vs. Palestinian)
- Based on AraBERT-Twitter, which was pre-trained with max_length=64
- Not suitable for MSA-heavy or non-Levantine dialect tasks
## Citation

```bibtex
@misc{shamibert2026,
  title={ShamiBERT: A BERT Model for Levantine Arabic Dialect},
  year={2026},
  note={Continual pre-training of AraBERT-Twitter on Levantine Arabic data}
}
```
## Acknowledgments
- AraBERT team (AUB MIND Lab) for the base model
- ArSyra team for Levantine dialect data
- QCRI for Arabic dialect resources
- Unsloth team for training optimizations