---
language:
- ar
license: cc-by-nc-4.0
tags:
- bert
- Arabic BERT
- Levantine Dialect
- Shami
- Syrian Arabic
- Lebanese Arabic
- Jordanian Arabic
- Palestinian Arabic
- Masked Language Model
- Arabic NLP
datasets:
- QCRI/arabic_pos_dialect
- guymorlan/levanti
base_model: aubmindlab/bert-base-arabertv02-twitter
pipeline_tag: fill-mask
---

# ShamiBERT 🇸🇾🇱🇧🇯🇴🇵🇸

**ShamiBERT** is a BERT-based language model specialized for **Levantine Arabic (الشامية)** — the dialect spoken in Syria, Lebanon, Jordan, and Palestine.

## Model Description

ShamiBERT was created by **continual pre-training** of `aubmindlab/bert-base-arabertv02-twitter` with a Masked Language Modeling (MLM) objective on Levantine Arabic text.

### Architecture

- **Base Model**: AraBERTv0.2-Twitter (`aubmindlab/bert-base-arabertv02-twitter`)
- **Architecture**: BERT-base (12 layers, 12 attention heads, 768 hidden size)
- **Task**: Masked Language Modeling (MLM)
- **Training**: Continual pre-training on Levantine dialect data

### Why AraBERT-Twitter as the base?

1. Pre-trained on 77GB of Arabic text plus 60M Arabic tweets
2. Already handles dialectal Arabic and social-media text
3. Supports emojis in its vocabulary
4. Strong foundation for dialect-specific adaptation

## Training Data

ShamiBERT was trained on a combination of Levantine Arabic datasets:

| Dataset | Source | Description |
|---------|--------|-------------|
| QCRI Arabic POS (LEV) | HuggingFace | Levantine tweets with POS tags |
| Levanti | HuggingFace | Palestinian/Syrian/Lebanese/Jordanian sentences |
| Curated Shami | Manual | Hand-curated Levantine expressions and phrases |

### Training Details

- **Epochs**: 5
- **Learning Rate**: 2e-05
- **Batch Size**: 128 (effective)
- **Max Sequence Length**: 128
- **MLM Probability**: 0.15
- **Optimizer**: AdamW (β1=0.9, β2=0.999, ε=1e-6)
- **Weight Decay**: 0.01
- **Warmup**: 10% of training steps
- **Eval Perplexity**: 5.04

## Usage

### Fill-Mask (تعبئة القناع)

```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="mabahboh/ShamiBERT")

# Shami example: "How are you [MASK], praise God"
results = fill_mask("كيفك [MASK] الحمدلله")
for r in results[:3]:
    print(f"{r['token_str']} ({r['score']:.4f})")
```

### Feature Extraction (for downstream tasks)

```python
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("mabahboh/ShamiBERT")
model = AutoModel.from_pretrained("mabahboh/ShamiBERT")

text = "شو أخبارك يا زلمة"  # "What's up, man?"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Use the [CLS] token embedding for classification
cls_embedding = outputs.last_hidden_state[:, 0, :]
```

### Preprocessing (recommended)

```python
from arabert.preprocess import ArabertPreprocessor

prep = ArabertPreprocessor(model_name="bert-base-arabertv02-twitter", keep_emojis=True)
text = prep.preprocess("كيفك يا خيي")  # "How are you, brother"
```

## Intended Uses

ShamiBERT is designed for NLP tasks involving the Levantine Arabic dialect:

- **Sentiment Analysis** of Levantine social media
- **Text Classification** (e.g., topic or sub-dialect classification)
- **Named Entity Recognition** in Shami text
- **Feature Extraction** for downstream tasks
- **Dialect Identification** (Levantine vs. other Arabic dialects)

## Limitations

- Training data is limited compared to large-scale dialect models such as SaudiBERT (26.3GB)
- Performance may vary across sub-dialects (Syrian vs. Lebanese vs. Jordanian vs. Palestinian)
- The base AraBERT-Twitter model was pre-trained with max_length=64, so longer inputs may be handled less reliably
- Not suitable for MSA-heavy or non-Levantine dialect tasks

## Citation

```bibtex
@misc{shamibert2026,
  title={ShamiBERT: A BERT Model for Levantine Arabic Dialect},
  year={2026},
  note={Continual pre-training of AraBERT-Twitter on Levantine Arabic data}
}
```

## Acknowledgments

- **AraBERT** team (AUB MIND Lab) for the base model
- **ArSyra** team for Levantine dialect data
- **QCRI** for Arabic dialect resources
- **Unsloth** team for training optimizations
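
## Appendix: MLM Masking Sketch

The **MLM Probability** of 0.15 listed under Training Details follows the standard BERT masking recipe. The sketch below is illustrative only — it mirrors the default behavior of `transformers`' `DataCollatorForLanguageModeling` (the 80/10/10 replacement split is assumed, not stated by this card) and is not the repository's actual training code:

```python
import random

def mlm_mask(token_ids, mask_token_id, vocab_size, mlm_prob=0.15, rng=None):
    """BERT-style masking sketch (assumed 80/10/10 split, not this repo's code).

    Selects ~mlm_prob of positions; of those, 80% become [MASK], 10% are
    replaced with a random token, and 10% are left unchanged. Labels are
    -100 (ignored by the loss) everywhere except the selected positions,
    where they hold the original token the model must recover.
    """
    rng = rng or random.Random(0)  # fixed seed for reproducibility
    input_ids = list(token_ids)
    labels = [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() < mlm_prob:
            labels[i] = tok  # model is trained to predict the original token
            roll = rng.random()
            if roll < 0.8:
                input_ids[i] = mask_token_id        # 80%: replace with [MASK]
            elif roll < 0.9:
                input_ids[i] = rng.randrange(vocab_size)  # 10%: random token
            # else: 10%: keep the token in place (but still predict it)
    return input_ids, labels

ids, labels = mlm_mask(list(range(100, 140)), mask_token_id=4, vocab_size=64000)
```

Only the positions with a label other than -100 contribute to the MLM loss, which is what the eval perplexity above is computed from.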