---
language:
- ar
license: cc-by-nc-4.0
tags:
- bert
- Arabic BERT
- Levantine Dialect
- Shami
- Syrian Arabic
- Lebanese Arabic
- Jordanian Arabic
- Palestinian Arabic
- Masked Language Model
- Arabic NLP
datasets:
- QCRI/arabic_pos_dialect
- guymorlan/levanti
base_model: aubmindlab/bert-base-arabertv02-twitter
pipeline_tag: fill-mask
---

# ShamiBERT 🇸🇾🇱🇧🇯🇴🇵🇸

**ShamiBERT** is a BERT-based language model specialized for **Levantine Arabic (الشامية)** — the dialect spoken in Syria, Lebanon, Jordan, and Palestine.

## Model Description

ShamiBERT was created by **continual pre-training** of `aubmindlab/bert-base-arabertv02-twitter` with a Masked Language Modeling (MLM) objective on Levantine Arabic text.

### Architecture

- **Base Model**: AraBERTv0.2-Twitter (`aubmindlab/bert-base-arabertv02-twitter`)
- **Architecture**: BERT-base (12 layers, 12 attention heads, 768 hidden size)
- **Task**: Masked Language Modeling (MLM)
- **Training**: Continual pre-training on Levantine dialect data

### Why AraBERT-Twitter as the base?

1. Pre-trained on 77GB of Arabic text plus 60M Arabic tweets
2. Already handles dialectal Arabic and social-media text
3. Supports emojis in its vocabulary
4. Strong foundation for dialect-specific adaptation

## Training Data

ShamiBERT was trained on a combination of Levantine Arabic datasets:

| Dataset | Source | Description |
|---------|--------|-------------|
| QCRI Arabic POS (LEV) | HuggingFace | Levantine tweets with POS tags |
| Levanti | HuggingFace | Palestinian/Syrian/Lebanese/Jordanian sentences |
| Curated Shami | Manual | Hand-curated Levantine expressions and phrases |

### Training Details

- **Epochs**: 5
- **Learning Rate**: 2e-05
- **Batch Size**: 128 (effective)
- **Max Sequence Length**: 128
- **MLM Probability**: 0.15
- **Optimizer**: AdamW (β1=0.9, β2=0.999, ε=1e-6)
- **Weight Decay**: 0.01
- **Warmup**: 10% of training steps
- **Eval Perplexity**: 5.04

## Usage

### Fill-Mask (تعبئة القناع)

```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="mabahboh/ShamiBERT")

# Shami example: "How are you [MASK], praise God"
results = fill_mask("كيفك [MASK] الحمدلله")
for r in results[:3]:
    print(f"{r['token_str']} ({r['score']:.4f})")
```

### Feature Extraction (for downstream tasks)

```python
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("mabahboh/ShamiBERT")
model = AutoModel.from_pretrained("mabahboh/ShamiBERT")

text = "شو أخبارك يا زلمة"  # "What's up, man?"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Use the [CLS] token embedding for classification
cls_embedding = outputs.last_hidden_state[:, 0, :]
```

### Preprocessing (recommended)

```python
from arabert.preprocess import ArabertPreprocessor

prep = ArabertPreprocessor(model_name="bert-base-arabertv02-twitter", keep_emojis=True)
text = prep.preprocess("كيفك يا خيي")  # "How are you, brother"
```

## Intended Uses

ShamiBERT is designed for NLP tasks involving the Levantine Arabic dialect:

- **Sentiment Analysis** of Levantine social media
- **Text Classification** (e.g., topic or sub-dialect classification)
- **Named Entity Recognition** in Shami text
- **Feature Extraction** for downstream tasks
- **Dialect Identification** (Levantine vs. other Arabic dialects)

## Limitations

- Training data is limited compared to large-scale dialect models such as SaudiBERT (26.3GB)
- Performance may vary across sub-dialects (Syrian vs. Lebanese vs. Jordanian vs. Palestinian)
- The base AraBERT-Twitter model was pre-trained with max_length=64, so longer inputs may be handled less reliably
- Not suitable for MSA-heavy or non-Levantine dialect tasks

## Citation

```bibtex
@misc{shamibert2026,
  title={ShamiBERT: A BERT Model for Levantine Arabic Dialect},
  year={2026},
  note={Continual pre-training of AraBERT-Twitter on Levantine Arabic data}
}
```

## Acknowledgments

- **AraBERT** team (AUB MIND Lab) for the base model
- **ArSyra** team for Levantine dialect data
- **QCRI** for Arabic dialect resources
- **Unsloth** team for training optimizations
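
## Appendix: MLM Masking Sketch

The **MLM Probability** of 0.15 listed under Training Details follows the standard BERT masking recipe. The sketch below is illustrative only — it mirrors the default behavior of `transformers`' `DataCollatorForLanguageModeling` (the 80/10/10 replacement split is assumed, not stated by this card) and is not the repository's actual training code:

```python
import random

def mlm_mask(token_ids, mask_token_id, vocab_size, mlm_prob=0.15, rng=None):
    """BERT-style masking sketch (assumed 80/10/10 split, not this repo's code).

    Selects ~mlm_prob of positions; of those, 80% become [MASK], 10% are
    replaced with a random token, and 10% are left unchanged. Labels are
    -100 (ignored by the loss) everywhere except the selected positions,
    where they hold the original token the model must recover.
    """
    rng = rng or random.Random(0)  # fixed seed for reproducibility
    input_ids = list(token_ids)
    labels = [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() < mlm_prob:
            labels[i] = tok  # model is trained to predict the original token
            roll = rng.random()
            if roll < 0.8:
                input_ids[i] = mask_token_id        # 80%: replace with [MASK]
            elif roll < 0.9:
                input_ids[i] = rng.randrange(vocab_size)  # 10%: random token
            # else: 10%: keep the token in place (but still predict it)
    return input_ids, labels

ids, labels = mlm_mask(list(range(100, 140)), mask_token_id=4, vocab_size=64000)
```

Only the positions with a label other than -100 contribute to the MLM loss, which is what the eval perplexity above is computed from.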