---
language:
- ar
license: cc-by-nc-4.0
tags:
- bert
- Arabic BERT
- Levantine Dialect
- Shami
- Syrian Arabic
- Lebanese Arabic
- Jordanian Arabic
- Palestinian Arabic
- Masked Language Model
- Arabic NLP
datasets:
- QCRI/arabic_pos_dialect
- guymorlan/levanti
base_model: aubmindlab/bert-base-arabertv02-twitter
pipeline_tag: fill-mask
---
# ShamiBERT 🇸🇾🇱🇧🇯🇴🇵🇸
**ShamiBERT** is a BERT-based language model specialized for **Levantine Arabic (الشامية)** — the dialect spoken in Syria, Lebanon, Jordan, and Palestine.
## Model Description
ShamiBERT was created by performing **continual pre-training** on `aubmindlab/bert-base-arabertv02-twitter` using Masked Language Modeling (MLM) on Levantine Arabic text data.
### Architecture
- **Base Model**: AraBERTv0.2-Twitter (aubmindlab/bert-base-arabertv02-twitter)
- **Architecture**: BERT-base (12 layers, 12 attention heads, 768 hidden size)
- **Task**: Masked Language Modeling (MLM)
- **Training**: Continual pre-training on Levantine dialect data
### Why AraBERT-Twitter as base?
1. Pre-trained on 77GB Arabic text + 60M Arabic tweets
2. Already handles dialectal Arabic and social media text
3. Supports emojis in vocabulary
4. Strong foundation for dialect-specific adaptation
## Training Data
ShamiBERT was trained on a combination of Levantine Arabic datasets:
| Dataset | Source | Description |
|---------|--------|-------------|
| QCRI Arabic POS (LEV) | HuggingFace | Levantine tweets with POS tags |
| Levanti | HuggingFace | Palestinian/Syrian/Lebanese/Jordanian sentences |
| Curated Shami | Manual | Hand-curated Levantine expressions and phrases |
### Training Details
- **Epochs**: 5
- **Learning Rate**: 2e-05
- **Batch Size**: 128 (effective)
- **Max Sequence Length**: 128
- **MLM Probability**: 0.15
- **Optimizer**: AdamW (β1=0.9, β2=0.999, ε=1e-6)
- **Weight Decay**: 0.01
- **Warmup**: 10%
- **Eval Perplexity**: 5.04
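For intuition about the **MLM Probability: 0.15** setting above, here is a minimal pure-Python sketch of BERT's standard masking rule (the `mlm_mask` helper is illustrative, not the actual training code; in practice 🤗 Transformers' `DataCollatorForLanguageModeling(mlm_probability=0.15)` does this):

```python
import random

def mlm_mask(token_ids, mask_id, vocab_size, mlm_prob=0.15, seed=0):
    """Each token is selected with probability mlm_prob. Of the selected
    tokens: 80% become [MASK], 10% become a random vocabulary id, and
    10% are left unchanged. Labels are -100 (ignored by the loss) at
    unselected positions."""
    rng = random.Random(seed)
    inputs, labels = list(token_ids), []
    for i, tok in enumerate(token_ids):
        if rng.random() < mlm_prob:
            labels.append(tok)  # model must predict the original token
            roll = rng.random()
            if roll < 0.8:
                inputs[i] = mask_id                    # 80%: [MASK]
            elif roll < 0.9:
                inputs[i] = rng.randrange(vocab_size)  # 10%: random token
            # else: 10% keep the original token
        else:
            labels.append(-100)  # position excluded from the loss
    return inputs, labels
```

The 80/10/10 split forces the model to build useful representations even for tokens that are not replaced by `[MASK]`.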
## Usage
### Fill-Mask (تعبئة القناع)
```python
from transformers import pipeline
fill_mask = pipeline("fill-mask", model="mabahboh/ShamiBERT")
# Shami examples
results = fill_mask("كيفك [MASK] الحمدلله")
for r in results[:3]:
    print(f"{r['token_str']} ({r['score']:.4f})")
```
### Feature Extraction (for downstream tasks)
```python
from transformers import AutoTokenizer, AutoModel
import torch
tokenizer = AutoTokenizer.from_pretrained("mabahboh/ShamiBERT")
model = AutoModel.from_pretrained("mabahboh/ShamiBERT")
text = "شو أخبارك يا زلمة"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
# Use the [CLS] token embedding for classification
cls_embedding = outputs.last_hidden_state[:, 0, :]
```
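The `[CLS]` embedding is one option; for sentence-level tasks, attention-mask-weighted mean pooling over all token states is a common alternative. A minimal sketch on dummy tensors (the `mean_pool` function is illustrative; with the real model you would pass `outputs.last_hidden_state` and `inputs["attention_mask"]`):

```python
import torch

def mean_pool(last_hidden_state, attention_mask):
    # Average token embeddings, counting only real (non-padding) tokens.
    mask = attention_mask.unsqueeze(-1).float()      # (batch, seq, 1)
    summed = (last_hidden_state * mask).sum(dim=1)   # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1e-9)         # (batch, 1)
    return summed / counts

# Dummy shapes matching BERT-base: batch=2, seq=8, hidden=768
hidden = torch.randn(2, 8, 768)
mask = torch.tensor([[1, 1, 1, 1, 0, 0, 0, 0],
                     [1, 1, 1, 1, 1, 1, 1, 1]])
sentence_emb = mean_pool(hidden, mask)  # shape: (2, 768)
```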
### Preprocessing (recommended)
```python
# Requires the `arabert` package: pip install arabert
from arabert.preprocess import ArabertPreprocessor
prep = ArabertPreprocessor(model_name="bert-base-arabertv02-twitter", keep_emojis=True)
text = prep.preprocess("كيفك يا خيي")
```
## Intended Uses
ShamiBERT is designed for NLP tasks involving Levantine Arabic dialect:
- **Sentiment Analysis** of Levantine social media
- **Text Classification** (topic, dialect sub-identification)
- **Named Entity Recognition** in Shami text
- **Feature Extraction** for downstream tasks
- **Dialect Identification** (Levantine vs other Arabic dialects)
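For the classification uses above, a common recipe is a linear head over the 768-dim `[CLS]` embedding (in practice `AutoModelForSequenceClassification` wires this up for you). A minimal sketch on dummy features, assuming three sentiment labels (the `SentimentHead` class and label count are illustrative):

```python
import torch
import torch.nn as nn

class SentimentHead(nn.Module):
    """Illustrative linear classifier over a [CLS] embedding."""
    def __init__(self, hidden_size=768, num_labels=3):
        super().__init__()
        self.dropout = nn.Dropout(0.1)
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, cls_embedding):
        return self.classifier(self.dropout(cls_embedding))

head = SentimentHead()
logits = head(torch.randn(4, 768))  # batch of 4 [CLS] embeddings
probs = logits.softmax(dim=-1)      # shape: (4, 3)
```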
## Limitations
- Training data is limited compared to large-scale models like SaudiBERT (26.3GB)
- Performance may vary across sub-dialects (Syrian vs Lebanese vs Jordanian vs Palestinian)
- Based on AraBERT-Twitter, which was trained with max_length=64
- Not suitable for MSA-heavy or non-Levantine dialect tasks
## Citation
```bibtex
@misc{shamibert2026,
  title={ShamiBERT: A BERT Model for Levantine Arabic Dialect},
  year={2026},
  note={Continual pre-training of AraBERT-Twitter on Levantine Arabic data}
}
```
## Acknowledgments
- **AraBERT** team (AUB MIND Lab) for the base model
- **ArSyra** team for Levantine dialect data
- **QCRI** for Arabic dialect resources
- **Unsloth** team for training optimizations