---
language:
  - ar
license: cc-by-nc-4.0
tags:
  - bert
  - Arabic BERT
  - Levantine Dialect
  - Shami
  - Syrian Arabic
  - Lebanese Arabic
  - Jordanian Arabic
  - Palestinian Arabic
  - Masked Language Model
  - Arabic NLP
datasets:
  - QCRI/arabic_pos_dialect
  - guymorlan/levanti
base_model: aubmindlab/bert-base-arabertv02-twitter
pipeline_tag: fill-mask
---

# ShamiBERT 🇸🇾🇱🇧🇯🇴🇵🇸

ShamiBERT is a BERT-based language model specialized for Levantine Arabic (الشامية) — the dialect spoken in Syria, Lebanon, Jordan, and Palestine.

## Model Description

ShamiBERT was created by performing continual pre-training on aubmindlab/bert-base-arabertv02-twitter using Masked Language Modeling (MLM) on Levantine Arabic text data.

### Architecture

- **Base Model:** AraBERTv0.2-Twitter (`aubmindlab/bert-base-arabertv02-twitter`)
- **Architecture:** BERT-base (12 layers, 12 attention heads, 768 hidden size)
- **Task:** Masked Language Modeling (MLM)
- **Training:** Continual pre-training on Levantine dialect data
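Given these dimensions, a back-of-the-envelope parameter count follows from the standard BERT-base formulas. This is a sketch, not the released checkpoint's exact count; in particular, the 64,000-token vocabulary is an assumption based on AraBERTv02:

```python
# Rough parameter count for a BERT-base encoder.
# Assumption: AraBERTv02's WordPiece vocabulary is ~64,000 tokens.
hidden = 768
layers = 12
vocab = 64_000       # assumed vocabulary size
max_pos = 512
type_vocab = 2

# Embeddings: token + position + segment tables, plus one LayerNorm (2 * hidden)
embeddings = (vocab + max_pos + type_vocab) * hidden + 2 * hidden

# Per encoder layer:
#   attention: Q, K, V, and output projections, each h*h + h
#   feed-forward: h -> 4h -> h (two weight matrices plus biases)
#   two LayerNorms: 2 * 2h
per_layer = (
    4 * (hidden * hidden + hidden)
    + (hidden * 4 * hidden + 4 * hidden)
    + (4 * hidden * hidden + hidden)
    + 4 * hidden
)

total = embeddings + layers * per_layer
print(f"~{total / 1e6:.0f}M parameters")
```

With these assumptions the encoder lands at roughly 135M parameters, most of it split between the 64k-entry embedding table and the 12 transformer layers.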

### Why AraBERT-Twitter as the base?

1. Pre-trained on 77 GB of Arabic text plus 60M Arabic tweets
2. Already handles dialectal Arabic and social-media text
3. Supports emojis in its vocabulary
4. Strong foundation for dialect-specific adaptation

## Training Data

ShamiBERT was trained on a combination of Levantine Arabic datasets:

| Dataset | Source | Description |
|---|---|---|
| QCRI Arabic POS (LEV) | HuggingFace | Levantine tweets with POS tags |
| Levanti | HuggingFace | Palestinian/Syrian/Lebanese/Jordanian sentences |
| Curated Shami | Manual | Hand-curated Levantine expressions and phrases |

## Training Details

- **Epochs:** 5
- **Learning Rate:** 2e-5
- **Batch Size:** 128 (effective)
- **Max Sequence Length:** 128
- **MLM Probability:** 0.15
- **Optimizer:** AdamW (β1=0.9, β2=0.999, ε=1e-6)
- **Weight Decay:** 0.01
- **Warmup:** 10% of training steps
- **Eval Perplexity:** 5.04
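The MLM objective selects each token with probability 0.15 and then applies the standard BERT 80/10/10 corruption recipe: replace with `[MASK]` 80% of the time, with a random token 10%, and leave unchanged 10%. A framework-free sketch of that selection logic (toy tokens and vocabulary, not the actual data collator used in training):

```python
import random

MLM_PROB = 0.15  # matches the MLM probability in the config above

def mask_tokens(tokens, vocab, mask_token="[MASK]", seed=0):
    """Toy MLM masking: pick each token with p=0.15, then replace it
    with [MASK] 80% of the time, a random vocab token 10% of the time,
    or keep it unchanged 10% of the time. Unselected positions get a
    None label and are excluded from the loss."""
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < MLM_PROB:
            labels.append(tok)            # model must predict the original
            roll = rng.random()
            if roll < 0.8:
                corrupted.append(mask_token)
            elif roll < 0.9:
                corrupted.append(rng.choice(vocab))
            else:
                corrupted.append(tok)     # kept as-is, but still scored
        else:
            labels.append(None)           # not scored by the loss
            corrupted.append(tok)
    return corrupted, labels

corrupted, labels = mask_tokens(["w"] * 1000, vocab=["a", "b"], seed=0)
print(sum(l is not None for l in labels), "of 1000 tokens selected")
```

Over a long input, roughly 15% of positions end up contributing to the MLM loss.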

## Usage

### Fill-Mask

```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="mabahboh/ShamiBERT")

# Shami example: "How are you [MASK], thank God"
results = fill_mask("كيفك [MASK] الحمدلله")
for r in results[:3]:
    print(f"{r['token_str']} ({r['score']:.4f})")
```

### Feature Extraction (for downstream tasks)

```python
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("mabahboh/ShamiBERT")
model = AutoModel.from_pretrained("mabahboh/ShamiBERT")

text = "شو أخبارك يا زلمة"  # "What's up, man?"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Use the [CLS] token embedding as a sentence representation
cls_embedding = outputs.last_hidden_state[:, 0, :]
```
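As an alternative to the `[CLS]` vector, sentence embeddings are often computed by mean-pooling the token embeddings while excluding padding positions via the attention mask. The pooling arithmetic, shown framework-free for clarity (in practice you would vectorize this in torch using `inputs["attention_mask"]`):

```python
def mean_pool(hidden_states, attention_mask):
    """Mask-aware mean pooling.
    hidden_states: [seq_len][dim] nested list of floats
    attention_mask: [seq_len] list of 0/1 (0 = padding)"""
    dim = len(hidden_states[0])
    summed = [0.0] * dim
    count = 0
    for vec, m in zip(hidden_states, attention_mask):
        if m:  # only real tokens contribute to the average
            count += 1
            for i in range(dim):
                summed[i] += vec[i]
    return [s / max(count, 1) for s in summed]

# The padding position (mask=0) is excluded from the average
emb = mean_pool([[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]], [1, 1, 0])
print(emb)  # [2.0, 3.0]
```

Mean pooling tends to be a stronger sentence representation than the raw `[CLS]` vector when the model has not been fine-tuned for sentence-level tasks.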

### Preprocessing (recommended)

```python
from arabert.preprocess import ArabertPreprocessor

prep = ArabertPreprocessor(model_name="bert-base-arabertv02-twitter", keep_emojis=True)
text = prep.preprocess("كيفك يا خيي")  # "How are you, brother?"
```

## Intended Uses

ShamiBERT is designed for NLP tasks involving the Levantine Arabic dialect:

- **Sentiment Analysis** of Levantine social media
- **Text Classification** (topic, dialect sub-identification)
- **Named Entity Recognition** in Shami text
- **Feature Extraction** for downstream tasks
- **Dialect Identification** (Levantine vs. other Arabic dialects)

## Limitations

- Training data is limited compared to large-scale models such as SaudiBERT (26.3 GB)
- Performance may vary across sub-dialects (Syrian vs. Lebanese vs. Jordanian vs. Palestinian)
- The base AraBERT-Twitter model was pre-trained with `max_length=64`, which may limit quality on longer sequences
- Not suitable for MSA-heavy or non-Levantine dialect tasks

## Citation

```bibtex
@misc{shamibert2026,
    title={ShamiBERT: A BERT Model for Levantine Arabic Dialect},
    year={2026},
    note={Continual pre-training of AraBERT-Twitter on Levantine Arabic data}
}
```

## Acknowledgments

- AraBERT team (AUB MIND Lab) for the base model
- ArSyra team for Levantine dialect data
- QCRI for Arabic dialect resources
- Unsloth team for training optimizations