Arabic Punctuation Restoration – AraBERT

Fine-tuned AraBERT v0.2 for automatic punctuation restoration in Arabic text.

The model takes unpunctuated Arabic text as input and predicts the punctuation mark that should follow each word.

Labels

ID  Symbol  Description
0   (none)  No punctuation
1   .       Period
2   ،       Arabic comma
3   ؟       Question mark
4   !       Exclamation mark
5   ؛       Arabic semicolon
6   :       Colon

Usage

from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("makdadTaleb/arabic-punctuation-arabert")
model = AutoModelForTokenClassification.from_pretrained("makdadTaleb/arabic-punctuation-arabert")
model.eval()

text = "ذهب الولد إلى المدرسة وعاد في المساء"
words = text.split()

encoding = tokenizer(
    words,
    is_split_into_words=True,
    return_tensors="pt",
    truncation=True,
    padding=True
)

with torch.no_grad():
    logits = model(**encoding).logits

preds = logits.argmax(dim=-1)[0]
word_ids = encoding.word_ids(batch_index=0)

id2label = {0: "", 1: ".", 2: "،", 3: "؟", 4: "!", 5: "؛", 6: ":"}

result = []
prev = None
for idx, word_idx in enumerate(word_ids):
    # Take the prediction at each word's first sub-token; skip special
    # tokens (word_idx is None) and continuation sub-tokens of the same word.
    if word_idx is None or word_idx == prev:
        continue
    result.append(words[word_idx] + id2label[preds[idx].item()])
    prev = word_idx

print(" ".join(result))

Training

Dataset: SSAC-UNPC, a large-scale Arabic corpus of UN proceedings.

The dataset was built with a custom pipeline to address severe class imbalance:

  • Extracted all sentences containing rare punctuation (!, ؟)
  • Extracted sentences rich in comma and semicolon combinations
  • Final training set: ~400,000 sentences
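
The rebalancing filters above can be sketched roughly as follows. This is a simplified illustration: the exact marks checked and the comma/semicolon density threshold are assumptions, and the released pipeline is in the linked repository.

```python
# Rough sketch of the class-rebalancing filters (thresholds are assumptions).
RARE_MARKS = ("!", "؟")

def keep_sentence(sent: str) -> bool:
    # Keep any sentence containing a rare mark ...
    if any(m in sent for m in RARE_MARKS):
        return True
    # ... or one that is dense in Arabic commas/semicolons.
    return sent.count("،") + sent.count("؛") >= 2

sentences = ["هل فهمت؟", "ذهب الولد إلى المدرسة.", "أولاً، ثانياً؛ ثالثاً،"]
selected = [s for s in sentences if keep_sentence(s)]
```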

Training setup:

  • Base model: aubmindlab/bert-base-arabertv02
  • Optimizer: AdamW (lr = 2e-5)
  • Scheduler: Linear warmup (10%) + linear decay
  • Loss: Weighted CrossEntropyLoss
  • Mixed precision: AMP (FP16)
  • Early stopping: patience = 2
  • Epochs: 3 | Batch size: 16 | Max length: 128
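
The loss, optimizer, and scheduler described above can be sketched in plain PyTorch. The class counts below are hypothetical placeholders (the card does not publish the exact weights used), and a small Linear layer stands in for AraBERT's token-classification head:

```python
import torch
import torch.nn as nn

NUM_LABELS = 7
# Hypothetical per-class token counts; real weights come from the corpus.
class_counts = torch.tensor(
    [900_000.0, 50_000.0, 30_000.0, 5_000.0, 1_000.0, 2_000.0, 8_000.0]
)
# Inverse-frequency class weights for the weighted cross-entropy loss
weights = class_counts.sum() / (NUM_LABELS * class_counts)
criterion = nn.CrossEntropyLoss(weight=weights, ignore_index=-100)

model = nn.Linear(768, NUM_LABELS)  # stand-in for the classification head
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# Linear warmup over the first 10% of steps, then linear decay to zero
total_steps = 3 * (400_000 // 16)  # epochs * (sentences / batch size)
warmup_steps = int(0.10 * total_steps)

def lr_lambda(step: int) -> float:
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# One toy step: flatten (batch, seq_len, labels) before the weighted loss
logits = model(torch.randn(16, 128, 768)).view(-1, NUM_LABELS)
labels = torch.randint(0, NUM_LABELS, (16 * 128,))
loss = criterion(logits, labels)
loss.backward()
optimizer.step()
scheduler.step()
```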

Two-Stage Decision for Arabic Comma:

The model uses a custom post-processing step for the Arabic comma class (،). If the comma probability exceeds a confidence threshold (τ = 0.70) and is higher than the no-punctuation probability, the model predicts a comma; otherwise it selects the best non-comma alternative. This raised comma F1 from 0.685 to 0.749.
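
A minimal sketch of that two-stage rule, applied to per-token softmax probabilities (e.g. `torch.softmax(logits, dim=-1)` from the Usage snippet). The function name is illustrative and the class IDs follow the Labels table; the released implementation may differ in detail:

```python
import torch

def two_stage_comma(probs: torch.Tensor, tau: float = 0.70,
                    comma_id: int = 2, o_id: int = 0) -> torch.Tensor:
    """probs: (num_tokens, num_classes) softmax probabilities."""
    # Stage 1: comma wins only if it clears the threshold AND beats "no punct"
    comma_ok = (probs[:, comma_id] > tau) & (probs[:, comma_id] > probs[:, o_id])
    # Stage 2: otherwise, pick the best class with the comma column masked out
    masked = probs.clone()
    masked[:, comma_id] = float("-inf")
    fallback = masked.argmax(dim=-1)
    return torch.where(comma_ok, torch.full_like(fallback, comma_id), fallback)
```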

Results

Evaluated on a held-out validation set (~1.2M tokens):

Class            Precision  Recall  F1
O (no punct.)    0.995      0.964   0.979
. (period)       0.993      0.999   0.996
، (comma)        0.646      0.891   0.749
؟ (question)     0.955      0.965   0.960
! (exclamation)  0.520      0.361   0.426
؛ (semicolon)    0.432      0.792   0.559
: (colon)        0.705      0.919   0.798
Weighted avg     0.970      0.960   0.963

Repository

Full training code and data pipeline: github.com/MakdadTaleb/arabic-punctuation-restoration
