Arabic Punctuation Restoration – AraBERT

Fine-tuned AraBERT v0.2 for automatic punctuation restoration in Arabic text.

The model takes unpunctuated Arabic text as input and predicts the punctuation mark that should follow each word.

Labels

ID  Symbol  Description
0   (none)  No punctuation
1   .       Period
2   ،       Arabic comma
3   ؟       Question mark
4   !       Exclamation mark
5   ؛       Arabic semicolon
6   :       Colon

Usage

from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("makdadTaleb/arabic-punctuation-arabert")
model = AutoModelForTokenClassification.from_pretrained("makdadTaleb/arabic-punctuation-arabert")
model.eval()

text = "ذهب الولد إلى المدرسة وعاد في المساء"
words = text.split()

encoding = tokenizer(
    words,
    is_split_into_words=True,
    return_tensors="pt",
    truncation=True,
    padding=True
)

with torch.no_grad():
    logits = model(**encoding).logits

preds = logits.argmax(dim=-1)[0]
word_ids = encoding.word_ids(batch_index=0)

id2label = {0: "", 1: ".", 2: "،", 3: "؟", 4: "!", 5: "؛", 6: ":"}

result = []
prev = None
for idx, word_idx in enumerate(word_ids):
    # Take the prediction at each word's first sub-token; skip special
    # tokens (word_idx is None) and continuation sub-tokens of the same word.
    if word_idx is None or word_idx == prev:
        continue
    result.append(words[word_idx] + id2label[preds[idx].item()])
    prev = word_idx

print(" ".join(result))

Training

Dataset: SSAC-UNPC, a large-scale Arabic corpus of UN proceedings.

The dataset was built with a custom pipeline to address severe class imbalance:

  • Extracted all sentences containing rare punctuation (!, ؟)
  • Extracted sentences rich in comma and semicolon combinations
  • Final training set: ~400,000 sentences
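
The rebalancing filters above can be sketched roughly as follows. This is a simplified illustration: the exact marks checked and the comma/semicolon density threshold are assumptions, and the released pipeline is in the linked repository.

```python
# Rough sketch of the class-rebalancing filters (thresholds are assumptions).
RARE_MARKS = ("!", "؟")

def keep_sentence(sent: str) -> bool:
    # Keep any sentence containing a rare mark ...
    if any(m in sent for m in RARE_MARKS):
        return True
    # ... or one that is dense in Arabic commas/semicolons.
    return sent.count("،") + sent.count("؛") >= 2

sentences = ["هل فهمت؟", "ذهب الولد إلى المدرسة.", "أولاً، ثانياً؛ ثالثاً،"]
selected = [s for s in sentences if keep_sentence(s)]
```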

Training setup:

  • Base model: aubmindlab/bert-base-arabertv02
  • Optimizer: AdamW (lr = 2e-5)
  • Scheduler: Linear warmup (10%) + linear decay
  • Loss: Weighted CrossEntropyLoss
  • Mixed precision: AMP (FP16)
  • Early stopping: patience = 2
  • Epochs: 3 | Batch size: 16 | Max length: 128
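
The loss, optimizer, and scheduler described above can be sketched in plain PyTorch. The class counts below are hypothetical placeholders (the card does not publish the exact weights used), and a small Linear layer stands in for AraBERT's token-classification head:

```python
import torch
import torch.nn as nn

NUM_LABELS = 7
# Hypothetical per-class token counts; real weights come from the corpus.
class_counts = torch.tensor(
    [900_000.0, 50_000.0, 30_000.0, 5_000.0, 1_000.0, 2_000.0, 8_000.0]
)
# Inverse-frequency class weights for the weighted cross-entropy loss
weights = class_counts.sum() / (NUM_LABELS * class_counts)
criterion = nn.CrossEntropyLoss(weight=weights, ignore_index=-100)

model = nn.Linear(768, NUM_LABELS)  # stand-in for the classification head
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# Linear warmup over the first 10% of steps, then linear decay to zero
total_steps = 3 * (400_000 // 16)  # epochs * (sentences / batch size)
warmup_steps = int(0.10 * total_steps)

def lr_lambda(step: int) -> float:
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# One toy step: flatten (batch, seq_len, labels) before the weighted loss
logits = model(torch.randn(16, 128, 768)).view(-1, NUM_LABELS)
labels = torch.randint(0, NUM_LABELS, (16 * 128,))
loss = criterion(logits, labels)
loss.backward()
optimizer.step()
scheduler.step()
```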

Two-Stage Decision for Arabic Comma:

The model uses a custom post-processing step for the Arabic comma class (،). If the comma probability exceeds a confidence threshold (τ = 0.70) and is higher than the no-punctuation probability, the model predicts a comma; otherwise it selects the best non-comma alternative. This raised comma F1 from 0.685 to 0.749.
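
A minimal sketch of that two-stage rule, applied to per-token softmax probabilities (e.g. `torch.softmax(logits, dim=-1)` from the Usage snippet). The function name is illustrative and the class IDs follow the Labels table; the released implementation may differ in detail:

```python
import torch

def two_stage_comma(probs: torch.Tensor, tau: float = 0.70,
                    comma_id: int = 2, o_id: int = 0) -> torch.Tensor:
    """probs: (num_tokens, num_classes) softmax probabilities."""
    # Stage 1: comma wins only if it clears the threshold AND beats "no punct"
    comma_ok = (probs[:, comma_id] > tau) & (probs[:, comma_id] > probs[:, o_id])
    # Stage 2: otherwise, pick the best class with the comma column masked out
    masked = probs.clone()
    masked[:, comma_id] = float("-inf")
    fallback = masked.argmax(dim=-1)
    return torch.where(comma_ok, torch.full_like(fallback, comma_id), fallback)
```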

Results

Evaluated on a held-out validation set (~1.2M tokens):

Class            Precision  Recall  F1
O (no punct.)    0.995      0.964   0.979
. (period)       0.993      0.999   0.996
، (comma)        0.646      0.891   0.749
؟ (question)     0.955      0.965   0.960
! (exclamation)  0.520      0.361   0.426
؛ (semicolon)    0.432      0.792   0.559
: (colon)        0.705      0.919   0.798
Weighted avg     0.970      0.960   0.963

Repository

Full training code and data pipeline: github.com/MakdadTaleb/arabic-punctuation-restoration
