🔤 Naqta — نقطة

Arabic Punctuation Restoration

Naqta (Arabic: نقطة, "dot/period") is an Arabic punctuation restoration model. Given unpunctuated Arabic text, it predicts which punctuation mark, if any, follows each word, using token-level classification on top of XLM-RoBERTa Large.

💡 Try it live on the Hugging Face Space

✨ What Does It Restore?

Symbol	Name	Example
`.`	Period	نهاية الجملة
`،`	Arabic comma	فاصلة عربية
`؟`	Arabic question mark	علامة استفهام
`!`	Exclamation mark	علامة تعجب
`:`	Colon	نقطتان
`؛`	Arabic semicolon	فاصلة منقوطة
`-`	Dash	شرطة

🏆 Results

Validation Metrics (v11d — Final)

Metric	Score
🎯 Macro F1	0.8960
✅ Accuracy	0.9714

Per-Class F1 Score

Class	Symbol	F1	Performance
Exclamation	`!`	0.8897	🟢 Excellent
Arabic semicolon	`؛`	0.8042	🟢 Excellent
Question mark	`؟`	0.9665	🟢 Excellent
Dash	`-`	0.9007	🟢 Excellent
Arabic comma	`،`	0.8100	🟢 Excellent
Period	`.`	0.8968	🟢 Excellent
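For readers who want to reproduce this kind of score on their own predictions, the sketch below shows how per-class F1 and macro F1 are conventionally computed for token-level labels. It is an illustration, not the evaluation script used for Naqta: the helper names, the toy labels, and the choice of which classes enter the average are assumptions, and positions labelled `-100` are skipped in the same way the Preprocessing section masks non-first subwords.

```python
from collections import Counter

IGNORE_INDEX = -100  # positions excluded from scoring (assumed convention)


def per_class_f1(y_true, y_pred, labels):
    """Return {label: F1}, computed one-vs-rest over the scored positions."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for gold, pred in zip(y_true, y_pred):
        if gold == IGNORE_INDEX:
            continue
        if gold == pred:
            tp[gold] += 1
        else:
            fp[pred] += 1  # predicted class gets a false positive
            fn[gold] += 1  # gold class gets a false negative
    scores = {}
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(scores):
    """Unweighted mean of the per-class F1 values."""
    return sum(scores.values()) / len(scores)


# Toy example with three classes; the fifth position is masked.
labels = ["O", "،", "."]
y_true = ["O", "،", "O", ".", IGNORE_INDEX, "O", "،", "."]
y_pred = ["O", "،", "،", ".", "O", "O", "O", "."]

scores = per_class_f1(y_true, y_pred, labels)
print(scores)
print(round(macro_f1(scores), 4))  # 0.7222
```

Because macro F1 weights every class equally, a rare class such as `؛` moves the average as much as the very frequent comma does.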

🗂️ Training Data

The model was trained on a large multi-source Arabic corpus totaling over 1.4 million paragraphs from six diverse sources, covering a broad range of Arabic writing styles and domains.

Corpus Sources

Source	Rows	Domain
ABC / UNPC	~1,020,000	News & formal Arabic (United Nations Parallel Corpus)
HF Tashkeel	~151,000	Vocalized Arabic text (diacritized corpus)
Hindawi E-Books	~100,000	Literary Arabic prose (novels & non-fiction)
Wikipedia (AR)	~98,500	Encyclopedia articles
CBT	~69,000	Classical Arabic books & religious texts
ARCD + XQuAD	~2,050	Arabic QA pairs (rich in question marks `؟`)
Total (raw)	~1,441,000	—

All paragraphs were filtered to contain at least one Arabic letter and one target punctuation mark. Non-target punctuation (e.g., «», …, parentheses) was removed before training.
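The filtering rule above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the original pipeline: the Arabic letter range (U+0621 to U+064A) and the exact list of non-target characters in `NON_TARGET` are guesses, and both function names are hypothetical.

```python
import re

TARGET_MARKS = ".،؟!:؛-"  # the seven marks the model restores
ARABIC_LETTER = re.compile(r"[\u0621-\u064A]")  # assumed range for Arabic letters
NON_TARGET = "«»…()[]{}\"'“”"  # assumed list of punctuation to drop
_STRIP_TABLE = str.maketrans({ch: " " for ch in NON_TARGET})


def keep_paragraph(paragraph):
    """True if the paragraph has an Arabic letter and at least one target mark."""
    has_arabic = ARABIC_LETTER.search(paragraph) is not None
    has_target = any(ch in TARGET_MARKS for ch in paragraph)
    return has_arabic and has_target


def strip_non_target(paragraph):
    """Replace non-target punctuation with spaces, then collapse whitespace."""
    return " ".join(paragraph.translate(_STRIP_TABLE).split())


print(keep_paragraph("مرحبا، كيف الحال"))     # has Arabic letters and a comma
print(keep_paragraph("hello, world."))         # no Arabic letters
print(strip_non_target("قال «نعم» (امس)…"))
```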

Punctuation Coverage (raw corpus)

Mark	Name	Paragraphs	Coverage
`،`	Arabic comma	922,721	64.0%
`:`	Colon	230,150	16.0%
`؛`	Arabic semicolon	128,744	8.9%
`؟`	Question mark	50,282	3.5%
`!`	Exclamation	15,976	1.1%
`-`	Dash	~1	<0.1%

Data Balance Strategy

To prevent the model from ignoring rare punctuation marks, a targeted oversampling strategy was applied:

Strategy	Marks	Multiplier	Cap
Strong oversampling	`؟` `!`	×8	80,000 rows
Light oversampling	`؛` `-`	×6	80,000 rows

After oversampling, the combined training pool grew to ~2.4 million paragraphs.
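The table leaves the exact mechanics open, so the sketch below shows only one plausible reading: for each rare mark, take at most `cap` paragraphs that contain it and append `multiplier` extra copies of each to the pool. That interpretation, the function name `oversample`, and the fixed random seed are assumptions; the original pipeline may differ.

```python
import random

# (marks, multiplier, cap) as listed in the table above.
OVERSAMPLING_TIERS = [
    (("؟", "!"), 8, 80_000),  # strong oversampling
    (("؛", "-"), 6, 80_000),  # light oversampling
]


def oversample(paragraphs, tiers=OVERSAMPLING_TIERS, seed=0):
    """Return the original paragraphs plus extra copies of rare-mark paragraphs."""
    rng = random.Random(seed)
    pool = list(paragraphs)
    for marks, multiplier, cap in tiers:
        for mark in marks:
            matches = [p for p in paragraphs if mark in p]
            if len(matches) > cap:
                # Assumed meaning of "cap": limit the source rows per mark.
                matches = rng.sample(matches, cap)
            pool.extend(matches * multiplier)
    rng.shuffle(pool)
    return pool


toy = ["جملة، عادية.", "هل فهمت؟", "رائع!", "اولا؛ ثانيا."]
pool = oversample(toy)
print(len(pool))  # 4 originals + 8 + 8 + 6 extra copies = 26
```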

Dataset Splits

Split	Sequences	Share
Train (capped)	1,000,000	85%
Validation	40,000	10%
Test	—	5%

Sliding-window context (window=3 sentences, stride=2) was applied to training data only
Validation and test sets remain un-windowed for clean, unbiased evaluation
Splits were stratified by the rarest punctuation mark in each sequence
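As a rough illustration of the windowing and stratification described above, the sketch below builds overlapping windows of 3 sentences with a stride of 2 and picks a stratification key per sequence. It assumes the text has already been split into sentences; the function names and the rarity ordering (taken from the coverage table, with the period placed last) are assumptions rather than the original implementation.

```python
def sliding_windows(sentences, window=3, stride=2):
    """Join consecutive sentences into overlapping training sequences."""
    if not sentences:
        return []
    windows = []
    for start in range(0, len(sentences), stride):
        windows.append(" ".join(sentences[start:start + window]))
        if start + window >= len(sentences):
            break  # the last sentence has been covered
    return windows


# Rarest mark first (assumed order, based on the coverage table).
RARITY_ORDER = ["-", "!", "؟", "؛", ":", "،", "."]


def stratum(sequence):
    """Return the rarest target mark in the sequence, used as the stratification key."""
    for mark in RARITY_ORDER:
        if mark in sequence:
            return mark
    return "none"


sentences = ["s1.", "s2.", "s3.", "s4.", "s5."]
print(sliding_windows(sentences))
# ['s1. s2. s3.', 's3. s4. s5.']
print(stratum("هل انت بخير؟ نعم، بخير."))
# ؟
```

With a stride smaller than the window, neighbouring sequences share one sentence, so each sentence is seen with context on both sides at least once.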

Preprocessing

Arabic normalization: alef variants → ا, ya variants → ي, diacritics stripped
Label assigned per word = punctuation mark following that word
Multi-subword words: only the first subword receives the label; others are masked (-100)
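The three preprocessing steps can be sketched in plain Python as below. This is an approximation under stated assumptions: the exact character sets for alef variants, ya variants, and diacritics are guesses, the label names follow the Quick Start convention (`"O"` for no punctuation, otherwise the mark itself), and `word_ids` stands for the list a fast tokenizer returns from `word_ids()`.

```python
import re

DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")          # assumed set of diacritics
ALEF_VARIANTS = re.compile(r"[\u0622\u0623\u0625\u0671]")  # آ أ إ ٱ
YA_VARIANTS = re.compile(r"[\u0649\u06CC]")                # ى and Farsi ya
TARGET_MARKS = ".،؟!:؛-"
IGNORE_INDEX = -100


def normalize(text):
    """Strip diacritics and unify alef and ya variants."""
    text = DIACRITICS.sub("", text)
    text = ALEF_VARIANTS.sub("\u0627", text)
    return YA_VARIANTS.sub("\u064A", text)


def words_and_labels(text):
    """Split punctuated text into bare words and the mark following each word."""
    words, labels = [], []
    for token in normalize(text).split():
        stripped = token.rstrip(TARGET_MARKS)
        trailing = token[len(stripped):]
        if not stripped:
            # Stand-alone mark (e.g. " - "): attach it to the previous word.
            if labels and labels[-1] == "O":
                labels[-1] = trailing[0]
            continue
        words.append(stripped)
        labels.append(trailing[0] if trailing else "O")
    return words, labels


def align_labels(word_ids, word_labels, label2id):
    """Label only the first subword of each word; mask everything else."""
    aligned, previous = [], None
    for word_id in word_ids:
        if word_id is None or word_id == previous:
            aligned.append(IGNORE_INDEX)  # special token or non-first subword
        else:
            aligned.append(label2id[word_labels[word_id]])
        previous = word_id
    return aligned


words, labels = words_and_labels("المدرسة، ثم عاد.")
print(words, labels)
# word_ids as a tokenizer might return them: the first word is split in two subwords.
print(align_labels([None, 0, 0, 1, 2, None], labels, {"O": 0, "،": 1, ".": 2}))
# [-100, 1, -100, 0, 2, -100]
```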

⚙️ Model Architecture & Training

Setting	Value
Base model	`xlm-roberta-large` (~560M params)
Task	Token classification (8 labels)
Max sequence length	384 tokens
Training examples	1,000,000 (capped)
Validation examples	40,000

Two-Phase Training

Phase	Epochs	LR	Loss	Notes
Phase 1	2	2e-5	Cross-entropy + label smoothing	Full model fine-tuning
Phase 2	1	6e-6	Focal loss (γ=2.0) + class weights	Bottom 12 layers frozen

Class Weights

On top of the base class weights, the weights of the rare classes were multiplied by the following boost factors:

Class	Boost
`؟`	×1.2
`!`	×3.0
`؛`	×2.0
`-`	×1.3
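To make the Phase 2 loss concrete, here is a plain-Python sketch of a class-weighted focal loss with γ=2.0 and the boost factors from the table above. It illustrates the formula only and is not the training code: the label order in `LABELS`, the base weights of 1.0, and the simple mean over scored tokens are all assumptions.

```python
import math

IGNORE_INDEX = -100
GAMMA = 2.0  # focusing parameter from the Phase 2 row

LABELS = ["O", ".", "،", "؟", "!", ":", "؛", "-"]    # hypothetical id order
BOOSTS = {"؟": 1.2, "!": 3.0, "؛": 2.0, "-": 1.3}    # factors from the table above


def class_weights(base_weights):
    """Multiply each base weight by its boost factor (1.0 if not listed)."""
    return [w * BOOSTS.get(label, 1.0) for label, w in zip(LABELS, base_weights)]


def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def focal_loss(logits_per_token, targets, weights, gamma=GAMMA):
    """Mean over scored tokens of w[y] * (1 - p_y)^gamma * -log(p_y)."""
    total, count = 0.0, 0
    for logits, target in zip(logits_per_token, targets):
        if target == IGNORE_INDEX:
            continue  # masked subword or special token
        p_true = softmax(logits)[target]
        total += weights[target] * (1.0 - p_true) ** gamma * -math.log(p_true)
        count += 1
    return total / count if count else 0.0


weights = class_weights([1.0] * len(LABELS))  # base weights of 1.0 are a placeholder
uniform = [0.0] * len(LABELS)
confident = [8.0] + [0.0] * (len(LABELS) - 1)
print(focal_loss([uniform], [LABELS.index("!")], weights))
print(focal_loss([confident], [0], weights))
```

The factor `(1 - p_y)^gamma` shrinks the loss on tokens the model already classifies confidently, which in this task are mostly the frequent `O` positions, so the remaining gradient concentrates on harder and rarer marks.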

🚀 Quick Start

from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

repo_id = "MostafaMaroof/Naqta"

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForTokenClassification.from_pretrained(repo_id)
model.eval()

id2label = model.config.id2label

text = "بلغت نسبة النمو الاقتصادي 4.7 بالمئة خلال الربع الثالث من عام 2024 وهو اعلى مستوى منذ خمس سنوات"
words = text.split()

inputs = tokenizer(
    words,
    is_split_into_words=True,
    return_tensors="pt",
    truncation=True,
    max_length=384,
)

with torch.no_grad():
    logits = model(**inputs).logits

# One predicted label id per subword token (special tokens included).
pred_ids = logits.argmax(dim=-1)[0].tolist()
# Index of the source word for each token; None marks special tokens.
word_ids = inputs.word_ids(batch_index=0)

restored_words = []
previous_word_id = None
for pred_id, word_id in zip(pred_ids, word_ids):
    # Only the first subword of each word carries the label.
    if word_id is None or word_id == previous_word_id:
        continue
    word = words[word_id]
    label = id2label[pred_id]
    if label != "O":
        word = word + label
    restored_words.append(word)
    previous_word_id = word_id

restored_text = " ".join(restored_words)
print(restored_text)
# → بلغت نسبة النمو الاقتصادي 4.7 بالمئة خلال الربع الثالث من عام 2024، وهو اعلى مستوى منذ خمس سنوات.

📖 Example

Input (unpunctuated):

اذا اردت ان تنجح في حياتك فعليك ان تحدد اهدافك واضحة وان تعمل بجد واستمرارية ولا تيأس عند اول عقبة تواجهها

Output (restored):

اذا اردت ان تنجح في حياتك، فعليك ان تحدد اهدافك واضحة، وان تعمل بجد واستمرارية، ولا تيأس عند اول عقبة تواجهها.

Question example, input (unpunctuated):

من اخترع الهاتف وفي اي سنة تم ذلك وما هي اهمية هذا الاختراع

Output (restored):

من اخترع الهاتف، وفي اي سنة تم ذلك، وما هي اهمية هذا الاختراع؟

🎯 Intended Use

Naqta is well-suited for:

🎙️ ASR post-processing — restoring punctuation in Arabic speech transcripts
📄 Readability enhancement — making raw Arabic text easier to read
🔧 NLP preprocessing — improving text quality for downstream Arabic NLP tasks
🔬 Research — Arabic punctuation restoration benchmark evaluation

⚠️ Limitations

Punctuation restoration is partly stylistic — multiple valid outputs may exist for a single input.
Performance may degrade on highly dialectal, technical, or domain-specific text.
The model does not predict quotation marks or dialogue markers («»).
Very short or fragmented text (fewer than 5 words) may produce less reliable results.
The model only inserts punctuation marks after words; it does not correct grammar.

📜 License

This model is released under the MIT License.

🔗 Citation

If you use Naqta in your work, please cite:

@misc{naqta2025,
  title     = {Naqta: Arabic Punctuation Restoration with XLM-RoBERTa},
  author    = {MostafaMaroof},
  year      = {2025},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/MostafaMaroof/Naqta}
}
