🔤 Naqta (نقطة)

Arabic Punctuation Restoration



Naqta (Arabic: نقطة, "dot/period") is an Arabic punctuation restoration model. Given plain, unpunctuated Arabic text, it predicts which punctuation mark, if any, follows each word, using token classification on top of XLM-RoBERTa Large.

💡 Try it live on the Hugging Face Space


✨ What Does It Restore?

| Symbol | Name | Arabic name |
|---|---|---|
| . | Period | نهاية الجملة |
| ، | Arabic comma | فاصلة عربية |
| ؟ | Arabic question mark | علامة استفهام |
| ! | Exclamation mark | علامة تعجب |
| : | Colon | نقطتان |
| ؛ | Arabic semicolon | فاصلة منقوطة |
| - | Dash | شرطة |

🏆 Results

Validation Metrics (v11d, final)

| Metric | Score |
|---|---|
| 🎯 Macro F1 | 0.8960 |
| ✅ Accuracy | 0.9714 |

Per-Class F1 Score

| Class | Symbol | F1 | Performance |
|---|---|---|---|
| Exclamation | ! | 0.8897 | 🟢 Excellent |
| Arabic semicolon | ؛ | 0.8042 | 🟢 Excellent |
| Question mark | ؟ | 0.9665 | 🟢 Excellent |
| Dash | - | 0.9007 | 🟢 Excellent |
| Arabic comma | ، | 0.8100 | 🟢 Excellent |
| Period | . | 0.8968 | 🟢 Excellent |
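
For readers who want to recompute these metrics, here is a minimal, dependency-free sketch of per-class and macro F1 over word-level labels. The function names are illustrative and are not taken from the Naqta training code. The six per-class values listed above average to about 0.878, so the reported macro F1 of 0.8960 presumably also averages over labels that the table does not list (for example the colon, and possibly the "O" class). That reading is an inference, not something the card states.

```python
from collections import Counter


def per_class_f1(y_true, y_pred, labels):
    """F1 per label, computed as 2*TP / (2*TP + FP + FN)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted label was wrong
            fn[truth] += 1  # true label was missed
    scores = {}
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(scores):
    """Unweighted mean of the per-class F1 values."""
    return sum(scores.values()) / len(scores)


# Toy labels (hypothetical, for illustration only).
y_true = ["O", "O", "،", ".", "،"]
y_pred = ["O", "،", "،", ".", "O"]
toy_scores = per_class_f1(y_true, y_pred, labels=["O", "،", "."])
print(toy_scores, round(macro_f1(toy_scores), 4))

# Averaging only the six values listed in the table above.
listed_f1 = {"!": 0.8897, "؛": 0.8042, "؟": 0.9665,
             "-": 0.9007, "،": 0.8100, ".": 0.8968}
print(round(macro_f1(listed_f1), 4))
```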

🗂️ Training Data

The model was trained on a multi-source Arabic corpus of roughly 1.44 million paragraphs drawn from six source groups, spanning news, literary, encyclopedic, classical, and question-answering text.

Corpus Sources

| Source | Rows | Domain |
|---|---|---|
| ABC / UNPC | ~1,020,000 | News & formal Arabic (United Nations Parallel Corpus) |
| HF Tashkeel | ~151,000 | Vocalized Arabic text (diacritized corpus) |
| Hindawi E-Books | ~100,000 | Literary Arabic prose (novels & non-fiction) |
| Wikipedia (AR) | ~98,500 | Encyclopedia articles |
| CBT | ~69,000 | Classical Arabic books & religious texts |
| ARCD + XQuAD | ~2,050 | Arabic QA pairs (rich in question marks ؟) |
| Total (raw) | ~1,441,000 | - |

Paragraphs were kept only if they contained at least one Arabic letter and at least one target punctuation mark. Non-target punctuation (e.g., «», …, parentheses) was removed before training.
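
The filtering rule can be sketched as follows. This is an illustration under stated assumptions, not the project's actual preprocessing code: the Arabic-letter range (U+0621 to U+064A) and the exact set of non-target characters in `NON_TARGET` are guesses, and the function names are hypothetical.

```python
import re

# The seven target marks listed in the card.
TARGET_MARKS = ".،؟!:؛-"

# Assumption: "Arabic letter" means a character in the basic Arabic letter range.
ARABIC_LETTER = re.compile(r"[\u0621-\u064A]")

# Assumption: an illustrative set of non-target marks (guillemets, ellipsis,
# parentheses, brackets, straight double quote). The real list is not given.
NON_TARGET = re.compile(r'[«»…()\[\]"]')


def keep_paragraph(text):
    """True if the text has at least one Arabic letter and one target mark."""
    has_letter = ARABIC_LETTER.search(text) is not None
    has_mark = any(mark in text for mark in TARGET_MARKS)
    return has_letter and has_mark


def strip_non_target(text):
    """Replace non-target marks with spaces and collapse whitespace."""
    return " ".join(NON_TARGET.sub(" ", text).split())


print(keep_paragraph("مرحبا، كيف الحال؟"))
print(strip_non_target("قال «نعم» (أمس)…"))
```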

Punctuation Coverage (raw corpus)

| Mark | Name | Paragraphs | Coverage |
|---|---|---|---|
| ، | Arabic comma | 922,721 | 64.0% |
| : | Colon | 230,150 | 16.0% |
| ؛ | Arabic semicolon | 128,744 | 8.9% |
| ؟ | Question mark | 50,282 | 3.5% |
| ! | Exclamation | 15,976 | 1.1% |
| - | Dash | ~1 | <0.1% |

Data Balance Strategy

To prevent the model from ignoring rare punctuation marks, a targeted oversampling strategy was applied:

| Strategy | Marks | Multiplier | Cap |
|---|---|---|---|
| Strong oversampling | ؟ ! | ×8 | 80,000 rows |
| Light oversampling | ؛ - | ×6 | 80,000 rows |

After oversampling, the combined training pool grew to ~2.4 million paragraphs.
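
The card does not spell out how the multiplier and the cap interact. One possible reading is sketched below: "×N" is taken to mean that each matching paragraph ends up appearing N times in total, and the cap is taken to limit the number of extra rows added per strategy. Both interpretations, and the `oversample` helper itself, are assumptions made for illustration.

```python
import random


def oversample(paragraphs, marks, multiplier, cap, seed=0):
    """Return the extra copies to add for paragraphs containing any of `marks`.

    Assumptions (not confirmed by the card):
    - "xN" means N total occurrences, so N - 1 extra copies per paragraph.
    - `cap` bounds the number of extra rows; if exceeded, a random subset is kept.
    """
    rng = random.Random(seed)
    matching = [p for p in paragraphs if any(m in p for m in marks)]
    extra = matching * (multiplier - 1)
    if len(extra) > cap:
        extra = rng.sample(extra, cap)
    return extra


# Toy corpus (hypothetical).
corpus = ["هل وصلت؟", "وصلت اليوم.", "يا للروعة!"]
strong_extra = oversample(corpus, marks="؟!", multiplier=8, cap=80_000)
training_pool = corpus + strong_extra
print(len(strong_extra), len(training_pool))
```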

Dataset Splits

| Split | Sequences | Share |
|---|---|---|
| Train (capped) | 1,000,000 | 85% |
| Validation | 40,000 | 10% |
| Test | - | 5% |

  • Sliding-window context (window=3 sentences, stride=2) was applied to training data only
  • Validation and test sets remain un-windowed for clean, unbiased evaluation
  • Splits were stratified by the rarest punctuation mark in each sequence
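
A rough sketch of the two ideas above: building overlapping windows of 3 sentences with a stride of 2, and picking a stratification key from the rarest mark in a sequence. How sentences were segmented, how a short final window was handled, and the exact rarity order are not stated in the card. The order used here follows the coverage table, with the period placed last because its coverage is not listed. All names are hypothetical.

```python
def sliding_windows(sentences, window=3, stride=2):
    """Join consecutive sentences into overlapping windows.

    The last window may hold fewer than `window` sentences.
    """
    if not sentences:
        return []
    windows = []
    for start in range(0, len(sentences), stride):
        windows.append(" ".join(sentences[start:start + window]))
        if start + window >= len(sentences):
            break  # this window already reached the end of the text
    return windows


# Assumed rarity order, rarest first, derived from the coverage table.
RARITY_ORDER = ["-", "!", "؟", "؛", ":", "،", "."]


def rarest_mark(text, rarity_order=RARITY_ORDER):
    """Stratification key: the rarest target mark present, or "none"."""
    for mark in rarity_order:
        if mark in text:
            return mark
    return "none"


sentences = ["s1.", "s2.", "s3.", "s4.", "s5."]
print(sliding_windows(sentences))
print(rarest_mark("هل وصلت؟ نعم، وصلت."))
```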

Preprocessing

  • Arabic normalization: alef variants → ا, ya variants → ي, diacritics stripped
  • Label assigned per word = punctuation mark following that word
  • Multi-subword words: only the first subword receives the label; others are masked (-100)
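
The three preprocessing steps can be sketched without any ML dependencies. The character sets below (which alef and ya forms count as "variants", which code points count as diacritics) are assumptions, as are the function names. `word_ids` stands for the per-token word indices that a Hugging Face fast tokenizer returns, with `None` for special tokens.

```python
import re

MARK_CHARS = ".،؟!:؛-"

# Assumed character sets; the card does not list them explicitly.
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")           # tanween .. sukun, dagger alef
ALEF_VARIANTS = re.compile(r"[\u0622\u0623\u0625\u0671]")   # madda, hamza above/below, wasla
YA_VARIANTS = re.compile(r"[\u06CC]")                       # Farsi yeh


def normalize(text):
    """Strip diacritics and map alef/ya variants to their plain forms."""
    text = DIACRITICS.sub("", text)
    text = ALEF_VARIANTS.sub("\u0627", text)
    return YA_VARIANTS.sub("\u064A", text)


def words_and_labels(text):
    """Split punctuated text into words and the mark following each word."""
    words, labels = [], []
    for token in normalize(text).split():
        core = token.rstrip(MARK_CHARS)
        trailing = token[len(core):]
        if not core:
            # Stand-alone mark (e.g. " - "): attach it to the previous word.
            if words and labels[-1] == "O":
                labels[-1] = trailing[0]
            continue
        words.append(core)
        labels.append(trailing[0] if trailing else "O")
    return words, labels


def align_labels(word_ids, word_label_ids, ignore_index=-100):
    """Label only the first subword of each word; mask everything else."""
    aligned, previous = [], None
    for word_id in word_ids:
        if word_id is None or word_id == previous:
            aligned.append(ignore_index)
        else:
            aligned.append(word_label_ids[word_id])
        previous = word_id
    return aligned


print(words_and_labels("مرحبا، كيف الحال؟"))
print(align_labels([None, 0, 0, 1, 2, 2, None], [1, 0, 2]))
```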

⚙️ Model Architecture & Training

| Setting | Value |
|---|---|
| Base model | xlm-roberta-large (~560M params) |
| Task | Token classification (8 labels) |
| Max sequence length | 384 tokens |
| Training examples | 1,000,000 (capped) |
| Validation examples | 40,000 |

Two-Phase Training

| Phase | Epochs | LR | Loss | Notes |
|---|---|---|---|---|
| Phase 1 | 2 | 2e-5 | Cross-entropy + label smoothing | Full model fine-tuning |
| Phase 2 | 1 | 6e-6 | Focal loss (γ=2.0) + class weights | Bottom 12 layers frozen |
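
For reference, the standard focal loss for a single token is FL = -w_t (1 - p_t)^γ log(p_t), where p_t is the softmax probability of the true class and w_t its class weight. The framework-free sketch below illustrates that formula only. The actual training code is not shown in the card, so the function names and the way class weights enter the loss are assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def focal_loss(logits, target, gamma=2.0, class_weights=None):
    """Focal loss for one token; gamma=0 reduces to weighted cross-entropy."""
    p_t = softmax(logits)[target]
    weight = class_weights[target] if class_weights else 1.0
    return -weight * (1.0 - p_t) ** gamma * math.log(p_t)


logits = [2.0, 0.5, -1.0]                        # toy scores for three classes
ce = focal_loss(logits, target=0, gamma=0.0)     # plain cross-entropy
fl = focal_loss(logits, target=0, gamma=2.0)     # easy example is down-weighted
print(ce, fl)
```

With γ=2.0, confidently correct tokens contribute much less than under cross-entropy, which shifts the training signal toward hard and rare cases.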

Class Weights

The class weights of the rare marks were additionally multiplied by the following boost factors:

| Class | Boost |
|---|---|
| ؟ | ×1.2 |
| ! | ×3.0 |
| ؛ | ×2.0 |
| - | ×1.3 |
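
The card gives the boost factors but not the base weights they were applied to. The sketch below assumes inverse-frequency ("balanced") base weights and multiplies them by the boosts. The base-weight formula, the label counts, and the `class_weights` helper are hypothetical.

```python
# Boost factors from the card.
BOOST = {"؟": 1.2, "!": 3.0, "؛": 2.0, "-": 1.3}


def class_weights(label_counts, boost=BOOST):
    """Inverse-frequency base weight per label, times its boost factor.

    Assumption: base weight = total / (num_labels * count). The card does not
    state which base weighting was used.
    """
    total = sum(label_counts.values())
    num_labels = len(label_counts)
    weights = {}
    for label, count in label_counts.items():
        base = total / (num_labels * count)
        weights[label] = base * boost.get(label, 1.0)
    return weights


# Hypothetical label counts, for illustration only.
toy_counts = {"O": 900, "،": 60, "؟": 30, "!": 10}
toy_weights = class_weights(toy_counts)
print(toy_weights)
```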

🚀 Quick Start

```python
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

repo_id = "MostafaMaroof/Naqta"

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForTokenClassification.from_pretrained(repo_id)
model.eval()

# Maps predicted class ids to label strings ("O" means no punctuation).
id2label = model.config.id2label

text = "بلغت نسبة النمو الاقتصادي 4.7 بالمئة خلال الربع الثالث من عام 2024 وهو اعلى مستوى منذ خمس سنوات"
words = text.split()

# Tokenize pre-split words so each subword can be traced back to its word.
inputs = tokenizer(
    words,
    is_split_into_words=True,
    return_tensors="pt",
    truncation=True,  # words beyond max_length are dropped from the output
    max_length=384,
)

with torch.no_grad():
    logits = model(**inputs).logits

pred_ids = logits.argmax(dim=-1)[0].tolist()
word_ids = inputs.word_ids(batch_index=0)

# Use the prediction on the first subword of each word; skip special tokens
# (word_id is None) and continuation subwords.
restored_words = []
previous_word_id = None
for pred_id, word_id in zip(pred_ids, word_ids):
    if word_id is None or word_id == previous_word_id:
        continue
    word = words[word_id]
    label = id2label[pred_id]
    if label != "O":
        # Non-"O" labels are appended directly as the punctuation mark.
        word = word + label
    restored_words.append(word)
    previous_word_id = word_id

restored_text = " ".join(restored_words)
print(restored_text)
# → بلغت نسبة النمو الاقتصادي 4.7 بالمئة خلال الربع الثالث من عام 2024، وهو اعلى مستوى منذ خمس سنوات.
```

📖 Example

Input (unpunctuated):

اذا اردت ان تنجح في حياتك فعليك ان تحدد اهدافك واضحة وان تعمل بجد واستمرارية ولا تيأس عند اول عقبة تواجهها

Output (restored):

اذا اردت ان تنجح في حياتك، فعليك ان تحدد اهدافك واضحة، وان تعمل بجد واستمرارية، ولا تيأس عند اول عقبة تواجهها.

Question example:

من اخترع الهاتف وفي اي سنة تم ذلك وما هي اهمية هذا الاختراع
من اخترع الهاتف، وفي اي سنة تم ذلك، وما هي اهمية هذا الاختراع؟

🎯 Intended Use

Naqta is well-suited for:

  • 🎙️ ASR post-processing: restoring punctuation in Arabic speech transcripts
  • 📄 Readability enhancement: making raw Arabic text easier to read
  • 🔧 NLP preprocessing: improving text quality for downstream Arabic NLP tasks
  • 🔬 Research: Arabic punctuation restoration benchmark evaluation

⚠️ Limitations

  • Punctuation restoration is partly stylistic; multiple valid outputs may exist for a single input.
  • Performance may degrade on highly dialectal, technical, or domain-specific text.
  • The model does not predict quotation marks or dialogue markers («»).
  • Very short or fragmented text (fewer than 5 words) may produce less reliable results.
  • The model only inserts punctuation; it does not perform grammar correction.

📜 License

This model is released under the MIT License.


🔗 Citation

If you use Naqta in your work, please reference:

@misc{naqta2025,
  title     = {Naqta: Arabic Punctuation Restoration with XLM-RoBERTa},
  author    = {MostafaMaroof},
  year      = {2025},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/MostafaMaroof/Naqta}
}