---
language:
- ar
license: mit
tags:
- arabic
- punctuation-restoration
- token-classification
- xlm-roberta
- natural-language-processing
pipeline_tag: token-classification
base_model: xlm-roberta-large
model-index:
- name: Naqta
  results:
  - task:
      type: token-classification
      name: Arabic Punctuation Restoration
    dataset:
      name: Mixed Arabic punctuation restoration corpus
      type: custom
    metrics:
    - type: f1
      value: 0.8960
      name: Validation Macro F1
    - type: accuracy
      value: 0.9714
      name: Validation Accuracy
---


<div align="center">

# 🔤 Naqta (نقطة)

### Arabic Punctuation Restoration

</div>

---


**Naqta** (Arabic: نقطة, "dot/period") is an Arabic punctuation restoration model that reaches a validation macro F1 of 0.8960. Given plain, unpunctuated Arabic text, it predicts which punctuation mark, if any, follows each word. The task is framed as token classification on top of **XLM-RoBERTa Large**.

> 💡 **Try it live** on the [Hugging Face Space](https://huggingface.co/spaces/MostafaMaroof/Naqta)

---


## ✨ What Does It Restore?

| Symbol | Name | Arabic description |
|:---:|---|---|
| `.` | Period | نهاية الجملة |
| `،` | Arabic comma | فاصلة عربية |
| `؟` | Arabic question mark | علامة استفهام |
| `!` | Exclamation mark | علامة تعجب |
| `:` | Colon | نقطتان |
| `؛` | Arabic semicolon | فاصلة منقوطة |
| `-` | Dash | شرطة |

---


## 🏆 Results

### Validation Metrics (v11d, Final)

| Metric | Score |
|---|---:|
| 🎯 **Macro F1** | **0.8960** |
| ✅ Accuracy | 0.9714 |

### Per-Class F1 Score

The table lists six of the model's eight labels; per-class scores for the colon (`:`) and for the no-punctuation label (`O`) are not listed here.

| Class | Symbol | F1 | Performance |
|---|:---:|---:|---|
| Exclamation | `!` | 0.8897 | 🟢 Excellent |
| Arabic semicolon | `؛` | 0.8042 | 🟢 Excellent |
| Question mark | `؟` | 0.9665 | 🟢 Excellent |
| Dash | `-` | 0.9007 | 🟢 Excellent |
| Arabic comma | `،` | 0.8100 | 🟢 Excellent |
| Period | `.` | 0.8968 | 🟢 Excellent |

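
Macro F1 is the unweighted mean of the per-class F1 scores, so a rare mark counts as much as a frequent one. The sketch below shows that computation in plain Python on a toy example. The helper names `per_class_f1` and `macro_f1` and the toy labels are illustrative assumptions; this is not the evaluation script behind the numbers above.

```python
from collections import Counter


def per_class_f1(gold, pred):
    """Return {label: F1} for every label seen in gold or pred."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # p was predicted but is wrong here
            fn[g] += 1  # the gold label g was missed
    scores = {}
    for label in set(gold) | set(pred):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(gold, pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_f1(gold, pred)
    return sum(scores.values()) / len(scores)


# Toy word-level labels; "O" means no punctuation follows the word.
gold = ["O", "O", "،", "O", ".", "O", "؟"]
pred = ["O", "O", "،", "O", ".", "،", "؟"]
print(per_class_f1(gold, pred))
print(round(macro_f1(gold, pred), 4))
```
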

---

## 🗂️ Training Data

The model was trained on a large multi-source Arabic corpus totaling over **1.4 million paragraphs** from six diverse sources, covering a broad range of Arabic writing styles and domains.

### Corpus Sources

| Source | Rows | Domain |
|---|---:|---|
| **ABC / UNPC** | ~1,020,000 | News & formal Arabic (United Nations Parallel Corpus) |
| **HF Tashkeel** | ~151,000 | Vocalized Arabic text (diacritized corpus) |
| **Hindawi E-Books** | ~100,000 | Literary Arabic prose (novels & non-fiction) |
| **Wikipedia (AR)** | ~98,500 | Encyclopedia articles |
| **CBT** | ~69,000 | Classical Arabic books & religious texts |
| **ARCD + XQuAD** | ~2,050 | Arabic QA pairs (rich in question marks `؟`) |
| **Total (raw)** | **~1,441,000** | n/a |

> All paragraphs were filtered to contain at least one Arabic letter and one target punctuation mark. Non-target punctuation (e.g., `«»`, `…`, parentheses) was removed before training.

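
The filtering rule above can be sketched as two small helpers. This is a minimal sketch, not the pipeline that was actually used: the Unicode range taken to mean "Arabic letter", the exact set of non-target characters, and the function names `keep_paragraph` and `strip_non_target` are all assumptions made for illustration.

```python
import re

# The seven target marks listed in this card.
TARGET_MARKS = ".،؟!:؛-"
# Assumed definition of "Arabic letter": the basic block from hamza to ya.
ARABIC_LETTER = re.compile(r"[\u0621-\u064A]")
# Assumed non-target set: guillemets, ellipsis, parentheses, square brackets.
NON_TARGET = re.compile(r"[«»…()\[\]]")


def keep_paragraph(text):
    """True if the paragraph has an Arabic letter and a target mark."""
    has_letter = ARABIC_LETTER.search(text) is not None
    has_mark = any(ch in TARGET_MARKS for ch in text)
    return has_letter and has_mark


def strip_non_target(text):
    """Replace non-target punctuation with spaces, then tidy whitespace."""
    cleaned = NON_TARGET.sub(" ", text)
    return re.sub(r"\s+", " ", cleaned).strip()


print(keep_paragraph("قال الرجل: مرحبا"))
print(strip_non_target("قال «مرحبا» (له)…"))
```

Note that under this rule a decimal point inside a number such as `4.7` also counts as a target mark, which may or may not match the original filter.
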

### Punctuation Coverage (raw corpus)

| Mark | Name | Paragraphs | Coverage |
|:---:|---|---:|---:|
| `،` | Arabic comma | 922,721 | 64.0% |
| `:` | Colon | 230,150 | 16.0% |
| `؛` | Arabic semicolon | 128,744 | 8.9% |
| `؟` | Question mark | 50,282 | 3.5% |
| `!` | Exclamation | 15,976 | 1.1% |
| `-` | Dash | ~1 | <0.1% |

### Data Balance Strategy

To prevent the model from ignoring rare punctuation marks, a targeted oversampling strategy was applied:

| Strategy | Marks | Multiplier | Cap |
|---|:---:|:---:|---:|
| Strong oversampling | `؟` `!` | ×8 | 80,000 rows |
| Light oversampling | `؛` `-` | ×6 | 80,000 rows |

After oversampling, the combined training pool grew to **~2.4 million paragraphs**.

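
One way to read the table above is sketched below. The card does not say exactly how the multiplier and the cap interact, so this sketch assumes that a paragraph containing any mark of a group receives `multiplier - 1` extra copies and that the extra copies added per group are truncated at `cap` rows. The function name `oversample` and the seed are hypothetical.

```python
import random

# (marks, multiplier, cap) as listed in the table above.
OVERSAMPLING_RULES = [
    ("؟!", 8, 80_000),  # strong oversampling
    ("؛-", 6, 80_000),  # light oversampling
]


def oversample(paragraphs, rules=OVERSAMPLING_RULES, seed=0):
    """Return the original paragraphs plus capped extra copies of rare ones."""
    rng = random.Random(seed)
    pool = list(paragraphs)
    for marks, multiplier, cap in rules:
        # Paragraphs that contain at least one mark of this group.
        matches = [p for p in paragraphs if any(m in p for m in marks)]
        extra = matches * (multiplier - 1)
        rng.shuffle(extra)
        pool.extend(extra[:cap])  # assumed meaning of the cap
    rng.shuffle(pool)
    return pool


sample = ["هل وصلت؟", "وصلت اليوم.", "انتظر؛ ثم عد."]
balanced = oversample(sample)
print(len(balanced))
```

With the three toy paragraphs, the question gets seven extra copies and the semicolon sentence five, so the pool grows from 3 to 15 rows.
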

### Dataset Splits

| Split | Sequences | Share |
|---|---:|---:|
| Train (capped) | 1,000,000 | 85% |
| Validation | 40,000 | 10% |
| Test | not reported | 5% |

- **Sliding-window context** (window=3 sentences, stride=2) was applied to training data only
- Validation and test sets remain un-windowed for clean, unbiased evaluation
- Splits were stratified by the rarest punctuation mark in each sequence

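
The windowing and stratification steps above can be sketched as follows. The window and stride values come from the list above; how text is split into sentences is not described in this card, so the sketch takes an already-split list. The function names are hypothetical, and placing the period last in `RARITY_ORDER` is an assumption because the coverage table does not list it.

```python
def sliding_windows(sentences, window=3, stride=2):
    """Join consecutive sentences into overlapping training sequences."""
    if len(sentences) <= window:
        return [" ".join(sentences)] if sentences else []
    windows = []
    for start in range(0, len(sentences), stride):
        windows.append(" ".join(sentences[start:start + window]))
        if start + window >= len(sentences):
            break  # the last window already reaches the final sentence
    return windows


# Marks from rarest to most common, following the raw-corpus coverage
# table; the position of "." is an assumption.
RARITY_ORDER = ["-", "!", "؟", "؛", ":", "،", "."]


def rarest_mark(text):
    """Stratification key: the rarest target mark present, or "O"."""
    for mark in RARITY_ORDER:
        if mark in text:
            return mark
    return "O"


print(sliding_windows(["s1.", "s2.", "s3.", "s4.", "s5."]))
print(rarest_mark("هل وصلت، ام لا؟"))
```

With window=3 and stride=2, neighbouring windows share one sentence, which gives the model some left context for the first sentence of each window.
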

### Preprocessing

- Arabic normalization: alef variants → `ا`, ya variants → `ي`, diacritics stripped
- Each word's label is the punctuation mark **following** that word (`O` when no mark follows)
- Multi-subword words: only the first subword receives the label; others are masked (`-100`)

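
The three preprocessing steps above can be sketched in plain Python. This is an illustration under assumptions, not the original preprocessing code: the exact code points treated as alef and ya variants or as diacritics, and the helper names `normalize`, `words_and_labels`, and `align_labels`, are hypothetical. `word_ids` is the per-token word index list that a Hugging Face fast tokenizer returns from `word_ids()`.

```python
import re

# Assumed character sets; the card does not list the exact code points.
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")
ALEF_VARIANTS = re.compile(r"[\u0622\u0623\u0625\u0671]")
YA_VARIANTS = re.compile(r"[\u0649\u06CC]")
MARKS = ".،؟!:؛-"


def normalize(text):
    """Strip diacritics and unify alef and ya variants."""
    text = DIACRITICS.sub("", text)
    text = ALEF_VARIANTS.sub("\u0627", text)
    return YA_VARIANTS.sub("\u064A", text)


def words_and_labels(text):
    """Split punctuated text into bare words and one label per word."""
    words, labels = [], []
    for token in normalize(text).split():
        word = token.rstrip(MARKS)
        mark = token[len(word):len(word) + 1]  # first trailing mark, if any
        if not word:
            # A free-standing mark labels the previous word.
            if labels and labels[-1] == "O":
                labels[-1] = mark
            continue
        words.append(word)
        labels.append(mark or "O")
    return words, labels


def align_labels(labels, label2id, word_ids):
    """Label the first subword of each word; mask the rest with -100."""
    aligned, previous = [], None
    for word_id in word_ids:
        if word_id is None or word_id == previous:
            aligned.append(-100)
        else:
            aligned.append(label2id[labels[word_id]])
        previous = word_id
    return aligned


words, labels = words_and_labels("من اخترع الهاتف، وفي اي سنة؟")
print(list(zip(words, labels)))
```
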

---

## ⚙️ Model Architecture & Training

| Setting | Value |
|---|---|
| Base model | `xlm-roberta-large` (~560M params) |
| Task | Token classification (8 labels) |
| Max sequence length | 384 tokens |
| Training examples | 1,000,000 (capped) |
| Validation examples | 40,000 |

### Two-Phase Training

| Phase | Epochs | LR | Loss | Notes |
|---|:---:|---|---|---|
| Phase 1 | 2 | 2e-5 | Cross-entropy + label smoothing | Full model fine-tuning |
| Phase 2 | 1 | 6e-6 | Focal loss (γ=2.0) + class weights | Bottom 12 layers frozen |

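
Freezing the bottom 12 encoder layers for Phase 2 could look like the sketch below. It selects parameters by name, relying on the `encoder.layer.<index>.` naming that Hugging Face RoBERTa-style models use. Whether the embeddings were also frozen is not stated in this card, so `freeze_embeddings=True` is an assumption, and both function names are hypothetical.

```python
import re

LAYER_PATTERN = re.compile(r"encoder\.layer\.(\d+)\.")


def should_freeze(param_name, num_frozen_layers=12, freeze_embeddings=True):
    """Decide from a parameter name whether it belongs to a frozen block."""
    if freeze_embeddings and ".embeddings." in param_name:
        return True  # assumption: embeddings are frozen with the bottom layers
    match = LAYER_PATTERN.search(param_name)
    return match is not None and int(match.group(1)) < num_frozen_layers


def freeze_bottom_layers(model, num_frozen_layers=12):
    """Turn off gradients for the selected parameters of a PyTorch model."""
    for name, param in model.named_parameters():
        if should_freeze(name, num_frozen_layers):
            param.requires_grad = False


# Layer indices start at 0, so layers 0..11 are frozen and 12..23 stay trainable.
print(should_freeze("roberta.encoder.layer.11.output.dense.weight"))
print(should_freeze("roberta.encoder.layer.12.output.dense.weight"))
print(should_freeze("classifier.weight"))
```

Calling `freeze_bottom_layers(model)` on the loaded token-classification model before building the Phase 2 optimizer would leave only the top 12 layers and the classifier head trainable.
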

### Class Weights

The weights of the rare classes were additionally boosted by the following factors:

| Class | Boost |
|:---:|---|
| `؟` | ×1.2 |
| `!` | ×3.0 |
| `؛` | ×2.0 |
| `-` | ×1.3 |

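
For a single token, the focal loss with a class weight is `-w * (1 - p_t)^γ * log(p_t)`, where `p_t` is the softmax probability of the gold label. The sketch below works through that formula in plain Python with γ=2.0 and the boost factors from the table above. The base class weights are not given in this card, so they are assumed to be 1.0 here, and the function names are hypothetical.

```python
import math

GAMMA = 2.0
# Boost factors from the table above, applied to an assumed base weight of 1.0.
BOOST = {"؟": 1.2, "!": 3.0, "؛": 2.0, "-": 1.3}


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def focal_loss(logits, target, class_weight=1.0, gamma=GAMMA):
    """Focal loss for one token: -w * (1 - p_t)^gamma * log(p_t)."""
    p_t = softmax(logits)[target]
    return -class_weight * (1.0 - p_t) ** gamma * math.log(p_t)


# Two-class toy logits where index 0 is the gold label.
confident = [2.0, 0.0]
uncertain = [0.1, 0.0]
print(focal_loss(confident, 0))  # easy token: loss is strongly down-weighted
print(focal_loss(uncertain, 0))  # hard token: loss stays comparatively large
print(focal_loss(uncertain, 0, class_weight=BOOST["!"]))  # boosted rare class
```
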

---

## 🚀 Quick Start

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

repo_id = "MostafaMaroof/Naqta"

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForTokenClassification.from_pretrained(repo_id)
model.eval()

id2label = model.config.id2label

text = "بلغت نسبة النمو الاقتصادي 4.7 بالمئة خلال الربع الثالث من عام 2024 وهو اعلى مستوى منذ خمس سنوات"
words = text.split()

# Tokenize pre-split words so every subword can be mapped back to its word.
inputs = tokenizer(
    words,
    is_split_into_words=True,
    return_tensors="pt",
    truncation=True,
    max_length=384,
)

with torch.no_grad():
    logits = model(**inputs).logits

pred_ids = logits.argmax(dim=-1)[0].tolist()
word_ids = inputs.word_ids(batch_index=0)

restored_words = []
previous_word_id = None
for pred_id, word_id in zip(pred_ids, word_ids):
    # Skip special tokens and non-initial subwords:
    # only the first subword of each word carries the label.
    if word_id is None or word_id == previous_word_id:
        continue
    word = words[word_id]
    label = id2label[pred_id]
    if label != "O":
        word = word + label
    restored_words.append(word)
    previous_word_id = word_id

restored_text = " ".join(restored_words)
print(restored_text)
# → بلغت نسبة النمو الاقتصادي 4.7 بالمئة خلال الربع الثالث من عام 2024، وهو اعلى مستوى منذ خمس سنوات.
```

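
Because the Quick Start truncates at 384 tokens, words past that limit are silently dropped from the output. A simple workaround is to split long inputs into chunks and restore each chunk separately, as sketched below. The 150-word chunk size is an assumed safe value, not a recommendation from the model author, and `restore_fn` is a hypothetical callable that wraps the Quick Start logic and maps an unpunctuated string to a punctuated one. Chunk boundaries lose context, so a more careful version would count subword tokens and overlap the chunks.

```python
def chunk_words(words, max_words=150):
    """Split a word list into consecutive chunks of at most max_words words."""
    if max_words < 1:
        raise ValueError("max_words must be positive")
    return [words[i:i + max_words] for i in range(0, len(words), max_words)]


def restore_long_text(text, restore_fn, max_words=150):
    """Apply restore_fn chunk by chunk and join the results."""
    chunks = chunk_words(text.split(), max_words)
    return " ".join(restore_fn(" ".join(chunk)) for chunk in chunks)


# Stand-in for the model call, used only to show the control flow.
def fake_restore(chunk):
    return chunk + "."


print(restore_long_text("a b c d e", fake_restore, max_words=2))
```
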

---

## 📖 Example

**Input** (unpunctuated):
```
اذا اردت ان تنجح في حياتك فعليك ان تحدد اهدافك واضحة وان تعمل بجد واستمرارية ولا تيأس عند اول عقبة تواجهها
```

**Output** (restored):
```
اذا اردت ان تنجح في حياتك، فعليك ان تحدد اهدافك واضحة، وان تعمل بجد واستمرارية، ولا تيأس عند اول عقبة تواجهها.
```

**Question example:**
```
من اخترع الهاتف وفي اي سنة تم ذلك وما هي اهمية هذا الاختراع
```
```
من اخترع الهاتف، وفي اي سنة تم ذلك، وما هي اهمية هذا الاختراع؟
```

---


## 🎯 Intended Use

Naqta is well-suited for:

- 🎙️ **ASR post-processing**: restoring punctuation in Arabic speech transcripts
- 📄 **Readability enhancement**: making raw Arabic text easier to read
- 🔧 **NLP preprocessing**: improving text quality for downstream Arabic NLP tasks
- 🔬 **Research**: benchmark evaluation of Arabic punctuation restoration

---

## ⚠️ Limitations

- Punctuation restoration is partly stylistic; multiple valid outputs may exist for a single input.
- Performance may degrade on highly dialectal, technical, or domain-specific text.
- The model does not predict quotation marks or dialogue markers (`«»`).
- Very short or fragmented text (< 5 words) may produce less reliable results.
- The model only predicts which punctuation mark follows each word; it does not perform grammar correction.

---


## 📜 License

This model is released under the **MIT License**.

---

## 🔗 Citation

If you use Naqta in your work, please cite it as:

```bibtex
@misc{naqta2025,
  title = {Naqta: Arabic Punctuation Restoration with XLM-RoBERTa},
  author = {MostafaMaroof},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/MostafaMaroof/Naqta}
}
```
