@@ -21,96 +21,162 @@ model-index:
       type: custom
     metrics:
     - type: f1
-      value: 0.8176
       name: Validation Macro F1
     - type: accuracy
-      value: 0.9589
       name: Validation Accuracy
 ---
-# Naqta
-**Naqta** is an Arabic punctuation restoration model. It predicts missing punctuation marks in unpunctuated Arabic text using token-level sequence classification.
-The model is designed to restore the following punctuation marks:
-| Label | Meaning |
-|---|---|
-| `O` | No punctuation |
-| `.` | Period |
-| `،` | Arabic comma |
-| `؟` | Arabic question mark |
-| `!` | Exclamation mark |
-| `:` | Colon |
-| `؛` | Arabic semicolon |
-| `-` | Dash |
-## Model Details
-- **Model name:** Naqta
-- **Task:** Arabic punctuation restoration
-- **Architecture:** XLM-RoBERTa Large for token classification
-- **Base model:** `xlm-roberta-large`
-- **Maximum sequence length:** 384 tokens
-- **Training objective:** token-level punctuation classification
-- **Loss:** weighted focal loss during fine-tuning
-- **Focal gamma:** 2.0
-## Training Summary
-Naqta was trained on a mixed Arabic corpus built from multiple sources, including books, Arabic corpora, Wikipedia-style text, and question-answering data. The training pipeline used sliding-window context, class balancing, rare punctuation oversampling, and a two-phase training strategy.
-### Training Strategy
-| Phase | Description |
-|---|---|
-| Phase 1 | General token-classification training for 2 epochs |
-| Phase 2 | Focal-loss fine-tuning for 2 epochs with lower encoder layers frozen |
-### Data Balancing
-The final training setup used stronger sampling for rare punctuation marks:
-- Strong rare marks: `؟`, `!`
-- Light rare marks: `؛`, `-`
-- Sliding-window context was applied to training data only
-- Validation and test data remained unwindowed to avoid leakage
-## Validation Results
-Final best validation result:
 | Metric | Score |
 |---|---:|
-| **Macro F1** | **0.8176** |
-| Accuracy | 0.9589 |
-### Per-Class Validation F1
-| Class | F1 |
-|---|---:|
-| `!` | 0.6512 |
-| `؛` | 0.7180 |
-| `؟` | 0.9066 |
-| `-` | 0.8562 |
-| `،` | 0.7422 |
-| `.` | 0.8030 |
-## Example
-Input:
-```text
-اذا اردت ان تنجح في حياتك فعليك ان تفتح اهدافك واضحة وان تعمل بجد واستمرارية ولا تيأس عند اول عقبة تواجهها
-```
-Possible output:
-```text
-اذا اردت ان تنجح في حياتك، فعليك ان تفتح اهدافك واضحة، وان تعمل بجد واستمرارية، ولا تيأس عند اول عقبة تواجهها.
-```
-## Usage
 ```python
 from transformers import AutoTokenizer, AutoModelForTokenClassification
@@ -146,7 +212,6 @@ previous_word_id = None
 for token_id, word_id in zip(pred_ids, word_ids):
     if word_id is None or word_id == previous_word_id:
         continue
     word = words[word_id]
     label = id2label[token_id]
     if label != "O":
@@ -156,37 +221,70 @@ for token_id, word_id in zip(pred_ids, word_ids):
 restored_text = " ".join(restored_words)
 print(restored_text)
 ```
-## Intended Use
-Naqta can be used for:
-- Restoring punctuation in Arabic ASR transcripts
-- Improving readability of unpunctuated Arabic text
-- Preprocessing Arabic text for downstream NLP tasks
-- Educational or research applications involving Arabic punctuation
-## Limitations
-- Punctuation restoration is partly stylistic, so multiple outputs may be valid.
-- The model may over-insert commas in long literary or formal sentences.
-- Very short or fragmented text may produce less reliable punctuation.
-- Domain-specific text, such as legal, medical, or highly dialectal content, may require additional fine-tuning.
-- The model predicts punctuation after words and does not perform full grammar correction.
-## Training Notes
-The model was optimized to improve rare punctuation classes, especially `!`, `؟`, `؛`, and `-`. The final configuration achieved a validation Macro F1 above 0.81, with especially strong performance on question marks and dashes.
-## License
-This model is released under the MIT License.
-## Citation
-If you use this model, please cite or reference the Hugging Face repository:
-```text
-MostafaMaroof/Naqta
 ```

       type: custom
     metrics:
     - type: f1
+      value: 0.8960
       name: Validation Macro F1
     - type: accuracy
+      value: 0.9714
       name: Validation Accuracy
 ---
+<div align="center">
+# 🔤 Naqta — نقطة
+### Arabic Punctuation Restoration
+[![Model](https://img.shields.io/badge/🤗%20Model-MostafaMaroof%2FNaqta-blue)](https://huggingface.co/MostafaMaroof/Naqta)
+[![Language](https://img.shields.io/badge/Language-Arabic-green)](https://huggingface.co/MostafaMaroof/Naqta)
+[![Task](https://img.shields.io/badge/Task-Token%20Classification-orange)](https://huggingface.co/MostafaMaroof/Naqta)
+[![License](https://img.shields.io/badge/License-MIT-yellow)](https://opensource.org/licenses/MIT)
+[![Macro F1](https://img.shields.io/badge/Macro%20F1-89.6%25-brightgreen)](https://huggingface.co/MostafaMaroof/Naqta)
+</div>
+---
+**Naqta** (Arabic: نقطة, "dot" or "period") is an Arabic punctuation restoration model that reaches a validation macro F1 of 0.8960. Given unpunctuated Arabic text, it predicts which punctuation mark, if any, follows each word, using token-level classification on top of **XLM-RoBERTa Large**.
+> 💡 **Try it live** on the [Hugging Face Space](https://huggingface.co/spaces/MostafaMaroof/Naqta)
+---
+## ✨ What Does It Restore?
+| Symbol | Name | Arabic description |
+|:---:|---|---|
+| `.` | Period | نهاية الجملة |
+| `،` | Arabic comma | فاصلة عربية |
+| `؟` | Arabic question mark | علامة استفهام |
+| `!` | Exclamation mark | علامة تعجب |
+| `:` | Colon | نقطتان |
+| `؛` | Arabic semicolon | فاصلة منقوطة |
+| `-` | Dash | شرطة |
+---
+## 🏆 Results
+### Validation Metrics (v11d — Final)
 | Metric | Score |
 |---|---:|
+| 🎯 **Macro F1** | **0.8960** |
+| ✅ Accuracy | 0.9714 |
+### Per-Class F1 Score
+| Class | Symbol | F1 | Performance |
+|---|:---:|---:|---|
+| Exclamation | `!` | 0.8897 | 🟢 Excellent |
+| Arabic semicolon | `؛` | 0.8042 | 🟢 Excellent |
+| Question mark | `؟` | 0.9665 | 🟢 Excellent |
+| Dash | `-` | 0.9007 | 🟢 Excellent |
+| Arabic comma | `،` | 0.8100 | 🟢 Excellent |
+| Period | `.` | 0.8968 | 🟢 Excellent |
+---
+## 🗂️ Training Data
+The model was trained on a large multi-source Arabic corpus totaling over **1.4 million paragraphs** from six diverse sources, covering a broad range of Arabic writing styles and domains.
+### Corpus Sources
+| Source | Rows | Domain |
+|---|---:|---|
+| **ABC / UNPC** | ~1,020,000 | News & formal Arabic (United Nations Parallel Corpus) |
+| **HF Tashkeel** | ~151,000 | Vocalized Arabic text (diacritized corpus) |
+| **Hindawi E-Books** | ~100,000 | Literary Arabic prose (novels & non-fiction) |
+| **Wikipedia (AR)** | ~98,500 | Encyclopedia articles |
+| **CBT** | ~69,000 | Classical Arabic books & religious texts |
+| **ARCD + XQuAD** | ~2,050 | Arabic QA pairs (rich in question marks `؟`) |
+| **Total (raw)** | **~1,441,000** | — |
+> All paragraphs were filtered to contain at least one Arabic letter and one target punctuation mark. Non-target punctuation (e.g., `«»`, `…`, parentheses) was removed before training.
+### Punctuation Coverage (raw corpus)
+| Mark | Name | Paragraphs | Coverage |
+|:---:|---|---:|---:|
+| `،` | Arabic comma | 922,721 | 64.0% |
+| `:` | Colon | 230,150 | 16.0% |
+| `؛` | Arabic semicolon | 128,744 | 8.9% |
+| `؟` | Question mark | 50,282 | 3.5% |
+| `!` | Exclamation | 15,976 | 1.1% |
+| `-` | Dash | ~1 | <0.1% |
+### Data Balance Strategy
+To prevent the model from ignoring rare punctuation marks, a targeted oversampling strategy was applied:
+| Strategy | Marks | Multiplier | Cap |
+|---|:---:|:---:|---:|
+| Strong oversampling | `؟` `!` | ×8 | 80,000 rows |
+| Light oversampling | `؛` `-` | ×6 | 80,000 rows |
+After oversampling, the combined training pool grew to **~2.4 million paragraphs**.
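
The oversampling step can be sketched roughly as follows. This is a minimal illustration, not the project's actual pipeline. It assumes the cap limits how many matching paragraphs are selected for duplication, and that the multiplier is the total number of times each selected paragraph appears. The function name `oversample` and the sample sentences are hypothetical.

```python
import random

def oversample(paragraphs, marks, multiplier, cap, seed=0):
    """Return the paragraphs plus extra copies of those containing any of `marks`.

    Assumptions (not confirmed by the model card):
    - `cap` bounds how many distinct matching paragraphs are duplicated;
    - each selected paragraph ends up appearing `multiplier` times in total.
    """
    matching = [p for p in paragraphs if any(m in p for m in marks)]
    if len(matching) > cap:
        # Pick a reproducible random subset when there are too many matches.
        matching = random.Random(seed).sample(matching, cap)
    extra = matching * (multiplier - 1)
    return paragraphs + extra

corpus = ["هل وصلت؟", "نعم وصلت.", "يا له من يوم!", "ذهبنا، ثم عدنا."]
pool = oversample(corpus, marks=("؟", "!"), multiplier=8, cap=80_000)
print(len(pool))
```

In this toy corpus two of the four paragraphs contain a strongly oversampled mark, so each of them is repeated while the other two appear once.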
+### Dataset Splits
+| Split | Sequences | Share |
+|---|---:|---:|
+| Train (capped) | 1,000,000 | 85% |
+| Validation | 40,000 | 10% |
+| Test | — | 5% |
+- **Sliding-window context** (window=3 sentences, stride=2) was applied to training data only
+- Validation and test sets remain un-windowed for clean, unbiased evaluation
+- Splits were stratified by the rarest punctuation mark in each sequence
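
A sliding window over sentences with `window=3` and `stride=2` could look like the sketch below. The handling of short inputs and of a trailing remainder is an assumption, since the model card does not describe it, and the function name `sliding_windows` is hypothetical.

```python
def sliding_windows(sentences, window=3, stride=2):
    """Group consecutive sentences into overlapping windows.

    Assumption: a final window is added when the stride would otherwise
    leave the last sentences uncovered, and inputs shorter than `window`
    yield a single shorter window.
    """
    if not sentences:
        return []
    last_start = max(len(sentences) - window, 0)
    starts = list(range(0, last_start + 1, stride))
    if starts[-1] != last_start:
        starts.append(last_start)
    return [sentences[s:s + window] for s in starts]

sentences = ["s1", "s2", "s3", "s4", "s5"]
for chunk in sliding_windows(sentences):
    print(chunk)
```

With five sentences this yields two windows that share the middle sentence, which gives each training sequence some context from its neighbours.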
+### Preprocessing
+- Arabic normalization: alef variants → `ا`, ya variants → `ي`, diacritics stripped
+- Label assigned per word = punctuation mark **following** that word
+- Multi-subword words: only the first subword receives the label; others are masked (`-100`)
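
The three preprocessing steps above can be sketched in plain Python. The exact character sets used for normalization, the label-to-id order, and the helper names (`normalize`, `words_and_labels`, `align_labels`) are assumptions made for illustration. The `word_ids` list is a hand-written stand-in for what a fast tokenizer's `word_ids()` would return.

```python
import re

TARGET_MARKS = ".،؟!:؛-"
# Assumed diacritic range: fathatan through sukun, plus superscript alef.
DIACRITICS = re.compile("[\u064B-\u0652\u0670]")
# Hypothetical id mapping; the real one lives in the model config.
LABEL2ID = {label: i for i, label in
            enumerate(["O", ".", "،", "؟", "!", ":", "؛", "-"])}

def normalize(text):
    """Strip diacritics and unify alef and ya variants."""
    text = DIACRITICS.sub("", text)
    text = re.sub("[أإآٱ]", "ا", text)
    return re.sub("[ىی]", "ي", text)

def words_and_labels(text):
    """Split punctuated text into words and the mark following each word."""
    words, labels = [], []
    for token in normalize(text).split():
        stripped = token.rstrip(TARGET_MARKS)
        mark = token[len(stripped)] if len(stripped) < len(token) else "O"
        if stripped:
            words.append(stripped)
            labels.append(mark)
        elif labels and labels[-1] == "O":
            labels[-1] = mark  # a standalone mark attaches to the previous word
    return words, labels

def align_labels(word_ids, word_labels, label2id):
    """Label only the first subword of each word; mask the rest with -100."""
    aligned, previous = [], None
    for wid in word_ids:
        if wid is None or wid == previous:
            aligned.append(-100)
        else:
            aligned.append(label2id[word_labels[wid]])
        previous = wid
    return aligned

words, labels = words_and_labels("إذا أردتَ النجاح، فاعمل بجد.")
word_ids = [None, 0, 1, 1, 2, 3, 4, None]  # stand-in: word 1 has two subwords
print(words, labels)
print(align_labels(word_ids, labels, LABEL2ID))
```

Masking the non-initial subwords with `-100` keeps them out of the loss, which matches the inference loop that reads only the first subword of each word.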
+---
+## ⚙️ Model Architecture & Training
+| Setting | Value |
+|---|---|
+| Base model | `xlm-roberta-large` (~560M params) |
+| Task | Token classification (8 labels) |
+| Max sequence length | 384 tokens |
+| Training examples | 1,000,000 (capped) |
+| Validation examples | 40,000 |
+### Two-Phase Training
+| Phase | Epochs | LR | Loss | Notes |
+|---|:---:|---|---|---|
+| Phase 1 | 2 | 2e-5 | Cross-entropy + label smoothing | Full model fine-tuning |
+| Phase 2 | 1 | 6e-6 | Focal loss (γ=2.0) + class weights | Bottom 12 layers frozen |
+### Class Weights
+On top of the base class weights, the rare classes received an extra multiplier:
+| Class | Boost |
+|:---:|---|
+| `؟` | ×1.2 |
+| `!` | ×3.0 |
+| `؛` | ×2.0 |
+| `-` | ×1.3 |
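
For a single token, the Phase 2 loss can be sketched as a class-weighted focal loss with γ=2.0. This is a dependency-free illustration of the standard focal loss formula, not the training code itself. The base weight of 1.0 per class, the label order, and the example logits are assumptions.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_focal_loss(logits, target, class_weights, gamma=2.0):
    """Focal loss for one token: -w_t * (1 - p_t)^gamma * log(p_t)."""
    p_t = softmax(logits)[target]
    return -class_weights[target] * (1.0 - p_t) ** gamma * math.log(p_t)

LABELS = ["O", ".", "،", "؟", "!", ":", "؛", "-"]  # assumed label order
BOOST = {"؟": 1.2, "!": 3.0, "؛": 2.0, "-": 1.3}   # boosts from the table above
# Assumption: every class starts from a base weight of 1.0 before boosting.
class_weights = [1.0 * BOOST.get(label, 1.0) for label in LABELS]

logits = [0.2, 0.1, 0.0, 0.3, 2.5, 0.0, 0.1, 0.0]  # made-up scores for one token
loss = weighted_focal_loss(logits, LABELS.index("!"), class_weights)
print(round(loss, 4))
```

The `(1 - p_t)^gamma` factor shrinks the loss for tokens the model already classifies confidently, so gradient signal concentrates on hard and rare cases. Setting `gamma=0` reduces it to weighted cross-entropy.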
+---
+## 🚀 Quick Start
 ```python
 from transformers import AutoTokenizer, AutoModelForTokenClassification
 for token_id, word_id in zip(pred_ids, word_ids):
     if word_id is None or word_id == previous_word_id:
         continue
     word = words[word_id]
     label = id2label[token_id]
     if label != "O":
 restored_text = " ".join(restored_words)
 print(restored_text)
+# → بلغت نسبة النمو الاقتصادي 4.7 بالمئة خلال الربع الثالث من عام 2024، وهو اعلى مستوى منذ خمس سنوات.
 ```
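
The word-level reassembly loop from the snippet above can be tried in isolation, without loading the model. In this sketch `word_ids`, `pred_ids`, and `id2label` are hand-written stand-ins. In real use they would come from the tokenizer's `word_ids()`, the argmax of the model logits, and the model config. The function name `restore` is hypothetical.

```python
def restore(words, word_ids, pred_ids, id2label):
    """Attach the predicted mark to each word, reading only its first subword."""
    restored, previous = [], None
    for token_id, word_id in zip(pred_ids, word_ids):
        # Skip special tokens (None) and non-initial subwords of a word.
        if word_id is None or word_id == previous:
            continue
        previous = word_id
        label = id2label[token_id]
        word = words[word_id]
        restored.append(word + label if label != "O" else word)
    return " ".join(restored)

words = ["من", "اخترع", "الهاتف"]
word_ids = [None, 0, 1, 1, 2, None]   # word 1 is split into two subwords
pred_ids = [0, 0, 0, 3, 3, 0]         # the second subword's prediction is ignored
id2label = {0: "O", 3: "؟"}           # partial, illustrative mapping
print(restore(words, word_ids, pred_ids, id2label))
```

Only the prediction on the first subword of each word is used, mirroring the `-100` masking applied to the other subwords during training.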
+---
+## 📖 Example
+**Input** (unpunctuated):
+```
+اذا اردت ان تنجح في حياتك فعليك ان تحدد اهدافك واضحة وان تعمل بجد واستمرارية ولا تيأس عند اول عقبة تواجهها
+```
+**Output** (restored):
+```
+اذا اردت ان تنجح في حياتك، فعليك ان تحدد اهدافك واضحة، وان تعمل بجد واستمرارية، ولا تيأس عند اول عقبة تواجهها.
+```
+**Question example:**
+```
+من اخترع الهاتف وفي اي سنة تم ذلك وما هي اهمية هذا الاختراع
+```
+```
+من اخترع الهاتف، وفي اي سنة تم ذلك، وما هي اهمية هذا الاختراع؟
+```
+---
+## 🎯 Intended Use
+Naqta is well-suited for:
+- 🎙️ **ASR post-processing**: restoring punctuation in Arabic speech transcripts
+- 📄 **Readability enhancement**: making raw Arabic text easier to read
+- 🔧 **NLP preprocessing**: improving text quality for downstream Arabic NLP tasks
+- 🔬 **Research**: benchmarking and evaluating Arabic punctuation restoration
+---
+## ⚠️ Limitations
+- Punctuation restoration is partly stylistic — multiple valid outputs may exist for a single input.
+- Performance may degrade on highly dialectal, technical, or domain-specific text.
+- The model does not predict quotation marks or dialogue markers (`«»`).
+- Very short or fragmented text (< 5 words) may produce less reliable results.
+- The model only predicts the punctuation mark that follows each word; it does not perform grammar correction.
+---
+## 📜 License
+This model is released under the **MIT License**.
+---
+## 🔗 Citation
+If you use Naqta in your work, please reference:
+```bibtex
+@misc{naqta2025,
+  title     = {Naqta: Arabic Punctuation Restoration with XLM-RoBERTa},
+  author    = {MostafaMaroof},
+  year      = {2025},
+  publisher = {Hugging Face},
+  url       = {https://huggingface.co/MostafaMaroof/Naqta}
+}
 ```