---
language:
- ar
license: mit
tags:
- arabic
- punctuation-restoration
- token-classification
- xlm-roberta
- natural-language-processing
pipeline_tag: token-classification
base_model: xlm-roberta-large
model-index:
- name: Naqta
  results:
  - task:
      type: token-classification
      name: Arabic Punctuation Restoration
    dataset:
      name: Mixed Arabic punctuation restoration corpus
      type: custom
    metrics:
    - type: f1
      value: 0.8960
      name: Validation Macro F1
    - type: accuracy
      value: 0.9714
      name: Validation Accuracy
---
# 🔤 Naqta (نقطة)

### Arabic Punctuation Restoration

[![Model](https://img.shields.io/badge/🤗%20Model-MostafaMaroof%2FNaqta-blue)](https://huggingface.co/MostafaMaroof/Naqta)
[![Language](https://img.shields.io/badge/Language-Arabic-green)](https://huggingface.co/MostafaMaroof/Naqta)
[![Task](https://img.shields.io/badge/Task-Token%20Classification-orange)](https://huggingface.co/MostafaMaroof/Naqta)
[![License](https://img.shields.io/badge/License-MIT-yellow)](https://opensource.org/licenses/MIT)
[![Macro F1](https://img.shields.io/badge/Macro%20F1-89.6%25-brightgreen)](https://huggingface.co/MostafaMaroof/Naqta)


---

**Naqta** (Arabic: نقطة, "dot/period") is an Arabic punctuation restoration model. Given plain, unpunctuated Arabic text, it predicts the punctuation mark (if any) that follows each word, using token classification on top of **XLM-RoBERTa Large**.

> 💡 **Try it live** on the [Hugging Face Space](https://huggingface.co/spaces/MostafaMaroof/Naqta)

---

## ✨ What Does It Restore?

| Symbol | Name | Arabic description |
|:---:|---|---|
| `.` | Period | نهاية الجملة |
| `،` | Arabic comma | فاصلة عربية |
| `؟` | Arabic question mark | علامة استفهام |
| `!` | Exclamation mark | علامة تعجب |
| `:` | Colon | نقطتان |
| `؛` | Arabic semicolon | فاصلة منقوطة |
| `-` | Dash | شرطة |

---

## 🏆 Results

### Validation Metrics (v11d, final)

| Metric | Score |
|---|---:|
| 🎯 **Macro F1** | **0.8960** |
| ✅ Accuracy | 0.9714 |

### Per-Class F1 Score

| Class | Symbol | F1 | Performance |
|---|:---:|---:|---|
| Exclamation | `!` | 0.8897 | 🟢 Excellent |
| Arabic semicolon | `؛` | 0.8042 | 🟢 Excellent |
| Question mark | `؟` | 0.9665 | 🟢 Excellent |
| Dash | `-` | 0.9007 | 🟢 Excellent |
| Arabic comma | `،` | 0.8100 | 🟢 Excellent |
| Period | `.` | 0.8968 | 🟢 Excellent |

---

## 🗂️ Training Data

The model was trained on a large multi-source Arabic corpus totaling over **1.4 million paragraphs** from six diverse sources, covering a broad range of Arabic writing styles and domains.


### Corpus Sources

| Source | Rows | Domain |
|---|---:|---|
| **ABC / UNPC** | ~1,020,000 | News & formal Arabic (United Nations Parallel Corpus) |
| **HF Tashkeel** | ~151,000 | Vocalized Arabic text (diacritized corpus) |
| **Hindawi E-Books** | ~100,000 | Literary Arabic prose (novels & non-fiction) |
| **Wikipedia (AR)** | ~98,500 | Encyclopedia articles |
| **CBT** | ~69,000 | Classical Arabic books & religious texts |
| **ARCD + XQuAD** | ~2,050 | Arabic QA pairs (rich in question marks `؟`) |
| **Total (raw)** | **~1,441,000** | n/a |

> All paragraphs were filtered to contain at least one Arabic letter and one target punctuation mark. Non-target punctuation (e.g., `«»`, `…`, parentheses) was removed before training.

### Punctuation Coverage (raw corpus)

| Mark | Name | Paragraphs | Coverage |
|:---:|---|---:|---:|
| `،` | Arabic comma | 922,721 | 64.0% |
| `:` | Colon | 230,150 | 16.0% |
| `؛` | Arabic semicolon | 128,744 | 8.9% |
| `؟` | Question mark | 50,282 | 3.5% |
| `!` | Exclamation | 15,976 | 1.1% |
| `-` | Dash | ~1 | <0.1% |

### Data Balance Strategy

To prevent the model from ignoring rare punctuation marks, a targeted oversampling strategy was applied:

| Strategy | Marks | Multiplier | Cap |
|---|:---:|:---:|---:|
| Strong oversampling | `؟` `!` | ×8 | 80,000 rows |
| Light oversampling | `؛` `-` | ×6 | 80,000 rows |

After oversampling, the combined training pool grew to **~2.4 million paragraphs**.

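
The oversampling table above can be illustrated with a small sketch. The card does not spell out how the multiplier and cap interact, so this example makes two assumptions: the cap limits how many source paragraphs are selected per group, and each selected paragraph ends up appearing `multiplier` times in the pool. The function name `oversample` and the `OVERSAMPLING_RULES` structure are hypothetical and are not taken from the actual training code.

```python
import random

# Assumed reading of the table: (marks, multiplier, cap on selected source rows).
OVERSAMPLING_RULES = [
    (("؟", "!"), 8, 80_000),  # strong oversampling
    (("؛", "-"), 6, 80_000),  # light oversampling
]


def oversample(paragraphs, rules=OVERSAMPLING_RULES, seed=0):
    """Return the original paragraphs plus extra copies of rare-mark paragraphs."""
    rng = random.Random(seed)
    pool = list(paragraphs)
    for marks, multiplier, cap in rules:
        # Paragraphs that contain at least one mark from this group.
        matches = [p for p in paragraphs if any(m in p for m in marks)]
        # Assumption: keep at most `cap` source rows, chosen at random.
        if len(matches) > cap:
            matches = rng.sample(matches, cap)
        # Add (multiplier - 1) extra copies so each kept row appears `multiplier` times.
        pool.extend(matches * (multiplier - 1))
    rng.shuffle(pool)
    return pool
```

Calling `oversample(paragraphs)` on a list of punctuated paragraph strings returns a shuffled pool in which paragraphs with rare marks are repeated. A paragraph that matches both groups receives copies from each rule, which may or may not match the original pipeline.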

### Dataset Splits

| Split | Sequences | Share |
|---|---:|---:|
| Train (capped) | 1,000,000 | 85% |
| Validation | 40,000 | 10% |
| Test | not stated | 5% |

- **Sliding-window context** (window=3 sentences, stride=2) was applied to the training data only.
- Validation and test sets remain un-windowed for clean, unbiased evaluation.
- Splits were stratified by the rarest punctuation mark in each sequence.

### Preprocessing

- Arabic normalization: alef variants → `ا`, ya variants → `ي`, diacritics stripped.
- Each word is labeled with the punctuation mark that **follows** it.
- Multi-subword words: only the first subword receives the label; the others are masked (`-100`).

---

## ⚙️ Model Architecture & Training

| Setting | Value |
|---|---|
| Base model | `xlm-roberta-large` (~560M params) |
| Task | Token classification (8 labels) |
| Max sequence length | 384 tokens |
| Training examples | 1,000,000 (capped) |
| Validation examples | 40,000 |

### Two-Phase Training

| Phase | Epochs | LR | Loss | Notes |
|---|:---:|---|---|---|
| Phase 1 | 2 | 2e-5 | Cross-entropy + label smoothing | Full model fine-tuning |
| Phase 2 | 1 | 6e-6 | Focal loss (γ=2.0) + class weights | Bottom 12 layers frozen |

### Class Weights

The class weights of the rare marks were additionally boosted:

| Class | Boost |
|:---:|---|
| `؟` | ×1.2 |
| `!` | ×3.0 |
| `؛` | ×2.0 |
| `-` | ×1.3 |

---

## 🚀 Quick Start

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

repo_id = "MostafaMaroof/Naqta"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForTokenClassification.from_pretrained(repo_id)
model.eval()
id2label = model.config.id2label

text = "بلغت نسبة النمو الاقتصادي 4.7 بالمئة خلال الربع الثالث من عام 2024 وهو اعلى مستوى منذ خمس سنوات"
words = text.split()

# Tokenize pre-split words so every subword can be mapped back to its word.
# Words past the 384-token limit are dropped by truncation.
inputs = tokenizer(
    words,
    is_split_into_words=True,
    return_tensors="pt",
    truncation=True,
    max_length=384,
)

with torch.no_grad():
    logits = model(**inputs).logits

# One predicted label id per subword token.
pred_ids = logits.argmax(dim=-1)[0].tolist()
word_ids = inputs.word_ids(batch_index=0)

restored_words = []
previous_word_id = None
for label_id, word_id in zip(pred_ids, word_ids):
    # Skip special tokens and all but the first subword of each word.
    if word_id is None or word_id == previous_word_id:
        continue
    word = words[word_id]
    label = id2label[label_id]
    # "O" means no punctuation follows this word.
    if label != "O":
        word = word + label
    restored_words.append(word)
    previous_word_id = word_id

restored_text = " ".join(restored_words)
print(restored_text)
# → بلغت نسبة النمو الاقتصادي 4.7 بالمئة خلال الربع الثالث من عام 2024، وهو اعلى مستوى منذ خمس سنوات.
```

---

## 📖 Example

**Input** (unpunctuated):

```
اذا اردت ان تنجح في حياتك فعليك ان تحدد اهدافك واضحة وان تعمل بجد واستمرارية ولا تيأس عند اول عقبة تواجهها
```

**Output** (restored):

```
اذا اردت ان تنجح في حياتك، فعليك ان تحدد اهدافك واضحة، وان تعمل بجد واستمرارية، ولا تيأس عند اول عقبة تواجهها.
```

**Question example:**

```
من اخترع الهاتف وفي اي سنة تم ذلك وما هي اهمية هذا الاختراع
```

```
من اخترع الهاتف، وفي اي سنة تم ذلك، وما هي اهمية هذا الاختراع؟
```

---

## 🎯 Intended Use

Naqta is well-suited for:

- 🎙️ **ASR post-processing**: restoring punctuation in Arabic speech transcripts
- 📄 **Readability enhancement**: making raw Arabic text easier to read
- 🔧 **NLP preprocessing**: improving text quality for downstream Arabic NLP tasks
- 🔬 **Research**: Arabic punctuation restoration benchmark evaluation

---

## ⚠️ Limitations

- Punctuation restoration is partly stylistic: multiple valid outputs may exist for a single input.
- Performance may degrade on highly dialectal, technical, or domain-specific text.
- The model does not predict quotation marks or dialogue markers (`«»`).
- Very short or fragmented text (< 5 words) may produce less reliable results.
- The model only inserts punctuation marks; it does not perform grammar correction.

---

## 📜 License

This model is released under the **MIT License**.


---

## 🔗 Citation

If you use Naqta in your work, please reference:

```bibtex
@misc{naqta2025,
  title     = {Naqta: Arabic Punctuation Restoration with XLM-RoBERTa},
  author    = {MostafaMaroof},
  year      = {2025},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/MostafaMaroof/Naqta}
}
```