Arabic Punctuation Restoration Model (BiLSTM)
This is a Bidirectional LSTM (BiLSTM) model designed to restore punctuation marks in raw Arabic text. It takes unpunctuated Arabic text as input and inserts the appropriate punctuation marks.
Model Details
- Architecture: BiLSTM (2 Layers, Hidden Dim 256)
- Embeddings: AraVec (Twitter-CBOW 300d)
- Vocabulary Size: ~50k words
- Input: Raw Arabic text (with or without diacritics)
- Output: Text with restored punctuation marks
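
The details above can be pictured as a small token-classification network. The following is a minimal sketch under stated assumptions, not the repository's actual code: the class name `BiLSTMPunctuator`, its argument names, and the padding index are hypothetical; the hidden size of 256 is assumed to be per direction; and the seven output classes are taken from the label table in this card. In practice the embedding layer would be initialized from the AraVec vectors rather than at random.

```python
import torch
import torch.nn as nn


class BiLSTMPunctuator(nn.Module):
    """Hypothetical sketch of a 2-layer BiLSTM tagger (names are assumptions)."""

    def __init__(self, vocab_size=50_000, embed_dim=300, hidden_dim=256,
                 num_layers=2, num_classes=7, pad_idx=0):
        super().__init__()
        # 300-d word embeddings (AraVec weights would be copied in here)
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=pad_idx)
        # Two stacked bidirectional LSTM layers, 256 hidden units per direction
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=num_layers,
                            bidirectional=True, batch_first=True)
        # One score per punctuation class for every input word
        self.classifier = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)   # (batch, seq_len, 300)
        outputs, _ = self.lstm(embedded)       # (batch, seq_len, 512)
        return self.classifier(outputs)        # (batch, seq_len, 7)


model = BiLSTMPunctuator()
dummy_ids = torch.randint(1, 50_000, (2, 12))  # 2 sentences, 12 word IDs each
logits = model(dummy_ids)
print(logits.shape)  # torch.Size([2, 12, 7])
```

Taking the `argmax` over the last dimension would give one label ID per word.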
Supported Punctuation Marks
The model predicts the following punctuation marks:
| ID | Mark | Name |
|---|---|---|
| 0 | (None) | No Punctuation |
| 1 | ؟ | Arabic Question Mark |
| 2 | ، | Arabic Comma |
| 3 | : | Colon |
| 4 | ؛ | Arabic Semicolon |
| 5 | ! | Exclamation Mark |
| 6 | . | Period / Full Stop |
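
To illustrate how these IDs turn back into text, here is a minimal sketch that appends a mark to each word according to its predicted label. It assumes the model emits one label per word and that the mark follows the word it is attached to; `ID_TO_MARK` and `apply_labels` are hypothetical names for illustration, not part of the repository. ID 1 is mapped to the Arabic question mark (؟) named in the table.

```python
# ID-to-mark mapping based on the table above; 0 means "no punctuation".
ID_TO_MARK = {0: "", 1: "؟", 2: "،", 3: ":", 4: "؛", 5: "!", 6: "."}


def apply_labels(words, label_ids):
    """Append the predicted mark (if any) to each word and rejoin the text."""
    if len(words) != len(label_ids):
        raise ValueError("expected exactly one label per word")
    return " ".join(word + ID_TO_MARK[label]
                    for word, label in zip(words, label_ids))


words = "هل تساءلت يوما عن معنى الحياة".split()
labels = [0, 0, 0, 0, 0, 1]  # question mark after the last word
print(apply_labels(words, labels))
```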
How to Use
Because this is a custom PyTorch model, you must load both the model definition and the vocabulary before running inference.
Method 1: Using the Inference Script (Recommended)
The simplest route is to download the inference.py file from this repository and import it as a module.
from huggingface_hub import hf_hub_download
import importlib.util
# 1. Download the script
script_path = hf_hub_download(repo_id="malkhuzanie/arabic-punctuation-checkpoints", filename="inference.py")
# 2. Import the downloaded script as a Python module
spec = importlib.util.spec_from_file_location("inference", script_path)
inference = importlib.util.module_from_spec(spec)
spec.loader.exec_module(inference)
# 3. Initialize and Predict
model = inference.PunctuationRestorer()
text = "هل تساءلت يوما عن معنى الحياة ما هي الأسئلة التي تشغل بالك"
print(model.predict(text))
# Output: هل تساءلت يوماً عن معنى الحياة؟ ما هي الأسئلة التي تشغل بالك؟
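
To try the model on text that already contains punctuation, for example to compare its output against the original, the existing marks can be removed first. The following is a minimal sketch; `strip_punctuation` is a hypothetical helper and is not part of inference.py. It removes the six marks listed in this card plus their ASCII counterparts.

```python
import re

# The six marks from the table, plus the ASCII "?", "," and ";" variants.
# Inside a character class, "." and "?" are matched literally.
PUNCT_PATTERN = re.compile(r"[؟،؛:!.?,;]")


def strip_punctuation(text):
    """Replace punctuation marks with spaces, then collapse whitespace."""
    without_marks = PUNCT_PATTERN.sub(" ", text)
    return " ".join(without_marks.split())


reference = "هل تساءلت يوما عن معنى الحياة؟ ما هي الأسئلة التي تشغل بالك؟"
raw = strip_punctuation(reference)
print(raw)
```

The resulting `raw` string can then be passed to `model.predict` as in the example above.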