Rawi V3 — Arabic Diacritizer (two-head gated)
A lightweight Arabic diacritizer (tashkeel) that internalizes the gate into a
single network. Successor to TigreGotico/rawi-v2.
Collaboration
- TigreGotico — collected and aggregated the training corpus and authored the training notebook.
- Mike Hansen — trained the model.
Architecture
A shared bidirectional LSTM encoder feeds two heads:
- presence head (Linear(512 → 1), sigmoid): decides where a mark goes (the gate);
- value head (Linear(512 → 75), argmax): decides which mark, given one.
Embedding(236, 128) → 2-layer BiLSTM(256) → {presence, value}. At inference a
mark is applied to a letter only where sigmoid(presence) > 0.5, then the value
head's class is attached. This is the single-network form of the two-model
"gate + value" ensemble (rawi-v2 gating rawi-v1) — one pass instead of two.
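The gating rule above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the code shipped with the model: the function names, the `idx_to_diac` table and the three-class toy logits are hypothetical (the real value head has 75 classes), and the presence head's output is assumed to be a raw logit that still needs the sigmoid.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def apply_gate(letters, presence_logits, value_logits, idx_to_diac, threshold=0.5):
    """Attach the value head's argmax class only where the presence gate fires."""
    out = []
    for ch, p_logit, v_row in zip(letters, presence_logits, value_logits):
        out.append(ch)
        if sigmoid(p_logit) > threshold:
            # argmax over this position's value logits
            best = max(range(len(v_row)), key=v_row.__getitem__)
            out.append(idx_to_diac[best])
        # otherwise the value head's prediction is ignored for this letter
    return "".join(out)


# Hypothetical toy vocabulary: fatha, kasra, sukun.
idx_to_diac = ["\u064E", "\u0650", "\u0652"]
letters = "\u0628\u0633\u0645"            # ba, sin, mim
presence_logits = [2.0, 1.5, -3.0]        # gate fires on the first two letters only
value_logits = [[0.1, 2.0, 0.0], [0.0, 0.2, 3.0], [5.0, 0.0, 0.0]]
result = apply_gate(letters, presence_logits, value_logits, idx_to_diac)
```

In this toy run the third letter stays bare even though its value logits strongly prefer fatha, which is the point of the gate: the value head only answers "which mark", never "whether".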
Normalization is NFD with symbols (So) dropped; the value head's 75 NFD-based
classes restore hamzas and the superscript alef. Vocabulary is identical to V2.
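A minimal sketch of the normalization described above, using only the standard library. The function name `normalize_input` is hypothetical, and the project's actual preprocessing may include further steps not listed here.

```python
import unicodedata


def normalize_input(text: str) -> str:
    """NFD-decompose the text, then drop characters in Unicode category So."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "So")


# U+0623 (alef with hamza above) decomposes to U+0627 + U+0654 under NFD;
# U+00A9 (copyright sign) has category So and is removed.
sample = normalize_input("\u0623\u00A9")
```

Because NFD splits a precomposed hamza letter into a bare alef plus a combining hamza, the combining mark is lost unless something puts it back, which is consistent with the value head's classes restoring hamzas.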
Files
| file | what |
|---|---|
| rawi_v3.onnx | fp32 ONNX with two outputs: presence (B,T) and value (B,T,75) |
| rawi_v3.int8.onnx | INT8 dynamic quantization (~2.5 MB) |
| rawi_v3.vocab.json | char_to_idx / diac_to_idx (shared with V2) |
| diacritization_model_lstm_3.pth | original PyTorch checkpoint |
| rawi_v3_train.ipynb | training notebook (two-stage model) |
Usage
from text2tashkeel import Diacritizer # pip install text2tashkeel
Diacritizer("rawi-v3").diacritize("بسم الله الرحمن الرحيم")
# بِسْمِ اللَّهِ الرَّحْمَنِ الرَّحِيمِ
text2tashkeel bundles this model and runs it in pure onnxruntime (no PyTorch at
inference); full benchmarks and the where/which gating analysis are in that repo.