
Rawi V3 — Arabic Diacritizer (two-head gated)

A lightweight Arabic diacritizer (tashkeel) that folds the gate into the same network as the mark classifier, so no separate gating model is needed. Successor to TigreGotico/rawi-v2.

Collaboration

  • TigreGotico — collected and aggregated the training corpus and authored the training notebook.
  • Mike Hansen — trained the model.

Architecture

A shared bidirectional LSTM encoder feeds two heads:

  • presence head (Linear(512 → 1), sigmoid) — decides where a mark goes (the gate);
  • value head (Linear(512 → 75), argmax) — decides which mark, given one.

The full stack is Embedding(236, 128) → 2-layer BiLSTM(256) → {presence, value}; the BiLSTM's 256 units per direction concatenate into the 512-dimensional input of both heads. At inference, a mark is applied to a letter only where sigmoid(presence) > 0.5, and the class predicted by the value head is attached at that position. This is the single-network form of the two-model "gate + value" ensemble (rawi-v2 gating rawi-v1): one pass instead of two.
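
The gating rule can be sketched in plain Python. This is a minimal illustration, not the project's inference code: it assumes the presence output is a raw logit per position and the value output is a list of per-class scores, and the names sigmoid and gated_decode are hypothetical helpers introduced here.

```python
import math
from typing import List, Optional


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def gated_decode(
    presence_logits: List[float],
    value_logits: List[List[float]],
    threshold: float = 0.5,
) -> List[Optional[int]]:
    """Return one class index per position, or None where the gate is closed."""
    if len(presence_logits) != len(value_logits):
        raise ValueError("presence and value must cover the same positions")
    decoded: List[Optional[int]] = []
    for gate_logit, class_scores in zip(presence_logits, value_logits):
        if sigmoid(gate_logit) > threshold:
            # Gate open: take the highest-scoring class from the value head.
            best = max(range(len(class_scores)), key=class_scores.__getitem__)
            decoded.append(best)
        else:
            # Gate closed: no mark, whatever the value head predicts.
            decoded.append(None)
    return decoded


# Toy example with 3 positions and 4 classes (the real value head has 75).
presence = [2.0, -1.5, 0.3]
value = [
    [0.1, 3.0, 0.2, 0.0],
    [5.0, 0.0, 0.0, 0.0],
    [0.0, 0.1, 0.2, 0.9],
]
result = gated_decode(presence, value)
print(result)  # [1, None, 3]
```

Since sigmoid(x) > 0.5 exactly when x > 0, the same decision could be made by testing the sign of the logit; the explicit sigmoid is kept here to mirror the description above.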

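Input text is normalized to Unicode NFD, and characters in the symbol category So (Symbol, other) are dropped. NFD splits hamza-carrying letters into a base letter plus a combining hamza, so the value head's 75 NFD-based classes are what restore hamzas and the superscript alef in the output. Vocabulary is identical to V2.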
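
The normalization step can be approximated with the standard library. This is a sketch under stated assumptions: the function name normalize_for_model is hypothetical, and the order of operations (decompose first, then filter) is a guess rather than a confirmed detail of the training pipeline.

```python
import unicodedata


def normalize_for_model(text: str) -> str:
    """Decompose to NFD, then drop characters in category So (Symbol, other)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "So")


# ALEF WITH HAMZA ABOVE (U+0623) decomposes to ALEF (U+0627) + HAMZA ABOVE (U+0654).
alef_hamza = normalize_for_model("\u0623")
# The copyright sign (U+00A9) is category So and is removed; BEH (U+0628) is kept.
cleaned = normalize_for_model("\u00a9 \u0628")
```

After this step the hamza is a separate combining character, which is why it can be treated as something the value head predicts rather than as part of the input letter.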

Files

file                              what
rawi_v3.onnx                      fp32 ONNX with two outputs: presence (B,T) and value (B,T,75)
rawi_v3.int8.onnx                 INT8 dynamic quantization (~2.5 MB)
rawi_v3.vocab.json                char_to_idx / diac_to_idx (shared with V2)
diacritization_model_lstm_3.pth   original PyTorch checkpoint
rawi_v3_train.ipynb               training notebook (two-stage model)
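
For readers working with the raw files instead of the text2tashkeel wrapper, the vocabulary could be used roughly as follows. This sketch assumes the JSON holds two top-level maps named char_to_idx and diac_to_idx, that unknown characters map to a fallback index, and that each value class corresponds to one mark string; the helper names and the toy indices are hypothetical.

```python
import json
from typing import Dict, List, Optional


def load_vocab(path: str) -> Dict[str, Dict[str, int]]:
    """Read the vocab JSON; assumed to hold 'char_to_idx' and 'diac_to_idx' maps."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def encode(text: str, char_to_idx: Dict[str, int], unk: int = 0) -> List[int]:
    """Map each character to its index; unknown characters fall back to `unk`."""
    return [char_to_idx.get(ch, unk) for ch in text]


def attach_marks(
    text: str,
    classes: List[Optional[int]],
    diac_to_idx: Dict[str, int],
) -> str:
    """Append the mark string for each non-None class after its base character."""
    idx_to_diac = {idx: diac for diac, idx in diac_to_idx.items()}
    pieces = []
    for ch, cls in zip(text, classes):
        pieces.append(ch)
        if cls is not None:
            pieces.append(idx_to_diac[cls])
    return "".join(pieces)


# Toy vocabulary standing in for rawi_v3.vocab.json (hypothetical indices).
toy = {
    "char_to_idx": {"<unk>": 0, "\u0628": 1, "\u0633": 2, "\u0645": 3},
    "diac_to_idx": {"\u0650": 0, "\u0652": 1},  # kasra, sukun
}
ids = encode("\u0628\u0633\u0645", toy["char_to_idx"])
marked = attach_marks("\u0628\u0633\u0645", [0, 1, 0], toy["diac_to_idx"])
```

A None entry in classes stands for a position where no mark is applied, so the base character is emitted unchanged.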

Usage

from text2tashkeel import Diacritizer        # pip install text2tashkeel
Diacritizer("rawi-v3").diacritize("بسم الله الرحمن الرحيم")
# بِسْمِ اللَّهِ الرَّحْمَنِ الرَّحِيمِ

text2tashkeel bundles this model and runs it in pure onnxruntime (no PyTorch at inference); full benchmarks and the where/which gating analysis are in that repo.

