Rawi V3 — Arabic Diacritizer (two-head gated)

A lightweight Arabic diacritizer (tashkeel) that folds the gate, previously a separate model, into a single network. Successor to TigreGotico/rawi-v2.

Collaboration

TigreGotico — collected and aggregated the training corpus and authored the training notebook.
Mike Hansen — trained the model.

Architecture

A shared bidirectional LSTM encoder feeds two heads:

presence head (Linear(512 → 1), sigmoid) — decides where a mark goes (the gate);
value head (Linear(512 → 75), argmax) — decides which mark, given one.

Embedding(236, 128) → 2-layer BiLSTM(256) → {presence, value}. At inference, a letter receives a mark only where sigmoid(presence) > 0.5; the mark attached is the value head's argmax class. This is the single-network form of the two-model "gate + value" ensemble (rawi-v2 gating rawi-v1), so inference takes one pass instead of two.
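The gating rule above can be sketched in a few lines of plain Python. This is an illustration only, not the code shipped in text2tashkeel: the function name `gated_decode`, the toy scores, and the assumption that `presence` arrives as a raw logit (so the sigmoid still has to be applied) are all hypothetical.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def gated_decode(presence_logits, value_logits, threshold=0.5):
    """Return one class index per position, or None where the gate is closed.

    presence_logits: one float per character (assumed to be raw logits).
    value_logits:    one list of class scores per character.
    """
    decoded = []
    for p_logit, class_scores in zip(presence_logits, value_logits):
        if sigmoid(p_logit) > threshold:
            # Gate open: attach the highest-scoring class (argmax).
            best = max(range(len(class_scores)), key=class_scores.__getitem__)
            decoded.append(best)
        else:
            # Gate closed: the letter stays bare, whatever the value head says.
            decoded.append(None)
    return decoded


# Toy example: 3 positions, 4 classes (the real value head has 75 classes).
presence = [2.0, -1.5, 0.3]
value = [
    [0.1, 0.9, 0.0, 0.2],
    [0.7, 0.1, 0.1, 0.1],
    [0.0, 0.2, 0.1, 1.5],
]
result = gated_decode(presence, value)
```

With the default threshold of 0.5, the test on the sigmoid is the same as checking that the logit is positive. Under this rule the value head is only consulted where the presence head opens the gate.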

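Input text is normalized to Unicode NFD, and characters in the general category So (Symbol, other) are dropped. Because NFD splits precomposed hamza letters into a base letter plus a combining mark, the value head's 75 NFD-based classes restore hamzas and the superscript alef. Vocabulary is identical to V2.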
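A minimal sketch of that normalization step, using only the standard library. The function name `normalize` is hypothetical, and the order of operations (decompose first, then filter) is an assumption; the actual preprocessing in the training notebook may differ in detail.

```python
import unicodedata


def normalize(text):
    """Decompose to NFD, then drop characters in category So (Symbol, other)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "So")


# U+0623 (alef with hamza above) decomposes into alef + combining hamza above,
# which is why the hamza has to be predicted back by the value head.
alef_hamza = normalize("\u0623")

# The copyright sign is category So, so it is removed.
no_symbols = normalize("a\u00a9b")
```

After this step the combining hamza (U+0654) is a separate code point from its base letter, so it can be treated like any other mark attached to a letter.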

Files

| File | Contents |
|---|---|
| `rawi_v3.onnx` | fp32 ONNX, two outputs: `presence` (B,T) and `value` (B,T,75) |
| `rawi_v3.int8.onnx` | INT8 dynamic quantization (~2.5 MB) |
| `rawi_v3.vocab.json` | `char_to_idx` / `diac_to_idx` (shared with V2) |
| `diacritization_model_lstm_3.pth` | original PyTorch checkpoint |
| `rawi_v3_train.ipynb` | training notebook (two-stage model) |
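For anyone working with the raw files rather than through text2tashkeel, the vocabulary could be read roughly as follows. This is a sketch under assumptions: only the two top-level keys `char_to_idx` and `diac_to_idx` come from the listing above; that both are flat string-to-integer mappings is a guess, and the helper names and toy values are hypothetical.

```python
import json


def parse_vocab(vocab):
    """Return (char_to_idx, idx_to_diac) from an already-loaded vocab dict."""
    char_to_idx = {ch: int(i) for ch, i in vocab["char_to_idx"].items()}
    # Invert diac_to_idx so a predicted class index maps back to its mark string.
    idx_to_diac = {int(i): mark for mark, i in vocab["diac_to_idx"].items()}
    return char_to_idx, idx_to_diac


def load_vocab(path):
    """Read a vocabulary file such as rawi_v3.vocab.json from disk."""
    with open(path, encoding="utf-8") as fh:
        return parse_vocab(json.load(fh))


# Toy stand-in for the real file; the indices here are made up.
toy = {
    "char_to_idx": {"\u0628": 5, "\u0633": 6},
    "diac_to_idx": {"\u064e": 1, "\u0650": 2},
}
char_to_idx, idx_to_diac = parse_vocab(toy)
```

The inverted `idx_to_diac` mapping is what turns a class index from the `value` output back into the mark to attach.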

Usage

from text2tashkeel import Diacritizer        # pip install text2tashkeel
Diacritizer("rawi-v3").diacritize("بسم الله الرحمن الرحيم")
# بِسْمِ اللَّهِ الرَّحْمَنِ الرَّحِيمِ

text2tashkeel bundles this model and runs it in pure onnxruntime (no PyTorch at inference); full benchmarks and the where/which gating analysis are in that repo.
