Rawi Ensemble — flagship Arabic diacritizer (single ONNX)
The headline diacritization (tashkeel) model of the
rawi line: a gated ensemble of
rawi-v2 and
rawi-v3 stitched into a single ONNX
graph — one input, one session.run.
Collaboration
- TigreGotico — collected the corpus and authored the training notebooks.
- Mike Hansen — trained the rawi-v2 and rawi-v3 models.
Results
Evaluated on the full 817k-sentence held-out test split of the corpus. Metrics are per-position diacritic error rate (DER), DER over marked positions only (DER*), and word error rate (WER):
| variant | DER | DER* | WER | size |
|---|---|---|---|---|
| rawi_ensemble.onnx (fp32) | 2.03% | 2.93% | 7.47% | 19.5 MB |
| rawi_ensemble.int8.onnx | 2.04% | 2.94% | 7.51% | 4.9 MB |
The best of the rawi line: V1 18.49% → V2 2.29% → V2+V1 2.20% → V2+V3 2.04%. INT8 quantization is near-lossless here (attention-free LSTMs): DER moves from 2.03% to 2.04% and WER from 7.47% to 7.51%. Read the corpus/contamination caveats in the text2tashkeel docs before comparing across datasets.
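To make the three metrics concrete, here is a minimal sketch of how DER, DER* and WER could be computed from per-position class ids. It assumes class 0 means "no mark", that DER* counts only positions where the reference carries a mark, and that a word is wrong if any of its positions is wrong. The function name `diacritization_metrics` and the `word_ids` argument are hypothetical, and the text2tashkeel evaluation code may define the details differently (for example, which positions are counted at all).

```python
def diacritization_metrics(ref, hyp, word_ids):
    """Return (DER, DER*, WER) for one aligned sequence.

    ref, hyp : per-position diacritic class ids (assumption: 0 = no mark)
    word_ids : index of the word each position belongs to
    """
    assert len(ref) == len(hyp) == len(word_ids) > 0
    errors = [r != h for r, h in zip(ref, hyp)]

    # DER: fraction of all positions whose class is wrong.
    der = sum(errors) / len(errors)

    # DER*: same, restricted to positions the reference marks.
    marked = [e for e, r in zip(errors, ref) if r != 0]
    der_star = sum(marked) / len(marked) if marked else 0.0

    # WER: fraction of words containing at least one wrong position.
    bad_words = {w for w, e in zip(word_ids, errors) if e}
    wer = len(bad_words) / len(set(word_ids))
    return der, der_star, wer


# Toy example: two words of three positions each.
ref = [1, 0, 2, 3, 0, 1]
hyp = [1, 0, 2, 2, 1, 1]
word_ids = [0, 0, 0, 1, 1, 1]
der, der_star, wer = diacritization_metrics(ref, hyp, word_ids)
```

In the toy example two of six positions are wrong, one of the four marked positions is wrong, and one of the two words contains an error.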
How it works
Two rawi models share one BiLSTM-style NFD vocabulary, so a single token-id sequence feeds both sub-graphs, and the gating is done inside the graph:
input → [ rawi-v2 → argmax ] ──(≠0 ? mark : no-mark)──┐
                                                      ├→ gated_cls
input → [ rawi-v3 → value head → argmax ] ────────────┘
gated_cls = where(argmax(rawi_v2) == 0, 0, argmax(rawi_v3.value))
- rawi-v2 decides where a mark goes (its mark-presence prediction is the best-calibrated gate of the family);
- rawi-v3's value head decides which mark (the best per-marked-position accuracy of any rawi model, DER* 2.92%).
This is the factored where / which design, developed across rawi V1 → V2 → V3, combined in one pass. Diacritization works on the NFD form and restores hamza and the superscript (dagger) alef in addition to the standard harakāt.
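The gating rule can be reproduced outside the graph with NumPy, which may help when checking the stitched model against the two sub-models. This is a sketch only: the function name `gate`, the `[B, N, C]` logit shapes, and the assumption that both heads index the same class space with 0 as "no mark" are illustrative and not taken from the stitch tool.

```python
import numpy as np


def gate(v2_logits: np.ndarray, v3_value_logits: np.ndarray) -> np.ndarray:
    """Mirror gated_cls = where(argmax(v2) == 0, 0, argmax(v3.value)).

    Both inputs are assumed to be [B, N, C] scores over the same
    diacritic classes, with class 0 meaning "no mark".
    """
    where_cls = v2_logits.argmax(axis=-1)        # [B, N]: does a mark go here?
    which_cls = v3_value_logits.argmax(axis=-1)  # [B, N]: which mark?
    return np.where(where_cls == 0, 0, which_cls).astype(np.int64)


# Toy example: 1 sentence, 3 positions, 3 classes.
v2 = np.array([[[0.9, 0.1, 0.0],    # v2 picks class 0: gate closed
                [0.1, 0.8, 0.1],    # v2 picks class 1: gate open
                [0.2, 0.1, 0.7]]])  # v2 picks class 2: gate open
v3 = np.array([[[0.1, 0.8, 0.1],    # v3 picks class 1, ignored (gate closed)
                [0.1, 0.2, 0.7],    # v3 picks class 2, overrides v2's class 1
                [0.1, 0.1, 0.8]]])  # v3 picks class 2
gated = gate(v2, v3)
print(gated)  # [[0 2 2]]
```

The second position shows the point of the design: rawi-v2 only opens the gate, and the class that is emitted comes from rawi-v3.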
Files
| file | size | what |
|---|---|---|
| rawi_ensemble.onnx | 19.5 MB | fp32 stitched graph |
| rawi_ensemble.int8.onnx | 4.9 MB | INT8 (both sub-models quantized) |
| vocab.json | — | char_to_idx / diac_to_idx (shared by both sub-models) |
I/O. Input input (int64 [B, N]) = NFD-bare char ids from char_to_idx.
Output gated_cls (int64 [B, N]) = one diacritic-class id per position; map
non-zero classes back to diacritics by inverting diac_to_idx and attach each one to its letter. Tokenization
and detokenization stay in host code (ONNX has no Unicode logic).
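The host-side half of that contract might look like the sketch below. It is illustrative rather than the text2tashkeel implementation: `strip_marks`, `encode`, `decode` and `unk_id` are hypothetical names, the tiny vocab stands in for the real vocab.json, and "NFD-bare" is assumed to mean NFD text with every nonspacing mark (Unicode category Mn) removed. The model call itself is omitted; with onnxruntime it would be a single `session.run` on the `input` tensor.

```python
import unicodedata


def strip_marks(text: str) -> str:
    """NFD-normalize and drop all nonspacing marks (assumed 'bare' form)."""
    nfd = unicodedata.normalize("NFD", text)
    return "".join(c for c in nfd if unicodedata.category(c) != "Mn")


def encode(text, char_to_idx, unk_id=0):
    """Return the bare text and its id sequence (unk_id is an assumption)."""
    bare = strip_marks(text)
    return bare, [char_to_idx.get(c, unk_id) for c in bare]


def decode(bare, class_ids, diac_to_idx):
    """Attach the diacritic for each non-zero class id after its letter."""
    idx_to_diac = {i: d for d, i in diac_to_idx.items()}  # invert the map
    out = []
    for ch, cls in zip(bare, class_ids):
        out.append(ch)
        if cls != 0:  # assumption: class 0 means "no mark"
            out.append(idx_to_diac[cls])
    return "".join(out)


# Toy vocab (NOT the real vocab.json): beh, seen, meem; fatha, kasra, sukun.
char_to_idx = {"\u0628": 1, "\u0633": 2, "\u0645": 3}
diac_to_idx = {"": 0, "\u064E": 1, "\u0650": 2, "\u0652": 3}

bare, ids = encode("\u0628\u0650\u0633\u0652\u0645\u0650", char_to_idx)
# Pretend the model returned these class ids for the three letters.
restored = decode(bare, [2, 3, 2], diac_to_idx)
```

Because `decode` emits decomposed text, a caller that needs precomposed letters can apply `unicodedata.normalize("NFC", ...)` to the result.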
Usage
from text2tashkeel import Diacritizer # pip install text2tashkeel
Diacritizer("rawi-v2+rawi-v3").diacritize("بسم الله الرحمن الرحيم")
# بِسْمِ اللَّهِ الرَّحْمَنِ الرَّحِيمِ
Full benchmarks, the gating analysis, and the stitch tool
(tools/build_ensemble_v2v3_onnx.py) are in the text2tashkeel repo.