Rawi Ensemble — flagship Arabic diacritizer (single ONNX)

The headline diacritization (tashkeel) model of the rawi line: a gated ensemble of rawi-v2 and rawi-v3 stitched into a single ONNX graph — one input, one session.run.

Collaboration

TigreGotico — collected the corpus and authored the training notebooks.
Mike Hansen — trained the rawi-v2 and rawi-v3 models.

Results

Evaluated on the full 817k-sentence held-out test split of the corpus. DER is the per-position diacritic error rate, DER* is the same rate over marked positions only, and WER is the word error rate:

variant	DER	DER*	WER	size
`rawi_ensemble.onnx` (fp32)	2.03%	2.93%	7.47%	19.5 MB
`rawi_ensemble.int8.onnx`	2.04%	2.94%	7.51%	4.9 MB

The best of the rawi line by DER: V1 18.49% → V2 2.29% → V2+V1 2.20% → V2+V3 2.04%. INT8 is effectively lossless here (within 0.01 points of DER and 0.04 points of WER versus fp32; the sub-models are attention-free LSTMs). Read the corpus/contamination caveats in the text2tashkeel docs before comparing across datasets.
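To make the three metrics concrete, here is a minimal sketch of how they could be computed from gold and predicted class ids. The exact definitions used for the table are in the text2tashkeel repo; the reading below is an assumption: DER counts every position, DER* counts only positions whose gold class is non-zero, and WER counts a word as wrong if any of its positions is wrong. The function names are hypothetical.

```python
def der(gold, pred, marked_only=False):
    """Fraction of positions whose predicted class differs from gold.

    With marked_only=True, positions whose gold class is 0 (no mark)
    are skipped. This is an assumed reading of DER*.
    """
    total = 0
    errors = 0
    for g, p in zip(gold, pred):
        if marked_only and g == 0:
            continue
        total += 1
        errors += int(g != p)
    return errors / total if total else 0.0


def wer(gold_words, pred_words):
    """Fraction of words with at least one wrong position.

    Each word is a list of class ids, one per letter.
    """
    if not gold_words:
        return 0.0
    wrong = sum(int(g != p) for g, p in zip(gold_words, pred_words))
    return wrong / len(gold_words)
```

A spurious mark on an unmarked letter raises DER and WER but leaves DER* untouched under this reading, which is why the two rates are reported separately.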

How it works

Two rawi models share one BiLSTM-style NFD vocabulary, so a single token-id sequence feeds both sub-graphs, and the gating is done inside the graph:

input → [ rawi-v2 → argmax ]  ──(≠0 ? mark : no-mark)──┐
                                                        ├→ gated_cls
input → [ rawi-v3 → value head → argmax ] ─────────────┘

gated_cls = where(argmax(rawi_v2) == 0, 0, argmax(rawi_v3.value))

rawi-v2 decides where a mark goes (its mark/no-mark prediction is the best-calibrated gate of the family);
rawi-v3's value head decides which mark (it has the best per-marked-position accuracy of any rawi model, DER* 2.92%).
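The gating rule can be reproduced outside the graph in a few lines. The sketch below is plain Python rather than the ONNX ops the stitched model uses, and it assumes each sub-model yields one list of per-class scores per position; `argmax` and `gate` are hypothetical helper names, not part of text2tashkeel.

```python
def argmax(scores):
    """Index of the largest score in a list."""
    return max(range(len(scores)), key=scores.__getitem__)


def gate(v2_scores, v3_value_scores):
    """Apply gated_cls = where(argmax(rawi_v2) == 0, 0, argmax(rawi_v3.value)).

    Both arguments are lists with one score vector per position.
    Class 0 means "no mark".
    """
    gated = []
    for v2_row, v3_row in zip(v2_scores, v3_value_scores):
        if argmax(v2_row) == 0:
            gated.append(0)               # rawi-v2 says: no mark here
        else:
            gated.append(argmax(v3_row))  # rawi-v3 picks which mark
    return gated
```

Note that rawi-v2's own choice of mark is discarded: only whether its top class is 0 matters.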

The ensemble combines the factored where/which design, developed across rawi V1 → V2 → V3, in a single pass. Diacritization works on the NFD form and restores hamza and the superscript (dagger) alef in addition to the standard harakāt.

Files

file	size	what
`rawi_ensemble.onnx`	19.5 MB	fp32 stitched graph
`rawi_ensemble.int8.onnx`	4.9 MB	INT8 (both sub-models quantized)
`vocab.json`	—	`char_to_idx` / `diac_to_idx` (shared by both sub-models)

I/O. Input `input` (int64 `[B, N]`) = NFD-bare char ids from `char_to_idx`. Output `gated_cls` (int64 `[B, N]`) = one diacritic-class id per position; map each non-zero class back to its mark by inverting `diac_to_idx`, then attach the mark to the letter at that position. Tokenization and detokenization stay in host code (ONNX has no Unicode logic).
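For callers who want to drive the ONNX file directly, the host-side steps might look like the sketch below. Only the tensor names `input` and `gated_cls` and the two vocab tables come from this card. Everything else is an assumption: that "NFD-bare" means NFD text with all combining marks (Unicode category Mn) removed, that unknown characters map to the hypothetical `unk_id`, and that the helper names are free to choose. The text2tashkeel package remains the reference implementation.

```python
import unicodedata


def to_bare(text):
    """NFD-normalize, then drop combining marks (assumed meaning of 'NFD-bare')."""
    nfd = unicodedata.normalize("NFD", text)
    return "".join(c for c in nfd if unicodedata.category(c) != "Mn")


def encode(bare, char_to_idx, unk_id=0):
    """Map bare characters to ids; unk_id is a hypothetical fallback."""
    return [char_to_idx.get(c, unk_id) for c in bare]


def attach(bare, class_ids, diac_to_idx):
    """Re-attach marks: class 0 means no mark, other ids are looked up."""
    idx_to_diac = {i: d for d, i in diac_to_idx.items()}
    out = []
    for ch, cls in zip(bare, class_ids):
        out.append(ch)
        if cls != 0:
            out.append(idx_to_diac[cls])
    return "".join(out)


def run_model(model_path, ids):
    """Run one sentence through the stitched graph (needs numpy + onnxruntime)."""
    import numpy as np
    import onnxruntime as ort

    sess = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])
    batch = np.asarray([ids], dtype=np.int64)           # shape [1, N]
    (gated,) = sess.run(["gated_cls"], {"input": batch})
    return gated[0].tolist()
```

With `vocab.json` loaded into a dict `vocab`, the pieces chain as `bare = to_bare(text)`, `ids = encode(bare, vocab["char_to_idx"])`, `attach(bare, run_model("rawi_ensemble.onnx", ids), vocab["diac_to_idx"])`. The result is in NFD; apply `unicodedata.normalize("NFC", ...)` if composed output is needed.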

Usage

from text2tashkeel import Diacritizer        # pip install text2tashkeel
Diacritizer("rawi-v2+rawi-v3").diacritize("بسم الله الرحمن الرحيم")
# بِسْمِ اللَّهِ الرَّحْمَنِ الرَّحِيمِ

Full benchmarks, the gating analysis, and the stitch tool (tools/build_ensemble_v2v3_onnx.py) are in the text2tashkeel repo.
