
Rawi Ensemble — flagship Arabic diacritizer (single ONNX)

The headline diacritization (tashkeel) model of the rawi line: a gated ensemble of rawi-v2 and rawi-v3 stitched into a single ONNX graph — one input, one session.run.

Collaboration

  • TigreGotico — collected the corpus and authored the training notebooks.
  • Mike Hansen — trained the rawi-v2 and rawi-v3 models.

Results

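Evaluated on the full 817k-sentence held-out test split of the corpus. DER is the per-position diacritic error rate, DER* is the same rate counted over marked positions only, and WER is the word error rate: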

variant                      DER     DER*    WER     size
rawi_ensemble.onnx (fp32)    2.03%   2.93%   7.47%   19.5 MB
rawi_ensemble.int8.onnx      2.04%   2.94%   7.51%   4.9 MB

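The ensemble is the best of the rawi line: V1 18.49% → V2 2.29% → V2+V1 2.20% → V2+V3 2.04%. INT8 quantization is nearly lossless here (DER and DER* rise by 0.01 points and WER by 0.04 against fp32; the sub-models are attention-free LSTMs). Read the corpus/contamination caveats in the text2tashkeel docs before comparing across datasets.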
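As a rough illustration of how the three metrics relate, here is a minimal sketch that computes them from reference and predicted class-id sequences. The exact definitions used by the text2tashkeel benchmark are not spelled out here, so the function names, the treatment of reference class 0 as "no mark", and the word-level rule (a word is wrong if any position is wrong) are assumptions.

```python
from typing import Sequence


def der(ref: Sequence[int], hyp: Sequence[int], marked_only: bool = False) -> float:
    """Fraction of positions whose predicted class differs from the reference.

    With marked_only=True, only positions whose reference class is non-zero
    are counted (the assumed meaning of DER*).
    """
    pairs = [(r, h) for r, h in zip(ref, hyp) if not marked_only or r != 0]
    if not pairs:
        return 0.0
    return sum(r != h for r, h in pairs) / len(pairs)


def wer(ref_words: Sequence[Sequence[int]], hyp_words: Sequence[Sequence[int]]) -> float:
    """Fraction of words with at least one wrong class (assumed WER rule)."""
    if not ref_words:
        return 0.0
    wrong = sum(list(r) != list(h) for r, h in zip(ref_words, hyp_words))
    return wrong / len(ref_words)


# Toy data: five positions, one error on a marked position.
ref = [1, 0, 2, 0, 3]
hyp = [1, 0, 3, 0, 3]
print(der(ref, hyp))                    # all positions
print(der(ref, hyp, marked_only=True))  # marked positions only
print(wer([ref[:2], ref[2:]], [hyp[:2], hyp[2:]]))
```

Because DER* drops the unmarked positions from the denominator, it is usually higher than DER for the same predictions, which matches the ordering in the table.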

How it works

Both rawi models share one BiLSTM-style NFD vocabulary, so a single token-id sequence feeds both sub-graphs, and the gating happens inside the graph:

input → [ rawi-v2 → argmax ]  ──(≠0 ? mark : no-mark)──┐
                                                        ├→ gated_cls
input → [ rawi-v3 → value head → argmax ] ─────────────┘

gated_cls = where(argmax(rawi_v2) == 0, 0, argmax(rawi_v3.value))
  • rawi-v2 decides where a mark goes (its mark-presence prediction is the best-calibrated gate of the family);
  • rawi-v3's value head decides which mark (the best per-marked-position accuracy of any rawi model, DER* 2.92%).
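The gating rule above can be reproduced outside the graph with NumPy. This is a sketch for illustration only: the logit shapes ([B, N, C]) and the toy values are assumptions, and in the shipped model the same step runs inside the ONNX graph.

```python
import numpy as np


def gate(v2_logits: np.ndarray, v3_value_logits: np.ndarray) -> np.ndarray:
    """gated_cls = where(argmax(rawi_v2) == 0, 0, argmax(rawi_v3.value))."""
    where_cls = v2_logits.argmax(axis=-1)        # class 0 means "no mark here"
    which_cls = v3_value_logits.argmax(axis=-1)  # which mark, if any
    return np.where(where_cls == 0, 0, which_cls)


# Toy logits: batch of 1, three positions, three classes (hypothetical values).
v2 = np.array([[[5, 1, 0], [0, 3, 1], [0, 1, 4]]])
v3 = np.array([[[0, 2, 1], [0, 1, 5], [0, 4, 1]]])
print(gate(v2, v3).tolist())  # [[0, 2, 1]]
```

At the first position rawi-v3 would have emitted class 1, but rawi-v2 predicts no mark, so the gate forces class 0. At the other two positions the class comes from rawi-v3 alone.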

This is the factored where / which design, developed across rawi V1 → V2 → V3, combined in one pass. Diacritization works on the NFD form and restores hamza and the superscript (dagger) alef in addition to the standard harakāt.
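Working on the NFD form is what makes hamza restorable: under Unicode canonical decomposition, a letter such as alef with hamza above (U+0623) splits into a bare alef (U+0627) plus a combining hamza (U+0654), so the hamza becomes a mark like any other. The sketch below shows this with the standard library. The helper name nfd_bare and the rule of dropping every combining character are assumptions about what "bare" input means, not the project's own preprocessing code.

```python
import unicodedata


def nfd_bare(text: str) -> str:
    """NFD-decompose, then drop every combining mark.

    Assumed reading of "NFD-bare": harakat, combining hamza and the
    superscript (dagger) alef are all removed, leaving base letters only.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Alef with hamza above decomposes into alef + combining hamza above.
print([hex(ord(c)) for c in unicodedata.normalize("NFD", "\u0623")])  # ['0x627', '0x654']

# Fully diacritized "bismi" reduces to its three base letters.
print(nfd_bare("\u0628\u0650\u0633\u0652\u0645\u0650") == "\u0628\u0633\u0645")  # True
```

Applying unicodedata.normalize("NFC", ...) to the model's diacritized output would recompose such pairs into the precomposed letters where they exist.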

Files

file                       size      what
rawi_ensemble.onnx         19.5 MB   fp32 stitched graph
rawi_ensemble.int8.onnx    4.9 MB    INT8 (both sub-models quantized)
vocab.json                 -         char_to_idx / diac_to_idx (shared by both sub-models)

I/O. The input tensor, named input (int64, shape [B, N]), holds NFD-bare character ids looked up in char_to_idx. The output tensor, named gated_cls (int64, shape [B, N]), holds one diacritic-class id per position; map each non-zero class back to its mark through the inverse of diac_to_idx and attach the mark to the letter at that position. Tokenization and detokenization stay in host code (ONNX has no Unicode logic).
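For callers who want to run the graph directly with onnxruntime instead of going through text2tashkeel, the host-side steps might look like the sketch below. Several details are assumptions: that vocab.json is a JSON object with top-level char_to_idx and diac_to_idx keys, that diac_to_idx maps each mark string to its class id, and that unknown characters can be sent as id 0 (the unk_id parameter is hypothetical). The tensor names input and gated_cls come from the I/O description above.

```python
import json

import numpy as np


def encode(bare_text: str, char_to_idx: dict, unk_id: int = 0) -> np.ndarray:
    """Turn NFD-bare text into an int64 [1, N] id tensor (unk_id is an assumption)."""
    ids = [char_to_idx.get(ch, unk_id) for ch in bare_text]
    return np.array([ids], dtype=np.int64)


def decode(bare_text: str, class_ids, diac_to_idx: dict) -> str:
    """Attach the mark for each non-zero class after its letter."""
    idx_to_diac = {idx: diac for diac, idx in diac_to_idx.items()}
    out = []
    for ch, cls in zip(bare_text, class_ids):
        out.append(ch)
        if int(cls) != 0:  # class 0 means "no mark"
            out.append(idx_to_diac[int(cls)])
    return "".join(out)


def diacritize(bare_text: str,
               model_path: str = "rawi_ensemble.int8.onnx",
               vocab_path: str = "vocab.json") -> str:
    """One session.run over the stitched graph; file layout is assumed."""
    import onnxruntime as ort  # lazy import: the helpers above do not need it

    with open(vocab_path, encoding="utf-8") as f:
        vocab = json.load(f)
    session = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])
    (gated_cls,) = session.run(["gated_cls"],
                               {"input": encode(bare_text, vocab["char_to_idx"])})
    return decode(bare_text, gated_cls[0], vocab["diac_to_idx"])
```

The decoded string is in NFD order (letter, then mark). The Diacritizer class shown under Usage wraps these steps and is the simpler route when the package is available.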

Usage

from text2tashkeel import Diacritizer        # pip install text2tashkeel
Diacritizer("rawi-v2+rawi-v3").diacritize("بسم الله الرحمن الرحيم")
# بِسْمِ اللَّهِ الرَّحْمَنِ الرَّحِيمِ

Full benchmarks, the gating analysis, and the stitch tool (tools/build_ensemble_v2v3_onnx.py) are in the text2tashkeel repo.
