---
license: apache-2.0
datasets:
- TigreGotico/arabic_diacritized_text
language:
- ar
tags:
- arabic
- nlp
- diacritization
- tashkeel
- onnx
---

# Rawi Ensemble: flagship Arabic diacritizer (single ONNX)

The headline diacritization (tashkeel) model of the [rawi](https://huggingface.co/TigreGotico/rawi-v2) line: a **gated ensemble of [rawi-v2](https://huggingface.co/TigreGotico/rawi-v2) and [rawi-v3](https://huggingface.co/TigreGotico/rawi-v3) stitched into a single ONNX graph**, with one input and one `session.run` call.

## Collaboration

- **TigreGotico**: collected the corpus and authored the training notebooks.
- **Mike Hansen**: trained the rawi-v2 and rawi-v3 models.

## Results

Evaluated on the full 817k-sentence held-out test split of the corpus. Metrics are per-position DER, marked-only DER\*, and WER:

| variant | DER | DER\* | WER | size |
|---|---:|---:|---:|---:|
| `rawi_ensemble.onnx` (fp32) | **2.03%** | 2.93% | **7.47%** | 19.5 MB |
| `rawi_ensemble.int8.onnx` | **2.04%** | 2.94% | 7.51% | 4.9 MB |

This is the best DER of the rawi line: V1 18.49% → V2 2.29% → V2+V1 2.20% → **V2+V3 2.04%**. INT8 quantization is effectively lossless here (+0.01 pp DER, +0.04 pp WER), which is expected for attention-free LSTMs. Read the corpus/contamination caveats in the text2tashkeel docs before comparing across datasets.

## How it works

The two rawi models share one BiLSTM-style NFD vocabulary, so a single token-id sequence feeds both sub-graphs, and the gating is done **inside the graph**:

```
input → [ rawi-v2 → argmax ] ──(≠0 ? mark : no-mark)──┐
                                                      ├→ gated_cls
input → [ rawi-v3 → value head → argmax ] ────────────┘

gated_cls = where(argmax(rawi_v2) == 0, 0, argmax(rawi_v3.value))
```

- **rawi-v2** decides **where** a mark goes (its mark-presence prediction is the best-calibrated gate of the family);
- **rawi-v3**'s value head decides **which** mark (the best per-marked-position accuracy of any rawi model, DER\* 2.92%).

This is the factored *where / which* design developed across rawi V1 → V2 → V3, combined in one pass.
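The gating rule can be reproduced outside the graph for illustration. The following is a minimal NumPy sketch, not the stitched ONNX graph itself: the function name `gate`, the logit shapes `[B, N, C]`, and the toy values are assumptions made for the example, and it assumes both sub-models emit class ids in the same id space with class 0 meaning "no mark".

```python
import numpy as np


def gate(v2_logits: np.ndarray, v3_value_logits: np.ndarray) -> np.ndarray:
    """Combine two per-position predictions with the where/which rule.

    Both inputs are assumed to have shape [B, N, C] (hypothetical layout).
    """
    v2_cls = v2_logits.argmax(axis=-1)        # rawi-v2: is there a mark here?
    v3_cls = v3_value_logits.argmax(axis=-1)  # rawi-v3 value head: which mark?
    # Where rawi-v2 predicts class 0 (no mark), force 0; otherwise take
    # rawi-v3's choice of mark.
    return np.where(v2_cls == 0, 0, v3_cls)


# Toy batch: 1 sentence, 3 positions, 3 classes (values are made up).
v2 = np.array([[[0.9, 0.05, 0.05],    # argmax 0 -> gate closed
                [0.1, 0.2, 0.7],      # argmax 2 -> gate open
                [0.2, 0.6, 0.2]]])    # argmax 1 -> gate open
v3 = np.array([[[0.1, 0.8, 0.1],      # argmax 1 (ignored, gate closed)
                [0.2, 0.7, 0.1],      # argmax 1
                [0.1, 0.2, 0.7]]])    # argmax 2
gated = gate(v2, v3)
```

Note that when the gate is open, rawi-v2's own class is discarded: only its mark / no-mark decision is used, and the mark identity comes entirely from rawi-v3.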
Diacritization works on the NFD form and restores hamza and the superscript (dagger) alef in addition to the standard harakāt.

## Files

| file | size | what |
|---|---|---|
| `rawi_ensemble.onnx` | 19.5 MB | fp32 stitched graph |
| `rawi_ensemble.int8.onnx` | 4.9 MB | INT8 (both sub-models quantized) |
| `vocab.json` | - | `char_to_idx` / `diac_to_idx` (shared by both sub-models) |

**I/O.** Input `input` (int64 `[B, N]`) holds the NFD-bare character ids from `char_to_idx`. Output `gated_cls` (int64 `[B, N]`) holds one diacritic-class id per position; map non-zero class ids back to diacritics by inverting `diac_to_idx`, and attach each one to its letter. Tokenization and detokenization stay in host code (ONNX has no Unicode logic).

## Usage

```python
from text2tashkeel import Diacritizer  # pip install text2tashkeel

Diacritizer("rawi-v2+rawi-v3").diacritize("بسم الله الرحمن الرحيم")
# بِسْمِ اللَّهِ الرَّحْمَنِ الرَّحِيمِ
```

Full benchmarks, the gating analysis, and the stitch tool (`tools/build_ensemble_v2v3_onnx.py`) are in the `text2tashkeel` repo.
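For readers who want to call the ONNX file directly rather than through `text2tashkeel`, the host-side steps described under **I/O** might look roughly like the sketch below. It is an illustration under stated assumptions, not the library's implementation: the helper names (`to_bare_nfd`, `encode`, `run_model`, `attach`), the unknown-character id `unk_id`, the use of Unicode category `Mn` to strip marks, and the toy vocabulary are all hypothetical. Only the tensor names `input` and `gated_cls` and the two vocabulary keys come from this card.

```python
import unicodedata


def to_bare_nfd(text: str) -> str:
    """NFD-normalize, then drop combining marks (assumed: category Mn)."""
    nfd = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in nfd if unicodedata.category(ch) != "Mn")


def encode(bare: str, char_to_idx: dict, unk_id: int = 0) -> list:
    """Map characters to ids; unk_id is a hypothetical fallback."""
    return [char_to_idx.get(ch, unk_id) for ch in bare]


def run_model(model_path: str, ids: list) -> list:
    """Run one sentence through the stitched graph (needs onnxruntime)."""
    import numpy as np
    import onnxruntime as ort

    session = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])
    batch = np.asarray([ids], dtype=np.int64)           # shape [1, N]
    (gated_cls,) = session.run(["gated_cls"], {"input": batch})
    return gated_cls[0].tolist()


def attach(bare: str, classes: list, diac_to_idx: dict) -> str:
    """Append each non-zero class's diacritic string after its letter."""
    idx_to_diac = {idx: diac for diac, idx in diac_to_idx.items()}
    out = []
    for ch, cls in zip(bare, classes):
        out.append(ch)
        if cls != 0:                                    # class 0 = no mark
            out.append(idx_to_diac[cls])
    return "".join(out)


# Toy vocabulary (made up; the real one lives in vocab.json).
char_to_idx = {"\u0628": 1, "\u0633": 2, "\u0645": 3}   # beh, seen, meem
diac_to_idx = {"": 0, "\u0650": 1, "\u0652": 2}         # none, kasra, sukun

bare = to_bare_nfd("\u0628\u0650\u0633\u0652\u0645\u0650")  # diacritized "bismi"
ids = encode(bare, char_to_idx)
# Stand-in for run_model(...) output, so the example runs without the model file.
restored = attach(bare, [1, 2, 1], diac_to_idx)
```

With the real files, `run_model("rawi_ensemble.int8.onnx", ids)` would replace the hard-coded class list, and the two dictionaries would be loaded from `vocab.json`.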