---
license: apache-2.0
datasets:
- TigreGotico/arabic_diacritized_text
language:
- ar
tags:
- arabic
- nlp
- diacritization
- tashkeel
- onnx
---

# Rawi Ensemble: flagship Arabic diacritizer (single ONNX)

The headline diacritization (tashkeel) model of the
[rawi](https://huggingface.co/TigreGotico/rawi-v2) line: a **gated ensemble of
[rawi-v2](https://huggingface.co/TigreGotico/rawi-v2) and
[rawi-v3](https://huggingface.co/TigreGotico/rawi-v3) stitched into a single ONNX
graph**, with one input and one `session.run` call.

## Credits

- **TigreGotico**: author of the corpus, the model, and the training notebook.
- **Mike Hansen**: ran the training on his GPUs.

## Results

Evaluated on the full 817k-sentence held-out test split of the corpus. DER is the
per-position diacritic error rate, DER\* is DER over marked positions only, and WER
is the word error rate:

| variant | DER | DER\* | WER | size |
|---|---:|---:|---:|---:|
| `rawi_ensemble.onnx` (fp32) | **2.03%** | 2.93% | **7.47%** | 19.5 MB |
| `rawi_ensemble.int8.onnx` | **2.04%** | 2.94% | 7.51% | 4.9 MB |

The best of the rawi line by DER: V1 18.49% → V2 2.29% → V2+V1 2.20% → **V2+V3 2.04%**.
INT8 is nearly lossless here (+0.01 points DER, +0.04 points WER; both sub-models
are attention-free LSTMs). Read the corpus/contamination caveats
in the text2tashkeel docs before comparing across datasets.
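To make the three columns concrete, here is a minimal sketch of how such metrics could be computed from per-position class ids. It is not the scoring code from `text2tashkeel`. The function name `diacritization_metrics` is hypothetical, and the definitions are an assumed reading of the table: DER counts every position, DER\* counts only positions whose reference class is non-zero, and WER counts a word as wrong if any of its positions is wrong.

```python
from typing import Dict, List


def diacritization_metrics(
    ref_words: List[List[int]], pred_words: List[List[int]]
) -> Dict[str, float]:
    """Compute DER, DER* and WER from per-word lists of class ids.

    Assumed conventions (not confirmed by the model card):
    class 0 means "no mark"; DER* only scores positions whose
    reference class is non-zero.
    """
    positions = wrong = marked = marked_wrong = bad_words = 0
    for ref, pred in zip(ref_words, pred_words):
        if len(ref) != len(pred):
            raise ValueError("reference and prediction must align per word")
        word_ok = True
        for r, p in zip(ref, pred):
            positions += 1
            if r != 0:
                marked += 1
            if r != p:
                wrong += 1
                word_ok = False
                if r != 0:
                    marked_wrong += 1
        if not word_ok:
            bad_words += 1

    def ratio(num: int, den: int) -> float:
        # Avoid ZeroDivisionError on empty input.
        return num / den if den else 0.0

    return {
        "DER": ratio(wrong, positions),
        "DER*": ratio(marked_wrong, marked),
        "WER": ratio(bad_words, len(ref_words)),
    }
```

Under these assumptions a spurious mark on an unmarked letter raises DER and WER but leaves DER\* untouched, which is one reason the two columns can differ.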

## How it works

The two rawi models share one BiLSTM-style NFD vocabulary, so a single token-id
sequence feeds both sub-graphs, and the gating is done **inside the graph**:

```
input → [ rawi-v2 → argmax ] ──(≠0 ? mark : no-mark)──┐
                                                      ├→ gated_cls
input → [ rawi-v3 → value head → argmax ] ────────────┘

gated_cls = where(argmax(rawi_v2) == 0, 0, argmax(rawi_v3.value))
```

- **rawi-v2** decides **where** a mark goes (its mark-presence prediction is the
  best-calibrated gate of the family);
- **rawi-v3**'s value head decides **which** mark (the best per-marked-position
  accuracy of any rawi model, DER\* 2.92%).

This is the factored *where / which* design, developed across rawi V1 → V2 → V3 and
combined here in one pass. Diacritization works on the NFD form and restores hamza and the
superscript (dagger) alef in addition to the standard harakāt.
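The gate itself is small enough to reproduce outside the graph. The sketch below mirrors the `where(...)` expression with NumPy; it is an illustration, not the stitch tool's code. The function name `gate` is hypothetical, and it assumes both sub-models emit logits of shape `[B, N, C]` over class spaces in which id 0 means "no mark".

```python
import numpy as np


def gate(v2_logits: np.ndarray, v3_value_logits: np.ndarray) -> np.ndarray:
    """Combine the two sub-model outputs into one class id per position.

    Assumed shapes: [B, N, C] logits for each model, class 0 = no mark.
    """
    v2_cls = v2_logits.argmax(axis=-1)        # where: is there a mark at all?
    v3_cls = v3_value_logits.argmax(axis=-1)  # which: the mark rawi-v3 prefers
    # Keep rawi-v3's choice only at positions rawi-v2 considers marked.
    return np.where(v2_cls == 0, 0, v3_cls)
```

For example, if rawi-v2 predicts classes `[0, 2, 1]` and rawi-v3's value head predicts `[3, 1, 4]`, the gated output is `[0, 1, 4]`: the first position stays bare because rawi-v2 vetoes it, and the other two take rawi-v3's mark.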

## Files

| file | size | what |
|---|---|---|
| `rawi_ensemble.onnx` | 19.5 MB | fp32 stitched graph |
| `rawi_ensemble.int8.onnx` | 4.9 MB | INT8 (both sub-models quantized) |
| `vocab.json` | - | `char_to_idx` / `diac_to_idx` (shared by both sub-models) |

**I/O.** Input `input` (int64 `[B, N]`) holds the NFD-bare char ids from `char_to_idx`.
Output `gated_cls` (int64 `[B, N]`) holds one diacritic-class id per position; map
non-zero classes back to marks by inverting `diac_to_idx` and attach each mark to its
letter. Tokenization and detokenization stay in host code (ONNX has no Unicode logic).
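For readers calling the graph directly instead of going through `text2tashkeel`, the host-side steps could look roughly like the sketch below. The tensor names `input` and `gated_cls` come from this card; everything else is an assumption made for illustration: the helper names (`strip_marks`, `encode`, `decode`, `diacritize`), the idea that `vocab.json` exposes `char_to_idx` and `diac_to_idx` as top-level keys, and the fallback id 0 for characters missing from the vocabulary.

```python
import unicodedata
from typing import Dict, List


def strip_marks(text: str) -> str:
    """NFD-normalize and drop combining marks, leaving the bare letters."""
    nfd = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in nfd if not unicodedata.combining(ch))


def encode(bare: str, char_to_idx: Dict[str, int], unk_id: int = 0) -> List[int]:
    """Map bare characters to ids; unk_id is an assumed fallback."""
    return [char_to_idx.get(ch, unk_id) for ch in bare]


def decode(bare: str, classes: List[int], diac_to_idx: Dict[str, int]) -> str:
    """Attach the mark for each non-zero class id after its letter."""
    idx_to_diac = {idx: diac for diac, idx in diac_to_idx.items()}
    out = []
    for ch, cls in zip(bare, classes):
        out.append(ch)
        if cls != 0:
            out.append(idx_to_diac.get(cls, ""))
    return "".join(out)


def diacritize(text: str, model_path: str, vocab_path: str) -> str:
    """Run one sentence through the stitched graph (batch of one)."""
    import json

    import numpy as np
    import onnxruntime as ort  # imported lazily; needed only for inference

    with open(vocab_path, encoding="utf-8") as f:
        vocab = json.load(f)  # assumed layout: top-level char_to_idx / diac_to_idx
    bare = strip_marks(text)
    ids = np.array([encode(bare, vocab["char_to_idx"])], dtype=np.int64)
    session = ort.InferenceSession(model_path)
    (gated_cls,) = session.run(["gated_cls"], {"input": ids})
    return decode(bare, gated_cls[0].tolist(), vocab["diac_to_idx"])
```

The result is left in NFD form; apply `unicodedata.normalize("NFC", ...)` afterwards if precomposed letters such as أ are wanted. The packaged `Diacritizer` remains the supported entry point and may handle padding, unknown characters, and batching differently.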

## Usage

```python
from text2tashkeel import Diacritizer  # pip install text2tashkeel

diacritizer = Diacritizer("rawi-v2+rawi-v3")
print(diacritizer.diacritize("بسم الله الرحمن الرحيم"))
# بِسْمِ اللَّهِ الرَّحْمَنِ الرَّحِيمِ
```

Full benchmarks, the gating analysis, and the stitch tool
(`tools/build_ensemble_v2v3_onnx.py`) are in the `text2tashkeel` repo.