---
license: apache-2.0
datasets:
- TigreGotico/arabic_diacritized_text
language:
- ar
tags:
- arabic
- nlp
- diacritization
- tashkeel
- onnx
---
# Rawi Ensemble — flagship Arabic diacritizer (single ONNX)
The headline diacritization (tashkeel) model of the
[rawi](https://huggingface.co/TigreGotico/rawi-v2) line: a **gated ensemble of
[rawi-v2](https://huggingface.co/TigreGotico/rawi-v2) and
[rawi-v3](https://huggingface.co/TigreGotico/rawi-v3) stitched into a single ONNX
graph** — one input, one `session.run`.
## Credits
- **TigreGotico** — author: the corpus, the model, and the training notebook.
- **Mike Hansen** — ran the training on his GPUs.
## Results
Evaluated on the full 817k-sentence held-out test split of the corpus (per-position
DER, marked-only DER\*, and WER):
| variant | DER | DER\* | WER | size |
|---|---:|---:|---:|---:|
| `rawi_ensemble.onnx` (fp32) | **2.03%** | 2.93% | **7.47%** | 19.5 MB |
| `rawi_ensemble.int8.onnx` | **2.04%** | 2.94% | 7.51% | 4.9 MB |
The best DER of the rawi line: V1 18.49% → V2 2.29% → V2+V1 2.20% → **V2+V3 2.04%**.
INT8 is nearly lossless here (attention-free LSTMs): it stays within 0.01 points of
fp32 on DER and DER\* and within 0.04 points on WER. Read the corpus/contamination
caveats in the text2tashkeel docs before comparing across datasets.
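As a rough illustration of the three metrics, the sketch below computes them from per-position class ids. The definitions are assumptions inferred from the metric names (DER counts every position, DER\* only positions whose reference carries a mark, WER counts words with at least one wrong position); the function names are hypothetical, and the text2tashkeel docs remain the authority on how the reported numbers were produced.

```python
def der(ref, hyp, marked_only=False):
    """Fraction of positions whose predicted class differs from the reference.

    With marked_only=True, only positions where the reference class is
    non-zero (i.e. carries a mark) are scored: the assumed meaning of DER*.
    """
    pairs = [(r, h) for r, h in zip(ref, hyp) if not marked_only or r != 0]
    return sum(r != h for r, h in pairs) / len(pairs)


def wer(ref_words, hyp_words):
    """Fraction of words with at least one wrong position (assumed definition)."""
    return sum(r != h for r, h in zip(ref_words, hyp_words)) / len(ref_words)


# Toy class ids, 0 = no mark (hypothetical data, not from the corpus)
ref = [1, 0, 2, 0, 3]
hyp = [1, 0, 1, 2, 3]
print(der(ref, hyp))                    # 0.4  (2 of 5 positions wrong)
print(der(ref, hyp, marked_only=True))  # 1 of 3 marked positions wrong
print(wer([[1, 0], [2, 0, 3]], [[1, 0], [1, 2, 3]]))  # 0.5
```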
## How it works
Two rawi models share one BiLSTM-style NFD vocabulary, so a single token-id
sequence feeds both sub-graphs, and the gating is done **inside the graph**:
```
input → [ rawi-v2 → argmax ] ──(≠0 ? mark : no-mark)──┐
├→ gated_cls
input → [ rawi-v3 → value head → argmax ] ─────────────┘
gated_cls = where(argmax(rawi_v2) == 0, 0, argmax(rawi_v3.value))
```
- **rawi-v2** decides **where** a mark goes (its mark/no-mark prediction is the
best-calibrated gate of the family);
- **rawi-v3**'s value head decides **which** mark (the best per-marked-position
accuracy of any rawi model, DER\* 2.92%).
This is the factored *where / which* design, developed across rawi V1 → V2 → V3, run
in one pass. Diacritization works on the NFD form and restores hamza and the
superscript (dagger) alef in addition to the standard harakāt.
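The gating rule can be reproduced outside the graph with NumPy, which may help when debugging the two sub-models separately. This is a minimal sketch: the function name, the `[B, N, C]` logit shapes, and the toy values are assumptions for illustration, and the shipped ONNX graph already performs this step internally.

```python
import numpy as np


def gate(v2_logits: np.ndarray, v3_value_logits: np.ndarray) -> np.ndarray:
    """Combine per-position logits of shape [B, N, C] into class ids [B, N]."""
    where_cls = v2_logits.argmax(axis=-1)        # rawi-v2: class 0 means "no mark"
    which_cls = v3_value_logits.argmax(axis=-1)  # rawi-v3 value head: which mark
    # Keep rawi-v3's class only where rawi-v2 predicts that a mark is present.
    return np.where(where_cls == 0, 0, which_cls)


# Toy logits: batch of 1, 3 positions, 3 classes (hypothetical values)
v2 = np.array([[[5.0, 1.0, 0.0], [0.0, 4.0, 1.0], [0.0, 1.0, 3.0]]])
v3 = np.array([[[0.0, 2.0, 1.0], [0.0, 1.0, 6.0], [0.0, 3.0, 1.0]]])
print(gate(v2, v3))  # [[0 2 1]]
```

At the first position rawi-v2 predicts class 0, so the output is 0 even though rawi-v3 prefers class 1; at the other two positions rawi-v3's choice passes through.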
## Files
| file | size | what |
|---|---|---|
| `rawi_ensemble.onnx` | 19.5 MB | fp32 stitched graph |
| `rawi_ensemble.int8.onnx` | 4.9 MB | INT8 (both sub-models quantized) |
| `vocab.json` | — | `char_to_idx` / `diac_to_idx` (shared by both sub-models) |
**I/O.** Input `input` (int64 `[B, N]`) = NFD-bare char ids from `char_to_idx`.
Output `gated_cls` (int64 `[B, N]`) = one diacritic-class id per position; map
non-zero classes back through the inverse of `diac_to_idx` and attach each mark to
its letter. Tokenization and detokenization stay in host code (ONNX has no Unicode
logic).
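For callers who want to drive the graph directly with `onnxruntime`, the host-side steps might look like the sketch below. Several details are assumptions rather than documented behaviour: that `vocab.json` is a JSON object with `char_to_idx` and `diac_to_idx` keys, that `diac_to_idx` keys are the mark strings themselves, that unknown characters can be sent as id 0, and that "bare" means NFD with every combining character removed. The helper names are hypothetical; the `text2tashkeel` package is the reference implementation.

```python
import json
import unicodedata


def to_bare(text):
    """NFD-normalize and drop combining marks (harakat, hamza above/below, dagger alef)."""
    nfd = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in nfd if unicodedata.combining(ch) == 0)


def encode(bare, char_to_idx, unk_id=0):
    """Map characters to ids as a batch of one; unk_id=0 is an assumption."""
    return [[char_to_idx.get(ch, unk_id) for ch in bare]]


def decode(bare, class_ids, diac_to_idx):
    """Attach the predicted mark (if any) after each letter."""
    idx_to_diac = {i: d for d, i in diac_to_idx.items()}
    out = []
    for ch, cls in zip(bare, class_ids):
        out.append(ch)
        if cls != 0:
            out.append(idx_to_diac.get(cls, ""))
    # Recomposing to NFC is a presentation choice here, not documented behaviour.
    return unicodedata.normalize("NFC", "".join(out))


def diacritize(text, model_path="rawi_ensemble.onnx", vocab_path="vocab.json"):
    import numpy as np
    import onnxruntime as ort  # pip install onnxruntime

    with open(vocab_path, encoding="utf-8") as f:
        vocab = json.load(f)
    bare = to_bare(text)
    ids = np.asarray(encode(bare, vocab["char_to_idx"]), dtype=np.int64)
    session = ort.InferenceSession(model_path)
    (gated_cls,) = session.run(["gated_cls"], {"input": ids})
    return decode(bare, gated_cls[0].tolist(), vocab["diac_to_idx"])
```

Calling `diacritize("بسم الله")` with both files in the working directory would run one `session.run` over the whole string; swapping `model_path` for `rawi_ensemble.int8.onnx` selects the quantized graph.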
## Usage
```python
from text2tashkeel import Diacritizer # pip install text2tashkeel
Diacritizer("rawi-v2+rawi-v3").diacritize("بسم الله الرحمن الرحيم")
# بِسْمِ اللَّهِ الرَّحْمَنِ الرَّحِيمِ
```
Full benchmarks, the gating analysis, and the stitch tool
(`tools/build_ensemble_v2v3_onnx.py`) are in the `text2tashkeel` repo.