---
license: apache-2.0
datasets:
- TigreGotico/arabic_diacritized_text
language:
- ar
tags:
- arabic
- nlp
- diacritization
- tashkeel
- onnx
---

# Rawi Ensemble — flagship Arabic diacritizer (single ONNX)

The headline diacritization (tashkeel) model of the
[rawi](https://huggingface.co/TigreGotico/rawi-v2) line: a **gated ensemble of
[rawi-v2](https://huggingface.co/TigreGotico/rawi-v2) and
[rawi-v3](https://huggingface.co/TigreGotico/rawi-v3) stitched into a single ONNX
graph** — one input, one `session.run`.

## Collaboration

- **TigreGotico** — collected the corpus and authored the training notebooks.
- **Mike Hansen** — trained the rawi-v2 and rawi-v3 models.

## Results

Evaluated on the full 817k-sentence held-out test split of the corpus. Metrics are
per-position diacritic error rate (DER), marked-only DER (DER\*), and word error
rate (WER):

| variant | DER | DER\* | WER | size |
|---|---:|---:|---:|---:|
| `rawi_ensemble.onnx` (fp32) | **2.03%** | 2.93% | **7.47%** | 19.5 MB |
| `rawi_ensemble.int8.onnx` | **2.04%** | 2.94% | 7.51% | 4.9 MB |

This is the best DER of the rawi line: V1 18.49% → V2 2.29% → V2+V1 2.20% → **V2+V3 2.04%**.
INT8 quantization is nearly lossless here (+0.01 DER and +0.04 WER points in the
table above; both sub-models are attention-free LSTMs). Read the corpus/contamination
caveats in the text2tashkeel docs before comparing across datasets.
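
For readers who want to reproduce the three metrics on their own data, here is a minimal sketch. It is not the text2tashkeel evaluation code. It assumes that DER counts every position, that DER\* counts only positions where the gold label carries a mark (class id non-zero), and that a word is wrong if any of its positions is wrong. The function name and the `word_ids` argument are hypothetical.

```python
def error_rates(gold, pred, word_ids):
    """Return (DER, DER*, WER) for one sequence.

    gold, pred: diacritic-class ids per position (0 = no mark, an assumption).
    word_ids:   index of the word each position belongs to (hypothetical helper input).
    """
    assert len(gold) == len(pred) == len(word_ids)
    wrong = [g != p for g, p in zip(gold, pred)]

    # DER: share of all positions with a wrong class.
    der = sum(wrong) / len(wrong)

    # DER*: same, restricted to positions whose gold label is a mark.
    marked = [w for w, g in zip(wrong, gold) if g != 0]
    der_star = sum(marked) / len(marked) if marked else 0.0

    # WER: share of words containing at least one wrong position.
    bad_words = {wid for wid, w in zip(word_ids, wrong) if w}
    wer = len(bad_words) / len(set(word_ids))
    return der, der_star, wer


gold = [1, 0, 2, 3, 0, 1]
pred = [1, 0, 2, 4, 1, 1]
word_ids = [0, 0, 0, 1, 1, 1]
der, der_star, wer = error_rates(gold, pred, word_ids)
```

Whether spaces, punctuation, and non-Arabic characters count as positions changes the numbers, so check the text2tashkeel docs before comparing against the table.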

## How it works

Two rawi models share one BiLSTM-style NFD vocabulary, so a single token-id
sequence feeds both sub-graphs, and the gating is done **inside the graph**:

```
input → [ rawi-v2 → argmax ]  ──(≠0 ? mark : no-mark)──┐
                                                        ├→ gated_cls
input → [ rawi-v3 → value head → argmax ] ─────────────┘

gated_cls = where(argmax(rawi_v2) == 0, 0, argmax(rawi_v3.value))
```

- **rawi-v2** decides **where** a mark goes (its mark / no-mark prediction is the
  best-calibrated gate of the family);
- **rawi-v3**'s value head decides **which** mark (it has the best per-marked-position
  accuracy of any rawi model, DER\* 2.92%).
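
The gate itself is a single `where` over two argmaxes. The NumPy sketch below mirrors the formula outside the graph. The logits shapes (`[B, N, C]`), the toy values, and the assumption that both heads index the same class space are illustrative, not taken from the actual models.

```python
import numpy as np


def gate(v2_logits, v3_value_logits):
    """Combine two per-position predictions as in the stitched graph.

    Both inputs are assumed to have shape [B, N, C]; class 0 means "no mark".
    """
    v2_cls = v2_logits.argmax(axis=-1)        # rawi-v2: is there a mark here?
    v3_cls = v3_value_logits.argmax(axis=-1)  # rawi-v3 value head: which mark?
    # Where rawi-v2 predicts "no mark", force class 0; otherwise trust rawi-v3.
    return np.where(v2_cls == 0, 0, v3_cls)


# Toy batch: 1 sequence, 3 positions, 3 classes.
v2_logits = np.array([[[2.0, 1.0, 0.0], [0.0, 3.0, 1.0], [0.0, 1.0, 2.0]]])
v3_value_logits = np.array([[[0.0, 5.0, 1.0], [0.0, 1.0, 4.0], [0.0, 3.0, 1.0]]])
gated_cls = gate(v2_logits, v3_value_logits)
```

In the first position rawi-v2 predicts class 0, so rawi-v3's choice is discarded. In the other two positions rawi-v2 only opens the gate and rawi-v3 supplies the class.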

The ensemble applies the factored *where / which* design, developed across rawi
V1 → V2 → V3, in a single pass. Diacritization works on the NFD form and restores
hamza and the superscript (dagger) alef in addition to the standard harakāt.
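
Working on the NFD form is what makes hamza recoverable as a mark: Unicode canonical decomposition splits a letter such as أ (U+0623) into a bare alef (U+0627) plus a combining hamza above (U+0654). The sketch below shows this with the standard library. Stripping every character of category `Mn` is an assumption about how the bare input is produced, not the confirmed rawi preprocessing, and `nfd_bare` is a hypothetical helper name.

```python
import unicodedata


def nfd_bare(text):
    """Decompose to NFD, then drop all non-spacing combining marks (category Mn).

    Assumption: this approximates the "bare" model input. It removes harakat,
    combining hamza, and the dagger alef (U+0670) alike.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


alef_hamza = unicodedata.normalize("NFD", "\u0623")       # alef + combining hamza above
bare_word = nfd_bare("\u0628\u0650\u0633\u0652\u0645\u0650")  # bi-s-mi with harakat
```

Note that this rule would also strip other combining marks, such as the madda above (U+0653). Whether the models treat those as diacritic classes is not stated here.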

## Files

| file | size | what |
|---|---|---|
| `rawi_ensemble.onnx` | 19.5 MB | fp32 stitched graph |
| `rawi_ensemble.int8.onnx` | 4.9 MB | INT8 (both sub-models quantized) |
| `vocab.json` | — | `char_to_idx` / `diac_to_idx` (shared by both sub-models) |

**I/O.** Input `input` (int64 `[B, N]`) = NFD-bare char ids from `char_to_idx`.
Output `gated_cls` (int64 `[B, N]`) = one diacritic-class id per position; map
non-zero class ids back to diacritics by inverting `diac_to_idx`, and attach each
one to its letter. Tokenization and detokenization stay in host code (ONNX has no
Unicode logic).
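
For callers who want to run the graph directly with `onnxruntime`, the host-side steps might look like the sketch below. It relies on the tensor names and `vocab.json` keys listed above. The helper names, the `unk_id` fallback for unknown characters, and the assumption that `diac_to_idx` keys are the literal combining sequences are all hypothetical, so check them against the `text2tashkeel` source before relying on them.

```python
import json


def load_vocab(path="vocab.json"):
    """Load the shared vocabulary and build the inverse diacritic map."""
    with open(path, encoding="utf-8") as f:
        vocab = json.load(f)
    char_to_idx = vocab["char_to_idx"]
    idx_to_diac = {idx: diac for diac, idx in vocab["diac_to_idx"].items()}
    return char_to_idx, idx_to_diac


def encode(bare_text, char_to_idx, unk_id=0):
    """Map each bare character to its id (unk_id is an assumed fallback)."""
    return [char_to_idx.get(ch, unk_id) for ch in bare_text]


def decode(bare_text, class_ids, idx_to_diac):
    """Attach the predicted diacritic, if any, after each letter."""
    out = []
    for ch, cls in zip(bare_text, class_ids):
        out.append(ch)
        if cls != 0:  # class 0 = no mark
            out.append(idx_to_diac[cls])
    return "".join(out)


def run_ensemble(model_path, ids):
    """Run one sequence (batch of 1, so no padding id is needed)."""
    import numpy as np
    import onnxruntime as ort  # pip install onnxruntime

    session = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])
    batch = np.asarray([ids], dtype=np.int64)  # shape [1, N]
    (gated_cls,) = session.run(["gated_cls"], {"input": batch})
    return gated_cls[0].tolist()
```

A full round trip would be `decode(text, run_ensemble("rawi_ensemble.onnx", encode(text, char_to_idx)), idx_to_diac)`, with `text` already reduced to its NFD-bare form.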

## Usage

```python
from text2tashkeel import Diacritizer        # pip install text2tashkeel
Diacritizer("rawi-v2+rawi-v3").diacritize("بسم الله الرحمن الرحيم")
# بِسْمِ اللَّهِ الرَّحْمَنِ الرَّحِيمِ
```

Full benchmarks, the gating analysis, and the stitch tool
(`tools/build_ensemble_v2v3_onnx.py`) are in the `text2tashkeel` repo.