
	---
	license: apache-2.0
	datasets:
	- TigreGotico/arabic_diacritized_text
	language:
	- ar
	tags:
	- arabic
	- nlp
	- diacritization
	- tashkeel
	- onnx
	---

	# Rawi Ensemble — flagship Arabic diacritizer (single ONNX)

	The headline diacritization (tashkeel) model of the
	[rawi](https://huggingface.co/TigreGotico/rawi-v2) line: a **gated ensemble of
	[rawi-v2](https://huggingface.co/TigreGotico/rawi-v2) and
	[rawi-v3](https://huggingface.co/TigreGotico/rawi-v3) stitched into a single ONNX
	graph** — one input, one `session.run`.

	## Credits

	- TigreGotico — author: the corpus, the model, and the training notebook.
	- Mike Hansen — ran the training on his GPUs.

	## Results

	Evaluated on the full 817k-sentence held-out test split of the corpus, reporting
	per-position DER, marked-only DER\*, and WER:

	| variant | DER | DER\* | WER | size |
	|---|---:|---:|---:|---:|
	| `rawi_ensemble.onnx` (fp32) | 2.03% | 2.93% | 7.47% | 19.5 MB |
	| `rawi_ensemble.int8.onnx` | 2.04% | 2.94% | 7.51% | 4.9 MB |

	The best of the rawi line: V1 18.49% → V2 2.29% → V2+V1 2.20% → V2+V3 2.04%.
	INT8 quantization is near-lossless here (attention-free LSTMs): DER moves from
	2.03% to 2.04% and WER from 7.47% to 7.51%. Read the corpus/contamination caveats
	in the text2tashkeel docs before comparing across datasets.
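
The three metrics above can be sketched as follows. This is a minimal illustration under assumptions, not the benchmark code from `text2tashkeel`: it assumes class `0` means "no mark", that DER\* restricts scoring to positions where the reference carries a mark, and that WER counts a word as wrong if any of its positions is wrong. The function names are hypothetical.

```python
from typing import List, Sequence

NO_MARK = 0  # assumption: class 0 means "no diacritic at this position"


def der(ref: Sequence[int], hyp: Sequence[int], marked_only: bool = False) -> float:
    """Share of positions whose predicted class differs from the reference.

    With marked_only=True, only positions where the reference has a mark
    are scored (the assumed reading of DER*).
    """
    pairs = [(r, h) for r, h in zip(ref, hyp) if not marked_only or r != NO_MARK]
    if not pairs:
        return 0.0
    return sum(r != h for r, h in pairs) / len(pairs)


def wer(ref_words: List[Sequence[int]], hyp_words: List[Sequence[int]]) -> float:
    """Share of words with at least one wrongly diacritized position."""
    if not ref_words:
        return 0.0
    wrong = sum(list(r) != list(h) for r, h in zip(ref_words, hyp_words))
    return wrong / len(ref_words)


# Toy example: six positions grouped into three two-letter words.
ref = [1, 0, 2, 0, 3, 0]
hyp = [1, 0, 2, 0, 1, 0]  # one wrong mark, at position 4
ref_words = [ref[0:2], ref[2:4], ref[4:6]]
hyp_words = [hyp[0:2], hyp[2:4], hyp[4:6]]

print(der(ref, hyp))                    # 1 error over 6 positions
print(der(ref, hyp, marked_only=True))  # 1 error over 3 marked positions
print(wer(ref_words, hyp_words))        # 1 wrong word out of 3
```

In this toy case the marked-only rate is higher than the per-position rate because unmarked positions are excluded from the denominator, which is the same direction as the DER / DER\* gap in the table.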

	## How it works

	Two rawi models share one BiLSTM-style NFD vocabulary, so a single token-id
	sequence feeds both sub-graphs, and the gating is done inside the graph:

	```
	input → [ rawi-v2 → argmax ] ──(≠0 ? mark : no-mark)──┐
	                                                      ├→ gated_cls
	input → [ rawi-v3 → value head → argmax ] ─────────────┘

	gated_cls = where(argmax(rawi_v2) == 0, 0, argmax(rawi_v3.value))
	```

	- rawi-v2 decides where a mark goes (its mark/no-mark prediction is the
	best-calibrated gate of the family);
	- rawi-v3's value head decides which mark (the best per-marked-position
	accuracy of any rawi model, DER\* 2.92%).

	This is the factored where / which design, developed across rawi V1 → V2 → V3,
	combined in one pass. Diacritization works on the NFD form and restores hamza and
	the superscript (dagger) alef in addition to the standard harakāt.
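
The gating rule can be reproduced outside the graph in a few lines. The sketch below is plain Python for illustration only; the real model does this inside ONNX, and the per-position score lists, the class count, and the helper names here are made up.

```python
from typing import List, Sequence

NO_MARK = 0  # class 0 = "no mark", as in the gating formula


def argmax(scores: Sequence[float]) -> int:
    """Index of the highest score (the first one on ties)."""
    return max(range(len(scores)), key=scores.__getitem__)


def gate(
    v2_scores: Sequence[Sequence[float]],
    v3_value_scores: Sequence[Sequence[float]],
) -> List[int]:
    """gated_cls = where(argmax(v2) == 0, 0, argmax(v3.value)), per position."""
    gated = []
    for v2_row, v3_row in zip(v2_scores, v3_value_scores):
        if argmax(v2_row) == NO_MARK:
            gated.append(NO_MARK)         # rawi-v2 says "no mark here"
        else:
            gated.append(argmax(v3_row))  # rawi-v3 picks which mark
    return gated


# Three positions, four hypothetical classes.
v2 = [[0.9, 0.1, 0.0, 0.0], [0.1, 0.7, 0.2, 0.0], [0.2, 0.1, 0.6, 0.1]]
v3 = [[0.1, 0.8, 0.1, 0.0], [0.0, 0.2, 0.7, 0.1], [0.1, 0.1, 0.1, 0.7]]
print(gate(v2, v3))  # [0, 2, 3]
```

At the first position rawi-v3 would have placed class 1, but rawi-v2 votes "no mark", so the output is 0. At the second position rawi-v2 only opens the gate; its own choice of class 1 is ignored in favour of rawi-v3's class 2.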

	## Files

	\| file \| size \| what \|
	\|---\|---\|---\|
	\| `rawi_ensemble.onnx` \| 19.5 MB \| fp32 stitched graph \|
	\| `rawi_ensemble.int8.onnx` \| 4.9 MB \| INT8 (both sub-models quantized) \|
	\| `vocab.json` \| — \| `char_to_idx` / `diac_to_idx` (shared by both sub-models) \|

	I/O. Input `input` (int64 `[B, N]`) holds NFD-bare char ids from `char_to_idx`.
	Output `gated_cls` (int64 `[B, N]`) holds one diacritic-class id per position; map
	non-zero classes back to marks by inverting `diac_to_idx` and attach each mark to
	its letter. Tokenization and detokenization stay in host code (ONNX has no Unicode
	logic).
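
For readers calling the ONNX file directly rather than through `text2tashkeel`, the host-side steps might look like the sketch below. It is not the library's implementation. The toy vocabularies, the `<unk>` entry, and all function names are assumptions for illustration; the real `vocab.json` defines the actual ids and any special tokens. Only the tensor names `input` and `gated_cls` and the int64 `[B, N]` shapes come from this card.

```python
import unicodedata
from typing import Dict, List

# Toy stand-ins for vocab.json (hypothetical ids; load the real file in practice).
char_to_idx: Dict[str, int] = {
    "<unk>": 0,
    " ": 1,
    "\u0628": 2,  # beh
    "\u0633": 3,  # seen
    "\u0645": 4,  # meem
}
diac_to_idx: Dict[str, int] = {
    "": 0,        # class 0 = no mark
    "\u064e": 1,  # fatha
    "\u0650": 2,  # kasra
    "\u0652": 3,  # sukun
}
idx_to_diac = {i: d for d, i in diac_to_idx.items()}  # inverse map for decoding


def strip_marks(text: str) -> str:
    """NFD-normalize and drop combining marks, leaving bare letters."""
    nfd = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in nfd if not unicodedata.combining(ch))


def encode(text: str) -> List[int]:
    """Bare text -> char ids; unknown characters fall back to <unk> (assumption)."""
    unk = char_to_idx["<unk>"]
    return [char_to_idx.get(ch, unk) for ch in strip_marks(text)]


def attach(bare: str, classes: List[int]) -> str:
    """Append the mark for each non-zero class right after its letter."""
    out = []
    for ch, cls in zip(bare, classes):
        out.append(ch)
        if cls != 0:
            out.append(idx_to_diac[cls])
    return "".join(out)


def run_model(model_path: str, ids: List[int]) -> List[int]:
    """One session.run over a batch of one; needs numpy and onnxruntime."""
    import numpy as np
    import onnxruntime as ort

    session = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])
    batch = np.asarray([ids], dtype=np.int64)  # shape [1, N]
    (gated_cls,) = session.run(["gated_cls"], {"input": batch})
    return gated_cls[0].tolist()


bare = strip_marks("\u0628\u0650\u0633\u0652\u0645\u0650")  # diacritized "bism"
ids = encode(bare)
# classes = run_model("rawi_ensemble.onnx", ids)  # with the real vocab.json loaded
classes = [2, 3, 2]  # hand-written stand-in for the model output
print(attach(bare, classes))
```

Because the NFD form splits hamza-carrying letters into a base letter plus a combining hamza, stripping every combining mark also removes hamza, which fits the card's note that the model restores it.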

	## Usage

	```python
	from text2tashkeel import Diacritizer # pip install text2tashkeel
	Diacritizer("rawi-v2+rawi-v3").diacritize("بسم الله الرحمن الرحيم")
	# بِسْمِ اللَّهِ الرَّحْمَنِ الرَّحِيمِ
	```

	Full benchmarks, the gating analysis, and the stitch tool
	(`tools/build_ensemble_v2v3_onnx.py`) are in the `text2tashkeel` repo.