byt5-darija-emphatic

ByT5-base fine-tuned to predict emphatic consonants in Moroccan Darija Arabizi-to-Arabic transliteration.

What it does

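Moroccan Darija contrasts plain consonants with emphatic (pharyngealized) ones, but Arabizi usually writes both members of a pair with the same Latin letter, so the spelling alone does not say which Arabic letter is meant. The h pair is strictly a glottal/pharyngeal contrast rather than a plain/emphatic one; this card groups it with the others because Arabizi collapses it in the same way. The four ambiguous letters are: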

Arabizi letter	Plain Arabic	Emphatic Arabic
`s`	س	ص
`d`	د	ض
`t`	ت	ط
`h` (word-initial)	ه	ح

This model takes an Arabizi token and outputs its Arabic-script form with each ambiguous consonant resolved to plain or emphatic. Rule-based systems cannot do this reliably, because the Latin spelling does not encode the distinction.

Examples:

Input (Arabizi)	Output (Arabic)	Note
`sghir`	صغير	s → ص (emphatic)
`ananas`	اناناس	s → س (plain)
`badhik`	بالضحك	d → ض (emphatic)
`diyal`	ديال	d → د (plain)
`alwatania`	الوطنية	t → ط (emphatic)
`alklimat`	الكلمات	t → ت (plain)
`alhajja`	الحاجة	h → ح (emphatic)
`huma`	هوما	h → ه (plain)
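To make the ambiguity concrete, the sketch below lists the positions in an Arabizi token where the mapping table above leaves two possible Arabic letters. It is an illustration only and says nothing about the model's internals: `AMBIGUOUS` and `ambiguous_positions` are hypothetical names, and "word-initial h" is taken literally as index 0. How h after the article al- (as in `alhajja`) is routed is not specified in this card, so it is not modeled here.

```python
# Hypothetical helper: illustrates the ambiguity, not the model itself.
# Each Arabizi letter maps to (plain, emphatic) as in the table above.
AMBIGUOUS = {
    "s": ("س", "ص"),
    "d": ("د", "ض"),
    "t": ("ت", "ط"),
    "h": ("ه", "ح"),
}


def ambiguous_positions(token: str) -> list[tuple[int, str, tuple[str, str]]]:
    """Return (index, letter, (plain, emphatic)) for each ambiguous letter."""
    token = token.lower()
    found = []
    for i, ch in enumerate(token):
        if ch not in AMBIGUOUS:
            continue
        # Assumption: h counts only at index 0 (literal "word-initial").
        if ch == "h" and i != 0:
            continue
        found.append((i, ch, AMBIGUOUS[ch]))
    return found


for word in ("sghir", "ananas", "diyal", "huma"):
    print(word, ambiguous_positions(word))
```

A token with k ambiguous positions has 2^k candidate spellings under this simplified view, and nothing in the Latin form picks one of them.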

Usage

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

model_id = "anasskabil/byt5-darija-emphatic"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
model.eval()

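def predict(arabizi_token: str) -> str:
    """Transliterate one Arabizi token to Arabic script."""
    # Input is lowercased; anything beyond 32 byte-level tokens is truncated.
    inputs = tokenizer(
        arabizi_token.lower(),
        return_tensors="pt",
        truncation=True,
        max_length=32,
    )
    # Greedy decoding (no beam search, no sampling) keeps the output deterministic.
    with torch.inference_mode():
        out = model.generate(**inputs, max_length=24, num_beams=1, do_sample=False)
    return tokenizer.decode(out[0], skip_special_tokens=True)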

print(predict("sghir"))      # صغير     (emphatic ص)
print(predict("ananas"))     # اناناس   (plain س)
print(predict("alhajja"))    # الحاجة   (emphatic ح)
print(predict("alklimat"))   # الكلمات  (plain ت)

Input: a single lowercased Arabizi word (Latin-script Moroccan Arabic).
Output: the Arabic script transliteration with emphatic or plain consonants resolved.
The model does not add diacritics (harakat) and handles one token at a time.
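Because the model handles one token at a time, running it over a sentence means splitting the text and calling the predictor per token. The sketch below shows one way to do that. `transliterate_sentence` is a hypothetical helper, whitespace splitting is an assumption (punctuation, digits, and non-Arabizi words would need their own handling), and a stub dictionary stands in for the real model so the example runs offline.

```python
from typing import Callable


def transliterate_sentence(sentence: str, predict_fn: Callable[[str], str]) -> str:
    """Apply a single-token predictor to each whitespace-separated token."""
    tokens = sentence.lower().split()
    return " ".join(predict_fn(tok) for tok in tokens)


# Stub predictor standing in for the real model (illustration only).
_stub = {"sghir": "صغير", "huma": "هوما"}


def stub_predict(token: str) -> str:
    # Unknown tokens are passed through unchanged.
    return _stub.get(token, token)


print(transliterate_sentence("huma sghir", stub_predict))
```

With the real model, the `predict` function from the usage snippet above can be passed as `predict_fn`. Each token is still predicted in isolation, so this wrapper adds no sentence-level context.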

Architecture

Base: google/byt5-base (582M parameters; byte-level, so no learned subword vocabulary is needed)
Architecture: T5ForConditionalGeneration (encoder-decoder)
Input max length: 32 bytes
Output max length: 24 bytes

ByT5 is well-suited to this task because it operates at the byte/character level, which aligns with the character-to-character nature of emphatic prediction.
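Since ByT5 consumes one token per UTF-8 byte, the length limits above are easier to reason about in bytes than in letters. The sketch below counts bytes for two of the example pairs; `byte_len` is a hypothetical helper. ASCII Arabizi letters take one byte each and Arabic letters take two, so a 24-byte output budget covers at most about 12 Arabic letters, and fewer once special tokens are counted.

```python
def byte_len(text: str) -> int:
    """Number of UTF-8 bytes, i.e. byte-level tokens before any special tokens."""
    return len(text.encode("utf-8"))


examples = [("sghir", "صغير"), ("alwatania", "الوطنية")]
for arabizi, arabic in examples:
    print(f"{arabizi}: {byte_len(arabizi)} bytes in, {byte_len(arabic)} bytes out")
```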

Training

This is v3, fine-tuned from the previous published checkpoint (anasskabil/byt5-darija-emphatic v1).

Hyperparameter	v1 (google/byt5-base)	v3 (this model)
Base model	google/byt5-base	anasskabil/byt5-darija-emphatic
Epochs	5	2
Learning rate	1e-4	5e-5
Batch size	32	32
Max input length	32	32
Max output length	24	24

Dataset: DODa-derived Arabizi/Arabic word pairs filtered to emphatic-ambiguous tokens, supplemented with reviewed corrections weighted by source quality.
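The exact filtering rule is not given here. As a rough sketch, assuming "emphatic-ambiguous" means that the Arabic side contains at least one letter from the four pairs, a filter could look like the following. `EMPHATIC_PAIR_LETTERS`, `is_emphatic_ambiguous`, and the sample pairs are illustrative and not taken from the actual pipeline.

```python
# Assumption: a token is "emphatic-ambiguous" if its Arabic form contains
# any letter from the four plain/emphatic pairs covered by this model.
EMPHATIC_PAIR_LETTERS = set("سصدضتطهح")


def is_emphatic_ambiguous(arabic: str) -> bool:
    """True if the Arabic form contains a letter from one of the four pairs."""
    return any(ch in EMPHATIC_PAIR_LETTERS for ch in arabic)


# Illustrative (Arabizi, Arabic) pairs, not real dataset rows.
pairs = [("sghir", "صغير"), ("kbir", "كبير"), ("diyal", "ديال")]
kept = [(lat, ar) for lat, ar in pairs if is_emphatic_ambiguous(ar)]
print(kept)
```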

Evaluation

Evaluated on a held-out test set of 4,727 tokens (DODa-derived, no overlap with training data).

Overall exact match (all characters must be correct):

Split	Tokens	Exact Match
Test	4,727	70.2% (3,319 / 4,727)
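For reference, token-level exact match can be computed as below. This is a minimal sketch, not the evaluation script used for this card; `exact_match` is a hypothetical helper. The last line reproduces the reported percentage from the counts in the table above.

```python
def exact_match(predictions: list[str], references: list[str]) -> float:
    """Fraction of tokens whose predicted string equals the reference exactly."""
    if not references or len(predictions) != len(references):
        raise ValueError("predictions and references must be non-empty and aligned")
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


# Reported counts: 3,319 correct tokens out of 4,727.
print(f"{3319 / 4727:.1%}")
```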

Per-pair Precision / Recall / F1 on emphatic consonants:

Pair	Precision	Recall	F1	TP	Total gold	Total predicted
s / ص	78.9%	83.7%	81.3%	180	215	228
d / ض	72.8%	85.1%	78.5%	126	148	173
t / ط	75.8%	86.7%	80.9%	216	249	285
h / ح	92.1%	96.9%	94.5%	539	556	585
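The precision, recall, and F1 columns follow from the TP, gold, and predicted counts in the table above. The sketch below recomputes them; `prf` is a hypothetical helper, and the counts are copied from the table.

```python
def prf(tp: int, gold: int, predicted: int) -> tuple[float, float, float]:
    """Precision, recall, and F1 from true positives and the two totals."""
    precision = tp / predicted
    recall = tp / gold
    # Equivalent to the harmonic mean of precision and recall.
    f1 = 2 * tp / (gold + predicted)
    return precision, recall, f1


# (TP, total gold, total predicted) per pair, as reported in the table.
rows = {
    "s": (180, 215, 228),
    "d": (126, 148, 173),
    "t": (216, 249, 285),
    "h": (539, 556, 585),
}

for letter, (tp, gold, predicted) in rows.items():
    p, r, f = prf(tp, gold, predicted)
    print(f"{letter}: P={p:.1%} R={r:.1%} F1={f:.1%}")
```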

The 70.2% exact-match score is strict: every character in the token must be correct, and many tokens contain more than one ambiguous consonant. F1 on the individual emphatic pairs is higher, ranging from 78.5% (d / ض) to 94.5% (h / ح).

Limitations

One token at a time: the model has no sentence-level context. Emphatic harmony across words is not captured.
Arabizi only: input must be Latin-script Moroccan Darija (Arabizi). Arabic or French input is not handled.
h-initial only: the ح/ه distinction is predicted only for word-initial h. Medial/final h in consonant clusters (e.g., kh, gh) is handled by upstream rules and not passed to this model.
Q/velar not covered: the ق/گ ambiguity is handled separately by an upstream classifier; this model covers s/d/t/h only.

License

MIT
