byt5-darija-emphatic
ByT5-base fine-tuned to predict emphatic consonants in Moroccan Darija Arabizi-to-Arabic transliteration.
What it does
Moroccan Darija distinguishes plain and emphatic (pharyngealized) consonants, but Arabizi Latin script writes both members of each pair with the same letter:
| Arabizi letter | Plain Arabic | Emphatic Arabic |
|---|---|---|
| s | س | ص |
| d | د | ض |
| t | ت | ط |
| h (word-initial) | ه | ح |
This model takes an Arabizi token and outputs its Arabic-script form with each ambiguous consonant resolved to plain or emphatic. Rule-based systems cannot make this choice reliably without phonological context.
Examples:
| Input (Arabizi) | Output (Arabic) | Note |
|---|---|---|
| sghir | صغير | s → ص (emphatic) |
| ananas | اناناس | s → س (plain) |
| badhik | بالضحك | d → ض (emphatic) |
| diyal | ديال | d → د (plain) |
| alwatania | الوطنية | t → ط (emphatic) |
| alklimat | الكلمات | t → ت (plain) |
| alhajja | الحاجة | h → ح (emphatic) |
| huma | هوما | h → ه (plain) |
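The ambiguity described above can be made concrete with a small sketch that lists the positions in a token where a plain/emphatic choice has to be made. The mapping is copied from the first table; the helper name `ambiguous_positions` is hypothetical, and the rule that `h` counts only at position 0 is an assumption based on the table's "word-initial" note (how `h` after a prefix such as `al` is treated is not stated here).

```python
# Mapping copied from the table above: Arabizi letter -> (plain, emphatic)
AMBIGUOUS = {
    "s": ("س", "ص"),
    "d": ("د", "ض"),
    "t": ("ت", "ط"),
    "h": ("ه", "ح"),
}


def ambiguous_positions(token):
    """Return (index, letter, plain, emphatic) for each ambiguous letter."""
    hits = []
    for i, ch in enumerate(token.lower()):
        if ch not in AMBIGUOUS:
            continue
        if ch == "h" and i != 0:
            continue  # assumption: h is ambiguous only at the start of the token
        plain, emphatic = AMBIGUOUS[ch]
        hits.append((i, ch, plain, emphatic))
    return hits


for word in ["sghir", "ananas", "huma"]:
    print(word, ambiguous_positions(word))
```

A lookup like this only enumerates the candidates; it cannot pick between them, which is the part the model is trained to do.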
Usage
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

model_id = "anasskabil/byt5-darija-emphatic"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
model.eval()  # inference only: disables dropout


def predict(arabizi_token: str) -> str:
    # The model expects a single lowercased Arabizi word.
    inputs = tokenizer(
        arabizi_token.lower(),
        return_tensors="pt",
        truncation=True,
        max_length=32,
    )
    with torch.inference_mode():
        # Greedy decoding, capped at the output length used in training.
        out = model.generate(**inputs, max_length=24, num_beams=1, do_sample=False)
    return tokenizer.decode(out[0], skip_special_tokens=True)


print(predict("sghir"))     # صغير (emphatic ص)
print(predict("ananas"))    # اناناس (plain س)
print(predict("alhajja"))   # الحاجة (emphatic ح)
print(predict("alklimat"))  # الكلمات (plain ت)
```
Input: a single lowercased Arabizi word (Latin-script Moroccan Arabic).
Output: the Arabic script transliteration with emphatic or plain consonants resolved.
The model does not add diacritics (harakat) and handles one token at a time.
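Because the model handles one token at a time, running it on a sentence needs a thin wrapper. The sketch below is one possible approach, not part of the published model: `transliterate_sentence` is a hypothetical helper, it assumes plain whitespace tokenization, and it takes the token-level predictor as an argument so it can be tried with a stand-in lookup before loading the real model.

```python
from typing import Callable


def transliterate_sentence(sentence: str, predict: Callable[[str], str]) -> str:
    """Apply a token-level predictor to each whitespace-separated word."""
    tokens = sentence.lower().split()
    return " ".join(predict(tok) for tok in tokens)


# Stand-in predictor for demonstration only; unknown tokens pass through unchanged.
lookup = {"sghir": "صغير", "huma": "هوما"}


def demo_predict(token: str) -> str:
    return lookup.get(token, token)


print(transliterate_sentence("Huma sghir", demo_predict))
```

With the real model, pass the `predict` function from the usage snippet instead of `demo_predict`. Punctuation, digits, and non-Darija words are not treated specially here and would need their own handling.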
Architecture
- Base: google/byt5-base (582M parameters, byte-level, no tokenizer vocabulary needed)
- Architecture: T5ForConditionalGeneration (encoder-decoder)
- Input max length: 32 bytes
- Output max length: 24 bytes
ByT5 is well-suited to this task because it operates at the byte/character level, which aligns with the character-to-character nature of emphatic prediction.
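To illustrate what byte-level input means in practice, the sketch below reproduces the ByT5 id scheme in plain Python: each UTF-8 byte is shifted by 3 (ids 0, 1, and 2 are reserved for pad, end-of-sequence, and unknown) and an end-of-sequence id is appended. The helper names are hypothetical. It also shows that Arabic letters take two bytes each in UTF-8, which matters when reading the length limits above; exactly how those limits count special tokens is not stated in this card.

```python
BYT5_OFFSET = 3  # ids 0, 1, 2 are pad, eos, unk in the ByT5 tokenizer
EOS_ID = 1


def byt5_ids(text):
    """Map text to ByT5-style ids: one id per UTF-8 byte, plus end-of-sequence."""
    return [b + BYT5_OFFSET for b in text.encode("utf-8")] + [EOS_ID]


def utf8_len(text):
    """Number of bytes the text occupies in UTF-8."""
    return len(text.encode("utf-8"))


print(byt5_ids("sghir"))     # [118, 106, 107, 108, 117, 1]
print(utf8_len("sghir"))     # 5 bytes for 5 Latin letters
print(utf8_len("الكلمات"))   # 14 bytes for 7 Arabic letters
```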
Training
This is v3, fine-tuned from the previous published checkpoint (anasskabil/byt5-darija-emphatic v1).
| Hyperparameter | v1 (google/byt5-base) | v3 (this model) |
|---|---|---|
| Base model | google/byt5-base | anasskabil/byt5-darija-emphatic |
| Epochs | 5 | 2 |
| Learning rate | 1e-4 | 5e-5 |
| Batch size | 32 | 32 |
| Max input length | 32 | 32 |
| Max output length | 24 | 24 |
Dataset: DODa-derived Arabizi/Arabic word pairs filtered to emphatic-ambiguous tokens, supplemented with reviewed corrections weighted by source quality.
Evaluation
Evaluated on a held-out test set of 4,727 tokens (DODa-derived, no overlap with training data).
Overall exact match (all characters must be correct):
| Split | Tokens | Exact Match |
|---|---|---|
| Test | 4,727 | 70.2% (3,319 / 4,727) |
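For reference, token-level exact match can be computed as in the sketch below. The function name and the four-item example are illustrative only (the example is toy data, not drawn from the test set); the last line just rechecks the reported ratio.

```python
def exact_match(predictions, references):
    """Fraction of tokens whose predicted string equals the reference exactly."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


# Toy data: the third prediction picks ح where the reference has ه.
gold = ["صغير", "ديال", "هوما", "الكلمات"]
pred = ["صغير", "ديال", "حوما", "الكلمات"]
print(exact_match(pred, gold))      # 0.75

print(round(100 * 3319 / 4727, 1))  # 70.2
```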
Per-pair Precision / Recall / F1 on emphatic consonants:
| Pair | Precision | Recall | F1 | TP | Total gold | Total predicted |
|---|---|---|---|---|---|---|
| s / ص | 78.9% | 83.7% | 81.3% | 180 | 215 | 228 |
| d / ض | 72.8% | 85.1% | 78.5% | 126 | 148 | 173 |
| t / ط | 75.8% | 86.7% | 80.9% | 216 | 249 | 285 |
| h / ح | 92.1% | 96.9% | 94.5% | 539 | 556 | 585 |
The 70.2% token exact match is a stricter measure than the per-pair scores: a token counts as correct only if every character matches, and many tokens contain more than one ambiguous character. F1 on the individual emphatic pairs is higher, ranging from 78.5% to 94.5%.
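The per-pair scores follow directly from the TP, gold, and predicted counts in the table. A short sketch that recomputes them, using the standard definitions (precision = TP / predicted, recall = TP / gold); the function name `prf` is illustrative:

```python
# (TP, total gold, total predicted) copied from the per-pair table
COUNTS = {
    "s / ص": (180, 215, 228),
    "d / ض": (126, 148, 173),
    "t / ط": (216, 249, 285),
    "h / ح": (539, 556, 585),
}


def prf(tp, gold, predicted):
    """Return (precision, recall, F1) from raw counts."""
    precision = tp / predicted
    recall = tp / gold
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


for pair, (tp, gold, predicted) in COUNTS.items():
    p, r, f = prf(tp, gold, predicted)
    print(f"{pair}: P={p:.1%} R={r:.1%} F1={f:.1%}")
```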
Limitations
- One token at a time: the model has no sentence-level context. Emphatic harmony across words is not captured.
- Arabizi only: input must be Latin-script Moroccan Darija (Arabizi). Arabic or French input is not handled.
- h-initial only: the ح/ه distinction is predicted only for word-initial h. Medial or final h in consonant clusters (e.g., kh, gh) is handled by upstream rules and not passed to this model.
- Q/velar not covered: the ق/گ ambiguity is handled separately by an upstream classifier; this model covers s/d/t/h only.
License
MIT