byt5-darija-emphatic

ByT5-base fine-tuned to predict emphatic consonants in Moroccan Darija Arabizi-to-Arabic transliteration.

What it does

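Moroccan Darija distinguishes emphatic (pharyngealized) consonants from their plain counterparts, but Arabizi often writes both with the same Latin letter, so the contrast is lost in Latin script. The h pair (ه/ح) is a glottal/pharyngeal contrast rather than a true emphatic pair, but it is ambiguous in the same way and is grouped with the emphatics here: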

| Arabizi letter | Plain Arabic | Emphatic Arabic |
|---|---|---|
| s | س | ص |
| d | د | ض |
| t | ت | ط |
| h (word-initial) | ه | ح |

This model takes an Arabizi token and outputs its Arabic-script form with each ambiguous consonant resolved to the emphatic or plain letter. Rule-based systems cannot solve this task reliably without phonological context, because the Latin spelling alone does not mark the distinction.

Examples:

| Input (Arabizi) | Output (Arabic) | Note |
|---|---|---|
| sghir | صغير | s → ص (emphatic) |
| ananas | اناناس | s → س (plain) |
| badhik | بالضحك | d → ض (emphatic) |
| diyal | ديال | d → د (plain) |
| alwatania | الوطنية | t → ط (emphatic) |
| alklimat | الكلمات | t → ت (plain) |
| alhajja | الحاجة | h → ح (emphatic) |
| huma | هوما | h → ه (plain) |

Usage

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

model_id = "anasskabil/byt5-darija-emphatic"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
model.eval()

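def predict(arabizi_token: str) -> str:
    """Transliterate a single Arabizi token to Arabic script."""
    # The model expects lowercased input; ByT5 encodes it as UTF-8 bytes.
    inputs = tokenizer(
        arabizi_token.lower(),
        return_tensors="pt",
        truncation=True,
        max_length=32,
    )
    # Greedy decoding (one beam, no sampling) keeps the output deterministic.
    with torch.inference_mode():
        out = model.generate(**inputs, max_length=24, num_beams=1, do_sample=False)
    return tokenizer.decode(out[0], skip_special_tokens=True)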

print(predict("sghir"))      # صغير     (emphatic ص)
print(predict("ananas"))     # اناناس   (plain س)
print(predict("alhajja"))    # الحاجة   (emphatic ح)
print(predict("alklimat"))   # الكلمات  (plain ت)

Input: a single lowercased Arabizi word (Latin-script Moroccan Arabic).
Output: the Arabic script transliteration with emphatic or plain consonants resolved.
The model does not add diacritics (harakat) and handles one token at a time.
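Since the model handles one token at a time, a sentence has to be split before it reaches the model. The sketch below shows one possible wrapper and is not part of the published code: `transliterate_sentence` and `stub_predict` are hypothetical names, whitespace splitting is an assumption, and the upstream rules mentioned under Limitations are not included. The stub stands in for the `predict` function defined above so the example runs without downloading the model.

```python
from typing import Callable, Dict


def transliterate_sentence(sentence: str, predict: Callable[[str], str]) -> str:
    """Apply a per-token predictor to every whitespace-separated token.

    Assumption: tokens are separated by whitespace and contain no
    punctuation that would need separate handling.
    """
    cache: Dict[str, str] = {}  # repeated words are only predicted once
    outputs = []
    for token in sentence.lower().split():
        if token not in cache:
            cache[token] = predict(token)
        outputs.append(cache[token])
    return " ".join(outputs)


# Stand-in predictor built from two rows of the examples table.
# Replace it with the real `predict` function to use the model.
_LOOKUP = {"sghir": "صغير", "diyal": "ديال"}


def stub_predict(token: str) -> str:
    return _LOOKUP.get(token, token)


print(transliterate_sentence("Sghir diyal", stub_predict))
```

Because each token is predicted in isolation, the wrapper adds convenience only. It does not give the model any sentence-level context.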

Architecture

  • Base: google/byt5-base (582M parameters, byte-level, no tokenizer vocabulary needed)
  • Architecture: T5ForConditionalGeneration (encoder-decoder)
  • Input max length: 32 bytes
  • Output max length: 24 bytes

ByT5 is well-suited to this task because it operates at the byte/character level, which aligns with the character-to-character nature of emphatic prediction.
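Because the length limits above are counted in bytes rather than characters, it can help to check how many bytes a token actually occupies. The sketch below is illustrative only: `utf8_len` and `fits_budget` are hypothetical helper names, and the check ignores the special tokens added around the sequence, so the real usable length is slightly lower.

```python
INPUT_BUDGET = 32   # bytes, input max length stated above
OUTPUT_BUDGET = 24  # bytes, output max length stated above


def utf8_len(text: str) -> int:
    """Number of UTF-8 bytes, which is the unit a byte-level model sees."""
    return len(text.encode("utf-8"))


def fits_budget(arabizi: str, arabic: str) -> bool:
    """Rough check that a word pair fits the stated byte limits.

    Assumption: special tokens are not counted here.
    """
    return utf8_len(arabizi) <= INPUT_BUDGET and utf8_len(arabic) <= OUTPUT_BUDGET


for arabizi, arabic in [("sghir", "صغير"), ("alwatania", "الوطنية")]:
    print(arabizi, utf8_len(arabizi), utf8_len(arabic), fits_budget(arabizi, arabic))
```

Latin letters take 1 byte and Arabic letters take 2 bytes in UTF-8, so the 24-byte output limit corresponds to at most 12 Arabic letters.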

Training

This is v3, fine-tuned from the previous published checkpoint (anasskabil/byt5-darija-emphatic v1).

| Hyperparameter | v1 (google/byt5-base) | v3 (this model) |
|---|---|---|
| Base model | google/byt5-base | anasskabil/byt5-darija-emphatic |
| Epochs | 5 | 2 |
| Learning rate | 1e-4 | 5e-5 |
| Batch size | 32 | 32 |
| Max input length | 32 | 32 |
| Max output length | 24 | 24 |

Dataset: DODa-derived Arabizi/Arabic word pairs filtered to emphatic-ambiguous tokens, supplemented with reviewed corrections weighted by source quality.

Evaluation

Evaluated on a held-out test set of 4,727 tokens (DODa-derived, no overlap with training data).

Overall exact match (all characters must be correct):

| Split | Tokens | Exact Match |
|---|---|---|
| Test | 4,727 | 70.2% (3,319 / 4,727) |

Per-pair Precision / Recall / F1 on emphatic consonants:

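| Pair | Precision | Recall | F1 | TP | Total gold | Total predicted |
|---|---|---|---|---|---|---|
| s / ص | 78.9% | 83.7% | 81.3% | 180 | 215 | 228 |
| d / ض | 72.8% | 85.1% | 78.5% | 126 | 148 | 173 |
| t / ط | 75.8% | 86.7% | 80.9% | 216 | 249 | 285 |
| h / ح | 92.1% | 96.9% | 94.5% | 539 | 556 | 585 |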

The 70.2% token-level exact match is a strict metric: a token counts only if every character is correct, and many tokens contain more than one ambiguous character. F1 on the individual emphatic pairs is substantially higher (78.5% to 94.5%).
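The percentages in the per-pair table follow directly from the reported counts. As a minimal sketch (the `PairCounts` class and `COUNTS` dictionary are illustrative names, not part of the released evaluation code), precision is TP divided by total predicted, recall is TP divided by total gold, and F1 is their harmonic mean:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PairCounts:
    tp: int         # emphatic letters predicted correctly
    gold: int       # emphatic letters in the reference
    predicted: int  # emphatic letters output by the model

    @property
    def precision(self) -> float:
        return self.tp / self.predicted

    @property
    def recall(self) -> float:
        return self.tp / self.gold

    @property
    def f1(self) -> float:
        # Harmonic mean of precision and recall, written in terms of counts.
        return 2 * self.tp / (self.gold + self.predicted)


# Counts copied from the per-pair table, keyed by the Arabizi letter.
COUNTS = {
    "s": PairCounts(tp=180, gold=215, predicted=228),
    "d": PairCounts(tp=126, gold=148, predicted=173),
    "t": PairCounts(tp=216, gold=249, predicted=285),
    "h": PairCounts(tp=539, gold=556, predicted=585),
}

for letter, c in COUNTS.items():
    print(f"{letter}: P={c.precision:.1%} R={c.recall:.1%} F1={c.f1:.1%}")

# Token-level exact match from the overall table.
print(f"exact match: {3319 / 4727:.1%}")
```

In every pair the model predicts more emphatic letters than the reference contains, which is why recall is higher than precision throughout.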

Limitations

  • One token at a time: the model has no sentence-level context. Emphatic harmony across words is not captured.
  • Arabizi only: input must be Latin-script Moroccan Darija (Arabizi). Arabic or French input is not handled.
  • h-initial only: the ح/ه distinction is predicted only for word-initial h. Medial/final h in consonant clusters (e.g., kh, gh) is handled by upstream rules and not passed to this model.
  • Q/velar not covered: the ق/گ ambiguity is handled separately by an upstream classifier; this model covers s/d/t/h only.

License

MIT
