dialect-router-v0.1

A lightweight Arabic dialect identification model that classifies input text into one of 11 Arabic dialect / language codes. It is used as the routing backbone in the Lahgtna pipeline to automatically select the correct voice reference and Chatterbox language token for speech synthesis.


Model Details

| Property     | Value                                             |
|--------------|---------------------------------------------------|
| Architecture | Transformer encoder (sequence classification head)|
| Task         | Multi-class text classification                   |
| Input        | Raw Arabic text (up to 512 tokens)                |
| Output       | One of 11 dialect codes                           |
| Language     | Arabic (ar)                                       |
| Parameters   | 11.6M (F32, safetensors)                          |
| License      | MIT                                               |

Dialect Labels

| Label | Dialect                      | Region       |
|-------|------------------------------|--------------|
| eg    | Egyptian                     | Egypt        |
| sa    | Saudi                        | Saudi Arabia |
| mo    | Moroccan (Darija)            | Morocco      |
| iq    | Iraqi                        | Iraq         |
| sd    | Sudanese                     | Sudan        |
| tn    | Tunisian                     | Tunisia      |
| lb    | Lebanese                     | Lebanon      |
| sy    | Syrian                       | Syria        |
| ly    | Libyan                       | Libya        |
| ps    | Palestinian                  | Palestine    |
| ar    | Modern Standard Arabic (MSA) | N/A          |
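For convenience, the label set can be mirrored as a plain Python mapping so predicted codes can be turned into display names. This dict simply restates the table above; it is not read from the model, and the `dialect_name` helper is an illustrative addition, not part of the released code.

```python
# Label-code -> dialect-name mapping, restating the table above.
DIALECT_NAMES = {
    "eg": "Egyptian",
    "sa": "Saudi",
    "mo": "Moroccan (Darija)",
    "iq": "Iraqi",
    "sd": "Sudanese",
    "tn": "Tunisian",
    "lb": "Lebanese",
    "sy": "Syrian",
    "ly": "Libyan",
    "ps": "Palestinian",
    "ar": "Modern Standard Arabic (MSA)",
}

def dialect_name(code: str) -> str:
    """Return a human-readable name for a predicted label code."""
    return DIALECT_NAMES.get(code, "Unknown")

print(dialect_name("eg"))  # Egyptian
```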

Intended Use

Primary use

Dialect-aware TTS routing: given an Arabic utterance, predict the dialect so the correct speaker reference audio and Chatterbox language code can be selected automatically.
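The routing step can be sketched as a lookup from predicted dialect code to a speaker reference and language token. Note that the file paths and token names below are illustrative placeholders, not the actual Lahgtna configuration:

```python
# Hypothetical routing table: dialect code -> (speaker reference, language token).
# Paths and tokens are placeholders; only three routes shown for brevity.
ROUTES = {
    "eg": ("voices/egyptian_ref.wav", "ar-eg"),
    "sa": ("voices/saudi_ref.wav", "ar-sa"),
    "ar": ("voices/msa_ref.wav", "ar"),
}
DEFAULT_ROUTE = ROUTES["ar"]  # fall back to MSA for unmapped codes

def select_route(dialect: str) -> tuple[str, str]:
    """Pick the voice reference and language token for a predicted dialect."""
    return ROUTES.get(dialect, DEFAULT_ROUTE)

ref_wav, lang_token = select_route("eg")
```

A simple dict lookup with an MSA default keeps the router robust to label codes that have no dedicated voice reference yet.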

Secondary use

Standalone Arabic dialect identification for NLP pipelines, content filtering, dataset analysis, or any application that needs to distinguish Arabic dialects programmatically.

Out-of-scope use

  • Non-Arabic languages
  • Code-switched text (Arabic + English mixed)
  • Dialect intensity scoring or fine-grained subdialect classification
  • High-stakes decisions without human review

How to Use

Direct inference

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_id = "oddadmix/dialect-router-v0.1"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()

# Egyptian Arabic, roughly: "oh my head, one's head is aching"
text = "اه ياراسي الواحد دماغه وجعاه"

inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    logits = model(**inputs).logits

pred_id = torch.argmax(logits, dim=-1).item()
dialect = model.config.id2label[pred_id]
print(dialect)  # e.g. "eg"
```
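The argmax above discards confidence information; applying a softmax to the logits yields a per-label probability. In practice you would call `torch.softmax(logits, dim=-1)` on the model output; the sketch below uses plain Python and made-up logits purely to show the arithmetic:

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up logits for the 11 labels, index-aligned with model.config.id2label.
logits = [4.1, 0.2, -1.3, 0.0, -0.5, -0.9, 0.7, 0.4, -1.1, 0.1, 1.5]
probs = softmax(logits)
pred_id = max(range(len(probs)), key=probs.__getitem__)
confidence = probs[pred_id]
```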

With the Transformers pipeline

```python
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="oddadmix/dialect-router-v0.1",
)
result = classifier("اه ياراسي الواحد دماغه وجعاه")
print(result)
# [{'label': 'eg', 'score': 0.94}]
```

Inside Lahgtna TTS

```python
from inference import run_pipeline

# Dialect is detected automatically
run_pipeline(
    text="اه ياراسي الواحد دماغه وجعاه",
    output_path="output.wav",
)
```

Limitations & Biases

  • Short texts (fewer than 5 tokens) may produce unreliable predictions; the model benefits from sentence-length input.
  • Code-switched text (e.g. Arabic + French in Moroccan Darija, or Arabic + English) may confuse the classifier.
  • Dialect continuum: dialects from geographically adjacent regions (e.g. sy / lb, eg / ly) may be confused by the model.
  • Corpus bias: the label distribution in the training data may not reflect real-world dialect prevalence; some dialects (e.g. sd, ly) may have lower recall.
  • This model should not be used for identity classification of individuals.
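Given these caveats, one defensive pattern for TTS routing is to fall back to MSA (`ar`) whenever the classifier's top score is low. The threshold below is a hypothetical value for illustration, not a tuned one:

```python
def route_with_fallback(label: str, score: float, threshold: float = 0.6) -> str:
    """Return the predicted dialect code, or fall back to MSA ('ar') when
    the classifier's top score is below a hypothetical confidence threshold."""
    return label if score >= threshold else "ar"

print(route_with_fallback("sd", 0.41))  # low confidence -> "ar"
print(route_with_fallback("eg", 0.94))  # confident -> "eg"
```

Routing uncertain inputs to an MSA voice is a conservative default; short or code-switched utterances then get a neutral rendering rather than a confidently wrong dialect.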

Citation

If you use this model in your research or product, please cite:

```bibtex
@misc{lahgtna-dialect-router-2025,
  title  = {dialect-router-v0.1: Arabic Dialect Identification for TTS Routing},
  author = {Oddadmix},
  year   = {2025},
  url    = {https://huggingface.co/oddadmix/dialect-router-v0.1}
}
```
