dialect-router-v0.1

A lightweight Arabic dialect identification model that classifies input text into one of 11 Arabic dialect / language codes. It is used as the routing backbone in the Lahgtna pipeline to automatically select the correct voice reference and Chatterbox language token for speech synthesis.


Model Details

| Property     | Value                                             |
|--------------|---------------------------------------------------|
| Architecture | Transformer encoder (sequence classification head)|
| Task         | Multi-class text classification                   |
| Input        | Raw Arabic text (up to 512 tokens)                |
| Output       | One of 11 dialect codes                           |
| Language     | Arabic (ar)                                       |
| Parameters   | 11.6M (F32, safetensors)                          |
| License      | MIT                                               |

Dialect Labels

| Label | Dialect                      | Region       |
|-------|------------------------------|--------------|
| eg    | Egyptian                     | Egypt        |
| sa    | Saudi                        | Saudi Arabia |
| mo    | Moroccan (Darija)            | Morocco      |
| iq    | Iraqi                        | Iraq         |
| sd    | Sudanese                     | Sudan        |
| tn    | Tunisian                     | Tunisia      |
| lb    | Lebanese                     | Lebanon      |
| sy    | Syrian                       | Syria        |
| ly    | Libyan                       | Libya        |
| ps    | Palestinian                  | Palestine    |
| ar    | Modern Standard Arabic (MSA) | N/A          |
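For convenience, the label set can be mirrored as a plain Python mapping so predicted codes can be turned into display names. This dict simply restates the table above; it is not read from the model, and the `dialect_name` helper is an illustrative addition, not part of the released code.

```python
# Label-code -> dialect-name mapping, restating the table above.
DIALECT_NAMES = {
    "eg": "Egyptian",
    "sa": "Saudi",
    "mo": "Moroccan (Darija)",
    "iq": "Iraqi",
    "sd": "Sudanese",
    "tn": "Tunisian",
    "lb": "Lebanese",
    "sy": "Syrian",
    "ly": "Libyan",
    "ps": "Palestinian",
    "ar": "Modern Standard Arabic (MSA)",
}

def dialect_name(code: str) -> str:
    """Return a human-readable name for a predicted label code."""
    return DIALECT_NAMES.get(code, "Unknown")

print(dialect_name("eg"))  # Egyptian
```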

Intended Use

Primary use

Dialect-aware TTS routing: given an Arabic utterance, predict the dialect so the correct speaker reference audio and Chatterbox language code can be selected automatically.
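The routing step can be sketched as a lookup from predicted dialect code to a speaker reference and language token. Note that the file paths and token names below are illustrative placeholders, not the actual Lahgtna configuration:

```python
# Hypothetical routing table: dialect code -> (speaker reference, language token).
# Paths and tokens are placeholders; only three routes shown for brevity.
ROUTES = {
    "eg": ("voices/egyptian_ref.wav", "ar-eg"),
    "sa": ("voices/saudi_ref.wav", "ar-sa"),
    "ar": ("voices/msa_ref.wav", "ar"),
}
DEFAULT_ROUTE = ROUTES["ar"]  # fall back to MSA for unmapped codes

def select_route(dialect: str) -> tuple[str, str]:
    """Pick the voice reference and language token for a predicted dialect."""
    return ROUTES.get(dialect, DEFAULT_ROUTE)

ref_wav, lang_token = select_route("eg")
```

A simple dict lookup with an MSA default keeps the router robust to label codes that have no dedicated voice reference yet.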

Secondary use

Standalone Arabic dialect identification for NLP pipelines, content filtering, dataset analysis, or any application that needs to distinguish Arabic dialects programmatically.

Out-of-scope use

  • Non-Arabic languages
  • Code-switched text (Arabic + English mixed)
  • Dialect intensity scoring or fine-grained subdialect classification
  • High-stakes decisions without human review

How to Use

Direct inference

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_id = "oddadmix/dialect-router-v0.1"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()

# Egyptian Arabic, roughly: "oh my head, one's head is aching"
text = "اه ياراسي الواحد دماغه وجعاه"

inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    logits = model(**inputs).logits

pred_id = torch.argmax(logits, dim=-1).item()
dialect = model.config.id2label[pred_id]
print(dialect)  # e.g. "eg"
```
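The argmax above discards confidence information; applying a softmax to the logits yields a per-label probability. In practice you would call `torch.softmax(logits, dim=-1)` on the model output; the sketch below uses plain Python and made-up logits purely to show the arithmetic:

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up logits for the 11 labels, index-aligned with model.config.id2label.
logits = [4.1, 0.2, -1.3, 0.0, -0.5, -0.9, 0.7, 0.4, -1.1, 0.1, 1.5]
probs = softmax(logits)
pred_id = max(range(len(probs)), key=probs.__getitem__)
confidence = probs[pred_id]
```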

With the Transformers pipeline

```python
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="oddadmix/dialect-router-v0.1",
)
result = classifier("اه ياراسي الواحد دماغه وجعاه")
print(result)
# [{'label': 'eg', 'score': 0.94}]
```

Inside Lahgtna TTS

```python
from inference import run_pipeline

# Dialect is detected automatically
run_pipeline(
    text="اه ياراسي الواحد دماغه وجعاه",
    output_path="output.wav",
)
```

Limitations & Biases

  • Short texts (fewer than 5 tokens) may produce unreliable predictions; the model benefits from sentence-length input.
  • Code-switched text (e.g. Arabic + French in Moroccan Darija, or Arabic + English) may confuse the classifier.
  • Dialect continuum: dialects from geographically adjacent regions (e.g. sy / lb, eg / ly) may be confused by the model.
  • Corpus bias: the label distribution in the training data may not reflect real-world dialect prevalence; some dialects (e.g. sd, ly) may have lower recall.
  • This model should not be used for identity classification of individuals.
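Given these caveats, one defensive pattern for TTS routing is to fall back to MSA (`ar`) whenever the classifier's top score is low. The threshold below is a hypothetical value for illustration, not a tuned one:

```python
def route_with_fallback(label: str, score: float, threshold: float = 0.6) -> str:
    """Return the predicted dialect code, or fall back to MSA ('ar') when
    the classifier's top score is below a hypothetical confidence threshold."""
    return label if score >= threshold else "ar"

print(route_with_fallback("sd", 0.41))  # low confidence -> "ar"
print(route_with_fallback("eg", 0.94))  # confident -> "eg"
```

Routing uncertain inputs to an MSA voice is a conservative default; short or code-switched utterances then get a neutral rendering rather than a confidently wrong dialect.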

Citation

If you use this model in your research or product, please cite:

```bibtex
@misc{lahgtna-dialect-router-2025,
  title  = {dialect-router-v0.1: Arabic Dialect Identification for TTS Routing},
  author = {Oddadmix},
  year   = {2025},
  url    = {https://huggingface.co/oddadmix/dialect-router-v0.1}
}
```
