dialect-router-v0.1
A lightweight Arabic dialect identification model that classifies input text into one of 11 Arabic dialect / language codes. It is used as the routing backbone in the Lahgtna pipeline to automatically select the correct voice reference and Chatterbox language token for speech synthesis.
Model Details
| Property | Value |
|---|---|
| Architecture | Transformer encoder (sequence classification head) |
| Task | Multi-class text classification |
| Input | Raw Arabic text (up to 512 tokens) |
| Output | One of 11 dialect codes |
| Language | Arabic (ar) |
| License | MIT |
Dialect Labels
| Label | Dialect | Region |
|---|---|---|
| `eg` | Egyptian | Egypt |
| `sa` | Saudi | Saudi Arabia |
| `mo` | Moroccan (Darija) | Morocco |
| `iq` | Iraqi | Iraq |
| `sd` | Sudanese | Sudan |
| `tn` | Tunisian | Tunisia |
| `lb` | Lebanese | Lebanon |
| `sy` | Syrian | Syria |
| `ly` | Libyan | Libya |
| `ps` | Palestinian | Palestine |
| `ar` | Modern Standard Arabic (MSA) | N/A |
Intended Use
Primary use
Dialect-aware TTS routing: given an Arabic utterance, predict the dialect so that the correct speaker reference audio and Chatterbox language code can be selected automatically.
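The routing step can be pictured as a lookup from the predicted dialect code to a speaker reference clip and a language token. The sketch below is illustrative only: the `voices/<code>.wav` layout, the `Route` class, and the assumption that the language token equals the dialect code are all hypothetical and may not match what the Lahgtna pipeline actually does.

```python
from dataclasses import dataclass

# Dialect codes taken from the label table in this card.
DIALECT_CODES = ("eg", "sa", "mo", "iq", "sd", "tn", "lb", "sy", "ly", "ps", "ar")


@dataclass(frozen=True)
class Route:
    dialect: str
    reference_audio: str  # hypothetical path to the speaker reference clip
    language_token: str   # hypothetical Chatterbox language token


def build_routes(voice_dir: str = "voices") -> dict[str, Route]:
    """Build one route per dialect code (paths and tokens are assumptions)."""
    return {
        code: Route(code, f"{voice_dir}/{code}.wav", code)
        for code in DIALECT_CODES
    }


def route_for(dialect: str, routes: dict[str, Route], default: str = "ar") -> Route:
    """Return the route for a predicted dialect, falling back to MSA if unknown."""
    return routes.get(dialect, routes[default])


routes = build_routes()
print(route_for("eg", routes))
```

Falling back to `ar` for an unknown label is one possible design choice; a stricter caller might prefer to raise an error instead.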
Secondary use
Standalone Arabic dialect identification for NLP pipelines, content filtering, dataset analysis, or any application that needs to distinguish Arabic dialects programmatically.
Out-of-scope use
- Non-Arabic languages
- Code-switched text (Arabic + English mixed)
- Dialect intensity scoring or fine-grained subdialect classification
- High-stakes decisions without human review
How to Use
Direct inference
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_id = "oddadmix/dialect-router-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()

# Sample sentence in Egyptian Arabic
text = "اه ياراسي الواحد دماغه وجعاه"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

# Run the forward pass without tracking gradients
with torch.no_grad():
    logits = model(**inputs).logits

# Map the highest-scoring class index back to its dialect label
pred_id = torch.argmax(logits, dim=-1).item()
dialect = model.config.id2label[pred_id]
print(dialect)  # e.g. "eg"
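The `argmax` above keeps only the winning label. When a ranked list with probabilities is more useful, a softmax over the logits gives one. The following is a minimal, dependency-free sketch; the three-label mapping and the logit values are made up for illustration. With the real model, one would presumably pass `logits[0].tolist()` and `model.config.id2label`.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities (numerically stable form)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def rank_labels(logits, id2label, top_k=3):
    """Return the top_k (label, probability) pairs, most likely first."""
    probs = softmax(logits)
    pairs = [(id2label[i], p) for i, p in enumerate(probs)]
    pairs.sort(key=lambda pair: pair[1], reverse=True)
    return pairs[:top_k]


# Made-up values for illustration; not output from the actual model.
example_id2label = {0: "eg", 1: "ly", 2: "ar"}
example_logits = [2.0, 1.0, 0.1]

for label, prob in rank_labels(example_logits, example_id2label):
    print(f"{label}: {prob:.3f}")
```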
With the Transformers pipeline
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="oddadmix/dialect-router-v0.1",
)

# Sample sentence in Egyptian Arabic
result = classifier("اه ياراسي الواحد دماغه وجعاه")
print(result)
# e.g. [{'label': 'eg', 'score': 0.94}]
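Because the pipeline returns a score alongside each label, a caller can decline to trust low-confidence predictions. The sketch below assumes the list-of-dicts output shape shown above; the `pick_dialect` helper, the 0.6 threshold, and the fallback to `ar` (MSA) are hypothetical choices, not part of the model or the Lahgtna pipeline.

```python
def pick_dialect(predictions, min_score=0.6, fallback="ar"):
    """Pick a dialect from pipeline-style output, or fall back when unsure.

    predictions: list of {'label': str, 'score': float} dicts for one input.
    min_score and fallback are illustrative defaults, not tuned values.
    """
    if not predictions:
        return fallback
    best = max(predictions, key=lambda p: p["score"])
    return best["label"] if best["score"] >= min_score else fallback


print(pick_dialect([{"label": "eg", "score": 0.94}]))  # prints "eg"
print(pick_dialect([{"label": "ly", "score": 0.41}]))  # prints "ar"
```

A suitable threshold depends on the application and would need to be chosen on held-out data.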
Inside Lahgtna TTS
from inference import run_pipeline

# Dialect is detected automatically
run_pipeline(
    text="اه ياراسي الواحد دماغه وجعاه",
    output_path="output.wav",
)
Limitations & Biases
- Short texts (< 5 tokens) may produce unreliable predictions; the model benefits from sentence-length input.
- Code-switched text (e.g. Arabic + French in Moroccan Darija, or Arabic + English) may confuse the classifier.
- Dialect continuum: dialects from geographically adjacent regions (e.g. `sy`/`lb`, `eg`/`ly`) may be confused by the model.
- Corpus bias: label distribution in training data may not reflect real-world dialect prevalence; some dialects (e.g. `sd`, `ly`) may have lower recall.
- This model should not be used for identity classification of individuals.
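Some of these limitations can be flagged before the classifier is called. The sketch below is a rough heuristic under stated assumptions: it counts whitespace-separated words as a stand-in for model tokens, treats any character in the basic Arabic Unicode block as Arabic, and detects only unaccented Latin letters. The `precheck` helper and both thresholds are hypothetical.

```python
import re

ARABIC_CHAR = re.compile(r"[\u0600-\u06FF]")  # basic Arabic Unicode block
LATIN_LETTER = re.compile(r"[A-Za-z]")        # unaccented Latin letters only


def precheck(text, min_words=5, max_latin_ratio=0.2):
    """Return warning flags for inputs the model is likely to handle poorly."""
    warnings = []
    # Whitespace words are only a proxy for the tokenizer's token count.
    if len(text.split()) < min_words:
        warnings.append("short_text")
    arabic = len(ARABIC_CHAR.findall(text))
    latin = len(LATIN_LETTER.findall(text))
    if arabic == 0:
        warnings.append("no_arabic")
    elif latin / (arabic + latin) > max_latin_ratio:
        warnings.append("code_switched")
    return warnings


print(precheck("hello world"))  # prints ['short_text', 'no_arabic']
```

An empty list means no flag was raised; it does not guarantee a reliable prediction.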
Citation
If you use this model in your research or product, please cite:
@misc{lahgtna-dialect-router-2025,
title = {dialect-router-v0.1: Arabic Dialect Identification for TTS Routing},
author = {Oddadmix},
year = {2025},
url = {https://huggingface.co/oddadmix/dialect-router-v0.1}
}
Related Resources
- Lahgtna TTS checkpoint: oddadmix/lahgtna-chatterbox-v1
- Inference code: lahgtna-tts on GitHub
- Chatterbox backbone: resemble-ai/chatterbox
Base model
- asafaya/bert-mini-arabic