arabnamer — XGBoost Arabic name transliteration model

Pruned to 386 boosting rounds, ~38 MB gzipped, trained on 22,798 English-Arabic name pairs.

This is the model artifact used by the arabnamer Python library. It predicts the Arabic transliteration of a lowercase English name token via per-character multiclass classification (335 Arabic label tokens).

Quick use

pip install arabnamer

from arabnamer import translit
r = translit("Mohammed Ali")
print(r.arabic)   # 'محمد علي'

The wheel ships with this model bundled — no download from Hugging Face needed at runtime. This HF page exists for discoverability and the inference widget.

Model details


Property	Value
Framework	XGBoost 2.0+ (saved in UBJ binary format, gzipped)
Boosting rounds	386 (one tree per class per round: 386 × 335 = 129,310 trees)
Max depth	12
Output classes	335 in total: Arabic label tokens plus an empty string for silent characters
Input features	34 per character: padded char IDs, position, phonetic class, bigram/trigram IDs
File size	160 MB uncompressed `.ubj`, 38 MB gzipped `.ubj.gz`
Inference speed	~0.5 ms per 8-char token on CPU
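
For readers who want to inspect the artifact outside the library wrapper, here is a minimal sketch of loading a gzipped UBJ model. It assumes the `xgboost` package is installed; the file name passed in (for example `model.ubj.gz`) is hypothetical, since this card does not state the artifact's actual file name. The sketch decompresses to a temporary `.ubj` file and then calls `Booster.load_model` on that path.

```python
import gzip
import shutil
import tempfile
from pathlib import Path


def gunzip_to_temp(gz_path) -> Path:
    """Decompress a .ubj.gz file into a temporary .ubj file and return its path."""
    tmp = tempfile.NamedTemporaryFile(suffix=".ubj", delete=False)
    with gzip.open(gz_path, "rb") as src, tmp:
        shutil.copyfileobj(src, tmp)
    return Path(tmp.name)


def load_booster(gz_path):
    """Load the decompressed model into an xgboost.Booster."""
    import xgboost as xgb  # imported lazily so gunzip_to_temp works without xgboost

    ubj_path = gunzip_to_temp(gz_path)
    try:
        booster = xgb.Booster()
        booster.load_model(str(ubj_path))
        return booster
    finally:
        ubj_path.unlink()  # the model now lives in memory; remove the temp file
```

The arabnamer library presumably performs an equivalent step internally, so this is only needed when working with the raw file directly.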

Training

Dataset: arabnamer/arabic-name-pairs — 22,798 EN→AR name pairs
Training script: training/train_xgboost.py
Pruning pipeline: train 800 rounds, then use find_min_k.py to search for the smallest number of rounds k whose per-name scores exactly match the baseline. Result: 386 rounds
Hardware: any CPU with 4 GB RAM — roughly 5 min on a modern laptop
Determinism: training with multi-threaded tree_method="hist" is only approximately reproducible; use n_jobs=1 for bit-perfect reproducibility
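
The pruning step can be illustrated with a short sketch. This is not the contents of find_min_k.py, which is not shown on this card; it is one plausible reading of "smallest k whose per-name scores exactly match the baseline". The `score_at` callable is hypothetical: it stands for "evaluate the benchmark using only the first k rounds", which with XGBoost could be done by passing `iteration_range=(0, k)` to `Booster.predict`. A linear scan is used because nothing guarantees that an exact match at one k implies a match at every larger k.

```python
from typing import Callable, Sequence


def find_min_k(score_at: Callable[[int], Sequence[float]], max_k: int) -> int:
    """Return the smallest k in [1, max_k] whose per-name scores equal those at max_k."""
    baseline = list(score_at(max_k))
    for k in range(1, max_k):
        if list(score_at(k)) == baseline:
            return k
    return max_k  # no smaller k reproduces the baseline


# Toy stand-in for a real evaluation: scores stop changing once k reaches 5.
def toy_scores(k: int) -> list:
    return [min(k, 5) * 20.0, min(k, 3) * 30.0]


print(find_min_k(toy_scores, max_k=10))  # prints 5
```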

Evaluation

Metric	Value
Average lenient similarity	98.4
Pass rate ≥ 70	25 / 25
Pass rate ≥ 90	24 / 25
Exact match (= 100)	21 / 25

Benchmark: 25 MENA-region names. Scoring is lenient: tashkeel (diacritics) is stripped, hamza, taa marbuta and alef maksura variants are unified, and the score is max(fuzz.ratio, fuzz.partial_ratio) on the normalized strings.
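
The lenient scoring described above can be sketched as follows. The exact character sets and folding directions are assumptions (the card names the categories but not the mapping), and `rapidfuzz` is assumed as the provider of `fuzz.ratio` and `fuzz.partial_ratio`; the project may use a different package exposing the same functions. The `summarize` helper is likewise hypothetical and only shows how the table's four metrics relate to a list of per-name scores.

```python
import re

# Assumed tashkeel set: harakat, shadda, sukun (U+064B..U+0652) and superscript alef (U+0670).
_TASHKEEL = re.compile("[\u064B-\u0652\u0670]")

# Assumed folding table; the project's actual mapping may differ.
_FOLD = str.maketrans({
    "\u0623": "\u0627",  # alef with hamza above -> bare alef
    "\u0625": "\u0627",  # alef with hamza below -> bare alef
    "\u0622": "\u0627",  # alef with madda -> bare alef
    "\u0629": "\u0647",  # taa marbuta -> haa
    "\u0649": "\u064A",  # alef maksura -> yaa
})


def normalize(text: str) -> str:
    """Strip diacritics, then fold hamza / taa marbuta / alef maksura variants."""
    return _TASHKEEL.sub("", text).translate(_FOLD)


def lenient_score(predicted: str, reference: str) -> float:
    """Score two Arabic strings on a 0-100 scale after normalization."""
    from rapidfuzz import fuzz  # assumption: rapidfuzz supplies the fuzz functions

    a, b = normalize(predicted), normalize(reference)
    return max(fuzz.ratio(a, b), fuzz.partial_ratio(a, b))


def summarize(scores):
    """Aggregate per-name scores into the metrics reported in the table."""
    return {
        "average": sum(scores) / len(scores),
        "pass_70": sum(s >= 70 for s in scores),
        "pass_90": sum(s >= 90 for s in scores),
        "exact": sum(s == 100 for s in scores),
    }
```

Taking the maximum of the two ratios means a prediction that matches a substring of the reference (or vice versa) can still score highly, which is part of what makes the metric lenient.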

Per-name results: benchmarks/REPORT.md

Intended uses

Arabic name transliteration for search, KYC, entity resolution, library cataloguing
Training-data augmentation — use this model's predictions as silver labels for larger corpora
Research into trade-offs between classical ML and neural transliteration

Limitations

Single-token input required: the model itself expects one English word at a time; multi-token names are split and handled by the library wrapper
Max length 13 characters: longer tokens are returned unchanged
MENA focus: trained primarily on Arab, Persian and Turkic names transliterated into English; less accurate for uncommon African or South Asian names that follow Arabic transliteration traditions
Dialect neutrality: outputs Modern Standard Arabic spelling; colloquial or regional spellings (Egyptian, Maghrebi) are not modeled
No Arabic→English direction: this model only handles English→Arabic
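
The first two limitations imply a thin wrapper around the per-token model. A minimal sketch of that behavior follows; it is not the arabnamer implementation. `predict_token` is a hypothetical callable standing in for the model's single-token prediction, and splitting on whitespace is an assumption about how the library tokenizes names.

```python
from typing import Callable

MAX_TOKEN_LEN = 13  # limit stated in the limitations list


def transliterate_name(name: str, predict_token: Callable[[str], str]) -> str:
    """Transliterate a multi-token name one token at a time."""
    out = []
    for token in name.split():
        if len(token) > MAX_TOKEN_LEN:
            out.append(token)  # too long for the model: pass through unchanged
        else:
            out.append(predict_token(token.lower()))  # the model expects lowercase input
    return " ".join(out)


# Stub predictor used only to show the control flow.
stub = lambda t: f"<{t}>"
print(transliterate_name("Mohammed Abdulrahmanovich", stub))
# prints: <mohammed> Abdulrahmanovich
```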

Citation

@software{yousef_arabnamer_2026,
  author  = {Yousef, Elsayed},
  title   = {arabnamer: Arabic name transliteration and similarity},
  year    = {2026},
  version = {0.1.0},
  url     = {https://github.com/sayedyousef/arabnamer},
}

License

CC-BY-4.0 — attribution required. Commercial use permitted.
