arabnamer: XGBoost Arabic name transliteration model

Pruned to 386 boosting rounds, ~38 MB gzipped, trained on 22,798 English-Arabic name pairs.

This is the model artifact used by the arabnamer Python library. It predicts the Arabic transliteration of a lowercase English name token via per-character multiclass classification (335 Arabic label tokens).
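To make the per-character framing concrete, the sketch below shows how one predicted label per input character could be joined into an Arabic string, with the empty string standing in for a silent character. The label values are hand-written for illustration and are not model output, and `decode_labels` is a hypothetical helper rather than part of the arabnamer API.

```python
def decode_labels(token: str, labels: list[str]) -> str:
    """Join one predicted label per input character into the output string."""
    if len(token) != len(labels):
        raise ValueError("expected exactly one label per input character")
    # Silent characters carry the empty-string label, so they simply vanish.
    return "".join(labels)


# Hand-written labels for illustration only (not produced by the model).
token = "mohammed"
labels = ["م", "", "ح", "", "م", "", "", "د"]
arabic = decode_labels(token, labels)
```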

Quick use

pip install arabnamer

from arabnamer import translit
r = translit("Mohammed Ali")
print(r.arabic)   # 'محمد علي'

The wheel ships with this model bundled, so no download from Hugging Face is needed at runtime. This HF page exists for discoverability and the inference widget.

Model details

Framework        XGBoost 2.0+ (saved in UBJ binary format, gzipped)
Boosting rounds  386 (× 335 classes = 129,310 trees)
Max depth        12
Output classes   335 (Arabic label tokens + empty string for silent characters)
Input features   34 per character: padded char IDs, position, phonetic class, bigram/trigram IDs
File size        160 MB uncompressed .ubj, 38 MB gzipped .ubj.gz
Inference speed  ~0.5 ms per 8-character token on CPU
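
For readers who want to inspect the artifact outside the library, the gzipped file has to be decompressed back to plain .ubj first. The following is a minimal sketch using only the standard library; `gunzip_model` and the file names in the usage note are placeholders, not names taken from this repository.

```python
import gzip
import shutil
from pathlib import Path


def gunzip_model(src: Path, dst: Path) -> Path:
    """Stream-decompress a .ubj.gz file to a .ubj file in fixed-size chunks."""
    with gzip.open(src, "rb") as f_in, open(dst, "wb") as f_out:
        # copyfileobj streams the data, so the 160 MB payload is never
        # held in memory all at once.
        shutil.copyfileobj(f_in, f_out)
    return dst
```

Assuming XGBoost 2.0+ is installed, the decompressed file could then be loaded with `booster = xgboost.Booster()` followed by `booster.load_model("model.ubj")`, where `model.ubj` is whatever path was passed as `dst`.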

Training

  • Dataset: arabnamer/arabic-name-pairs (22,798 EN→AR name pairs)
  • Training script: training/train_xgboost.py
  • Pruning pipeline: start at 800 rounds, then use find_min_k.py to search for the smallest k whose per-name scores exactly match the baseline; the result is 386 rounds
  • Hardware: any CPU with 4 GB RAM; roughly 5 min on a modern laptop
  • Determinism: approximate (multi-threaded tree_method="hist"); use n_jobs=1 for bit-perfect reproducibility
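
The pruning search described in the list above can be sketched as follows. This is not the contents of find_min_k.py, which is not shown here; `find_min_rounds` and `score_names` are hypothetical names. `score_names(k)` is assumed to return the per-name benchmark scores obtained when only the first k boosting rounds are used (with XGBoost this could be done through the `iteration_range` argument of `Booster.predict`).

```python
from typing import Callable, Sequence


def find_min_rounds(
    score_names: Callable[[int], Sequence[float]],
    max_rounds: int,
) -> int:
    """Return the smallest k in 1..max_rounds whose per-name scores equal the baseline's."""
    baseline = list(score_names(max_rounds))
    # Linear scan: an exact match at some k does not guarantee a match at
    # every larger k, so a binary search could miss the true minimum.
    for k in range(1, max_rounds):
        if list(score_names(k)) == baseline:
            return k
    return max_rounds
```

A linear scan costs up to `max_rounds` benchmark evaluations, which is affordable for a small benchmark; a faster search would need the additional assumption that matching is monotone in k.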

Evaluation

Metric                      Value
Average lenient similarity  98.4
Pass rate ≥ 70              25 / 25
Pass rate ≥ 90              24 / 25
Exact match (= 100)         21 / 25

Benchmark: 25 MENA-region names. Scoring is lenient (tashkeel stripped, hamza/taa-marbuta/alef-maksura unified, then max(fuzz.ratio, fuzz.partial_ratio)).

Per-name results: benchmarks/REPORT.md
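
The lenient scoring rule can be sketched as below. Two things here are assumptions: the exact normalization tables used by the benchmark are not given, so the mappings are a plausible guess, and `difflib.SequenceMatcher` is swapped in for `fuzz.ratio` / `fuzz.partial_ratio` to keep the example dependency-free, so scores for non-identical strings may differ slightly from the real fuzz functions.

```python
import re
from difflib import SequenceMatcher

# Assumed normalization tables; the benchmark's exact mappings may differ.
TASHKEEL = re.compile("[\u064B-\u0652\u0670]")  # tanween, harakat, shadda, sukun, dagger alef
UNIFY = str.maketrans({
    "\u0623": "\u0627",  # alef with hamza above -> bare alef
    "\u0625": "\u0627",  # alef with hamza below -> bare alef
    "\u0622": "\u0627",  # alef with madda       -> bare alef
    "\u0629": "\u0647",  # taa marbuta           -> haa
    "\u0649": "\u064A",  # alef maksura          -> yaa
})


def normalize(text: str) -> str:
    """Strip tashkeel, then unify hamza / taa-marbuta / alef-maksura variants."""
    return TASHKEEL.sub("", text).translate(UNIFY)


def ratio(a: str, b: str) -> float:
    """Stand-in for fuzz.ratio: similarity on a 0-100 scale."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def partial_ratio(a: str, b: str) -> float:
    """Stand-in for fuzz.partial_ratio: best match of the shorter string
    against any equally long window of the longer one."""
    short, long_ = sorted((a, b), key=len)
    if not short:
        return ratio(short, long_)
    windows = range(len(long_) - len(short) + 1)
    return max(ratio(short, long_[i:i + len(short)]) for i in windows)


def lenient_score(pred: str, gold: str) -> float:
    p, g = normalize(pred), normalize(gold)
    return max(ratio(p, g), partial_ratio(p, g))
```

Under this sketch, a prediction that differs from the reference only in diacritics or in one of the unified letter variants scores 100.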

Intended uses

  • Arabic name transliteration for search, KYC, entity resolution, library cataloguing
  • Training-data augmentation: use this model's predictions as silver labels for larger corpora
  • Research into trade-offs between classical ML and neural transliteration

Limitations

  • Single-token input required: multi-token names are handled by the library wrapper, but the model itself expects one English word at a time
  • Max length 13 characters: longer tokens are returned unchanged
  • MENA focus: trained primarily on Arab / Persian / Turkic names transliterated into English; less accurate for unusual African or South Asian names with Arabic transliteration traditions
  • Dialect neutrality: outputs Modern Standard Arabic spelling; colloquial or regional spellings (Egyptian, Maghrebi) are not modeled
  • No Arabic→English direction: this model only handles English→Arabic
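
The single-token and length constraints above suggest a thin wrapper along the following lines. This is a sketch of the idea, not the library's actual wrapper: `transliterate_name` is a hypothetical name, and `predict_token` stands for any callable that maps one lowercase English token to its Arabic form.

```python
from typing import Callable

MAX_TOKEN_LEN = 13  # length limit stated in the limitations list


def transliterate_name(name: str, predict_token: Callable[[str], str]) -> str:
    """Split a full name on whitespace and transliterate each token independently."""
    out = []
    for token in name.split():
        if len(token) > MAX_TOKEN_LEN:
            out.append(token)  # too long for the model: returned unchanged
        else:
            # The model expects lowercase input, so lowercase just before predicting.
            out.append(predict_token(token.lower()))
    return " ".join(out)
```

For example, with a stub predictor backed by a small dictionary, `transliterate_name("Mohammed Ali", stub)` would call `stub("mohammed")` and `stub("ali")` and join the two results with a space.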

Citation

@software{yousef_arabnamer_2026,
  author  = {Yousef, Elsayed},
  title   = {arabnamer: Arabic name transliteration and similarity},
  year    = {2026},
  version = {0.1.0},
  url     = {https://github.com/sayedyousef/arabnamer},
}

License

CC-BY-4.0: attribution required. Commercial use permitted.
