Chagatai (Arabic) → English / Kazakh Translation Model
Model Description
This model is a fine-tuned NLLB-200-distilled-600M model for translation from
Chagatai (Arabic script) to English and Kazakh (Cyrillic).
Chagatai is a historical Turkic literary language used in Central Asia until the early 20th century.
The model focuses exclusively on Chagatai written in Arabic script.
Supported Translation Directions
- Chagatai (Arabic script) → English
- Chagatai (Arabic script) → Kazakh (Cyrillic)
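Both directions can be exercised with the usual NLLB inference pattern in Hugging Face Transformers. The following is a minimal sketch, not a documented API of this model: the repository name is taken from the model page, the helper names `load` and `translate` and the `max_new_tokens` default are illustrative, and it assumes the published tokenizer accepts `chg_Arab` as a source language code.

```python
SOURCE_LANG = "chg_Arab"
SUPPORTED_TARGETS = {"eng_Latn", "kaz_Cyrl"}
MODEL_ID = "chagatai-project/chagatai-model"  # repository name as shown on the model page


def load(model_id=MODEL_ID):
    """Load tokenizer and model (hypothetical helper; imports transformers lazily)."""
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    # Assumption: the saved tokenizer resolves the custom chg_Arab code as src_lang.
    tokenizer = AutoTokenizer.from_pretrained(model_id, src_lang=SOURCE_LANG)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
    return tokenizer, model


def translate(text, target_lang, tokenizer, model, max_new_tokens=128):
    """Translate Chagatai (Arabic script) text into one of the supported targets."""
    if target_lang not in SUPPORTED_TARGETS:
        raise ValueError(f"unsupported target language: {target_lang!r}")
    inputs = tokenizer(text, return_tensors="pt")
    # NLLB selects the output language by forcing its code as the first decoded token.
    output_ids = model.generate(
        **inputs,
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(target_lang),
        max_new_tokens=max_new_tokens,
    )
    return tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0]
```

A call would look like `tokenizer, model = load()` followed by `translate(chagatai_text, "kaz_Cyrl", tokenizer, model)`, where `chagatai_text` is a string in Arabic-script Chagatai.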
Training Data
The model was trained on two combined datasets containing both word-level and sentence-level aligned translations.
1. Sözdikqor + QazCorpus Transliteration Lexicon
- Source: QazCorpus platform
https://sozdikqor.kz/sozdik/?id=16#page-332
- Origin: Digitized texts of Abulghazi Bahadur Khan
- Languages Used:
- Chagatai (Arabic script)
- English
- Kazakh
- Size: ~2,600 word-level entries
This dataset provides high-quality historical vocabulary aligned with modern Kazakh and English meanings.
2. Digitized Textbook Corpus
- Source:
An Introduction to Chaghatay: A Graded Textbook for Reading Central Asian Sources
by Eric Schluessel
https://quod.lib.umich.edu/m/maize/images/mpub10110094.pdf
- Structure: Chapter-based educational corpus
- Languages Used:
- Chagatai (Arabic script): chg_Arab
- English: eng_Latn
- Kazakh: kaz_Cyrl
Each chapter includes:
- complete sentences, and
- individual vocabulary items used in those sentences.
This structure allows the model to learn both lexical translations and contextual sentence usage.
Total Training Data
- ~5,300 aligned examples, combining word-level and sentence-level pairs
Language Codes
Since Chagatai is not included in the standard NLLB language set, custom tokens were added.
| Language | Script | Token | Notes |
|---|---|---|---|
| Chagatai | Arabic | chg_Arab | Custom token added |
| English | Latin | eng_Latn | Standard NLLB token |
| Kazakh | Cyrillic | kaz_Cyrl | Standard NLLB token |
Note: A chg_Latn token was also added, but it was not used during training.
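The card does not say how the custom tokens were registered. One common way to do it in Transformers is to add the codes as special tokens and then resize the model's embedding matrix, sketched below. The helper name `add_language_tokens` is hypothetical, and this is not necessarily the procedure the authors followed.

```python
def add_language_tokens(tokenizer, model, codes):
    """Register new language-code tokens and grow the embeddings to match.

    `tokenizer` and `model` are expected to be a Transformers tokenizer and
    model, e.g. loaded from facebook/nllb-200-distilled-600M.
    """
    # add_tokens returns how many of the given tokens were actually new.
    num_added = tokenizer.add_tokens(list(codes), special_tokens=True)
    if num_added:
        # New vocabulary entries need embedding rows; these start untrained
        # and are learned during fine-tuning.
        model.resize_token_embeddings(len(tokenizer))
    return num_added
```

Under these assumptions, `add_language_tokens(tokenizer, model, ["chg_Arab", "chg_Latn"])` would be called once, before fine-tuning starts.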
Training Details
- Base Model: facebook/nllb-200-distilled-600M
- Epochs: 5
- Batch Size: 16
- Learning Rate: 2e-5
- Framework: Hugging Face Transformers
- Hardware: NVIDIA A100 (Google Colab)
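For readers who want to reproduce a similar setup, the hyperparameters above map onto `Seq2SeqTrainingArguments` roughly as follows. This is a sketch under assumptions: the reported batch size is treated as a per-device value (plausible for a single GPU), `build_training_args` is a hypothetical helper, and all settings not listed in the card are left at library defaults.

```python
BASE_MODEL = "facebook/nllb-200-distilled-600M"

# Values reported in the card, keyed by Seq2SeqTrainingArguments parameter names.
HYPERPARAMS = {
    "num_train_epochs": 5,
    "per_device_train_batch_size": 16,  # assumption: reported batch size is per device
    "learning_rate": 2e-5,
}


def build_training_args(output_dir):
    """Build training arguments from the reported values (hypothetical helper)."""
    from transformers import Seq2SeqTrainingArguments

    return Seq2SeqTrainingArguments(output_dir=output_dir, **HYPERPARAMS)
```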
Performance
Evaluation was conducted on held-out test data (~30% of the dataset).
| Direction | Exact Match | BLEU |
|---|---|---|
| Chagatai → English | 12.86% | 35.10 |
| Chagatai → Kazakh | 18.70% | 41.19 |
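A held-out split of roughly 30% can be produced with a seeded shuffle, as in the sketch below. The function name, the seed, and the use of a purely random split are assumptions; the card does not describe how the test set was selected.

```python
import random


def split_examples(examples, test_fraction=0.3, seed=0):
    """Shuffle deterministically and return (train, test) lists."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]
```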
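Note:
Exact Match counts a prediction only when it is identical to the reference, so scores are relatively low due to:
- synonym variation,
- morphological differences,
- article usage in English.
BLEU scores of 35.10 (English) and 41.19 (Kazakh) indicate strong performance for a low-resource historical language.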
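To see why a single article or synonym is enough to lose an Exact Match point, here is a minimal sketch of the metric. Stripping surrounding whitespace is an assumed normalization, and the example strings are made up for illustration; the card does not specify how the reported scores were computed.

```python
def exact_match(predictions, references):
    """Percentage of predictions identical to their reference (after strip)."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return 100.0 * hits / len(references)


# Made-up outputs: the second pair differs only in the article.
preds = ["the khan came", "a horse"]
refs = ["the khan came", "the horse"]
score = exact_match(preds, refs)  # 50.0
```

An overlap-based metric such as BLEU still gives partial credit to the second pair, which is consistent with the gap between the two columns in the table.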
Acknowledgements
This work was conducted within the framework of the project:
BR28712621
“Handwritten Heritage of Kazakhstan: registration, restoration, scientific cataloging, digitization, and comprehensive codicological studies.”
Model tree for chagatai-project/chagatai-model
Base model
facebook/nllb-200-distilled-600M