Chagatai (Arabic) → English / Kazakh Translation Model

Model Description

This model is a fine-tuned version of facebook/nllb-200-distilled-600M that translates
from Chagatai (Arabic script) into English and Kazakh (Cyrillic).

Chagatai is a historical Turkic literary language used in Central Asia until the early 20th century.
The model focuses exclusively on Chagatai written in Arabic script.

Supported Translation Directions

Chagatai (Arabic script) → English
Chagatai (Arabic script) → Kazakh (Cyrillic)
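
The two directions above could be exercised with a sketch like the following. It is not taken from the authors' code: it assumes the checkpoint loads through the standard Transformers seq2seq classes, that the repository id is chagatai-project/chagatai-model (as shown on the model page), and that you have accepted the access conditions and are logged in to the Hub. The helper names `target_token` and `translate` are hypothetical.

```python
# Language tokens listed in this model card.
SOURCE_LANG = "chg_Arab"
TARGET_LANGS = {"english": "eng_Latn", "kazakh": "kaz_Cyrl"}

# Assumption: repository id as shown on the model page; access is gated.
MODEL_ID = "chagatai-project/chagatai-model"


def target_token(target: str) -> str:
    """Map a human-readable target name to its language token."""
    try:
        return TARGET_LANGS[target.lower()]
    except KeyError:
        raise ValueError(
            f"unsupported target {target!r}; choose from {sorted(TARGET_LANGS)}"
        ) from None


def translate(text: str, target: str, model_id: str = MODEL_ID,
              max_new_tokens: int = 128) -> str:
    """Translate Chagatai (Arabic script) text into English or Kazakh."""
    # Imported lazily so the helper above works without transformers installed.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id, src_lang=SOURCE_LANG)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

    inputs = tokenizer(text, return_tensors="pt")
    output_ids = model.generate(
        **inputs,
        # NLLB-style models select the output language via the first decoded token.
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(target_token(target)),
        max_new_tokens=max_new_tokens,
    )
    return tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0]
```

Calling `translate(chagatai_text, "kazakh")` would then request Kazakh (Cyrillic) output. For repeated calls, load the tokenizer and model once and reuse them rather than reloading on every call.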

Training Data

The model was trained on two combined datasets containing both word-level and sentence-level aligned translations.

1. Sözdikqor + QazCorpus Transliteration Lexicon

Source: QazCorpus platform
https://sozdikqor.kz/sozdik/?id=16#page-332
Origin: Digitized texts of Abulghazi Bahadur Khan
Languages Used:
- Chagatai (Arabic script)
- English
- Kazakh
Size: ~2,600 word-level entries

This dataset provides high-quality historical vocabulary aligned with modern Kazakh and English meanings.

2. Digitized Textbook Corpus

Source:
An Introduction to Chaghatay: A Graded Textbook for Reading Central Asian Sources
by Eric Schluessel
https://quod.lib.umich.edu/m/maize/images/mpub10110094.pdf
Structure: Chapter-based educational corpus
Languages Used:
- Chagatai (Arabic script) — chg_Arab
- English — eng_Latn
- Kazakh — kaz_Cyrl

Each chapter includes:

- complete sentences, and
- individual vocabulary items used in those sentences.

This structure allows the model to learn both lexical translations and contextual sentence usage.
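
One way to represent such aligned material is sketched below. The card does not describe the actual file format, so the `Example` record and the `expand_row` helper are hypothetical; the idea is only that a single Chagatai item with English and Kazakh renderings yields one training pair per target language, whether the item is a word or a sentence.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Example:
    """One source/target pair tagged with its language tokens."""
    source: str
    target: str
    src_lang: str
    tgt_lang: str


def expand_row(chagatai: str,
               english: Optional[str] = None,
               kazakh: Optional[str] = None) -> List[Example]:
    """Turn one aligned row into one training pair per available target."""
    examples = []
    if english:
        examples.append(Example(chagatai, english, "chg_Arab", "eng_Latn"))
    if kazakh:
        examples.append(Example(chagatai, kazakh, "chg_Arab", "kaz_Cyrl"))
    return examples


# Placeholder strings stand in for real corpus text.
pairs = expand_row("<chagatai item>", english="<english gloss>", kazakh="<kazakh gloss>")
print([p.tgt_lang for p in pairs])
```

Rows that lack one of the two translations simply produce a single pair.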

Total Training Data

~5,300 aligned examples in total, comprising both:
- word-level entries
- sentence-level pairs

Language Codes

Since Chagatai is not included in the standard NLLB language set, custom tokens were added.

Language	Script	Token	Notes
Chagatai	Arabic	chg_Arab	Custom token added
English	Latin	eng_Latn	Standard NLLB token
Kazakh	Cyrillic	kaz_Cyrl	Standard NLLB token

Note: A chg_Latn token was also added but was not used during training.
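
The card does not say how the custom tokens were added. A common approach with Transformers, shown here as an assumption rather than the authors' procedure, is to register the new tokens on the tokenizer and then resize the model's embedding matrix to match. The helper names `missing_tokens` and `add_language_tokens` are hypothetical.

```python
from typing import Iterable, List

# Custom language tokens named in this model card.
CUSTOM_LANG_TOKENS = ["chg_Arab", "chg_Latn"]


def missing_tokens(existing_vocab: Iterable[str],
                   wanted: Iterable[str] = CUSTOM_LANG_TOKENS) -> List[str]:
    """Return the wanted tokens not yet in the vocabulary, preserving order."""
    known = set(existing_vocab)
    return [tok for tok in wanted if tok not in known]


def add_language_tokens(base_model: str = "facebook/nllb-200-distilled-600M"):
    """Load the base model and extend it with the custom language tokens."""
    # Imported lazily so missing_tokens() works without transformers installed.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(base_model)
    model = AutoModelForSeq2SeqLM.from_pretrained(base_model)

    new_tokens = missing_tokens(tokenizer.get_vocab())
    tokenizer.add_tokens(new_tokens, special_tokens=True)

    # The embedding rows for new tokens are freshly initialized, so they only
    # become meaningful through fine-tuning.
    model.resize_token_embeddings(len(tokenizer))
    return tokenizer, model
```

The already fine-tuned checkpoint should not need this step, since its tokenizer presumably ships with the added tokens.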

Training Details

Base Model: facebook/nllb-200-distilled-600M
Epochs: 5
Batch Size: 16
Learning Rate: 2e-5
Framework: Hugging Face Transformers
Hardware: NVIDIA A100 (Google Colab)
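
To give a sense of scale, the sketch below collects the listed hyperparameters and estimates the number of optimizer steps. Several points are assumptions: the key names follow the keyword arguments of Transformers' `Seq2SeqTrainingArguments` (the card does not say whether the Trainer API was used), the batch size is treated as a single-device batch with no gradient accumulation, and the training set is taken to be the ~5,300 examples minus the ~30% held out for testing.

```python
import math

# Values from this model card; key names mirror Seq2SeqTrainingArguments.
HPARAMS = {
    "num_train_epochs": 5,
    "per_device_train_batch_size": 16,
    "learning_rate": 2e-5,
}


def optimizer_steps(num_examples: int, batch_size: int, epochs: int) -> int:
    """Steps taken when the last partial batch of each epoch is kept."""
    return math.ceil(num_examples / batch_size) * epochs


# Assumption: ~5,300 examples with ~30% held out leaves ~3,710 for training.
approx_train_examples = 5300 - (5300 * 30) // 100

steps = optimizer_steps(
    approx_train_examples,
    HPARAMS["per_device_train_batch_size"],
    HPARAMS["num_train_epochs"],
)
print(steps)  # 1160
```

Roughly a thousand updates is a short fine-tuning run, which is consistent with the small size of the corpus.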

Performance

Evaluation was conducted on held-out test data (~30% of the dataset).

Direction	Exact Match	BLEU
Chagatai → English	12.86%	35.10
Chagatai → Kazakh	18.70%	41.19

Note:
Exact Match scores are relatively low due to:

- synonym variation,
- morphological differences,
- article usage in English.

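The BLEU scores (35.10 for English, 41.19 for Kazakh) are encouraging for a low-resource historical language, although they were measured only on held-out data drawn from the same two sources as the training set.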
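
The sensitivity of Exact Match to small surface differences can be illustrated with a minimal sketch. The card does not state how the metric was computed, so the whitespace stripping here is an assumption, and the example strings are invented for illustration rather than taken from the test set.

```python
from typing import Sequence


def exact_match(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Percentage of predictions identical to their reference string."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return 100.0 * hits / len(predictions)


# Illustrative strings only: an extra article makes the first pair a miss.
preds = ["the book", "he came"]
refs = ["book", "he came"]
print(exact_match(preds, refs))  # 50.0
```

A translation such as "the book" for "book" is arguably acceptable, yet it scores zero under Exact Match, while an n-gram metric like BLEU still gives it partial credit.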

Acknowledgements

This work was conducted within the framework of the project:

BR28712621
“Handwritten Heritage of Kazakhstan: registration, restoration, scientific cataloging, digitization, and comprehensive codicological studies.”
