---
base_model:
  - facebook/nllb-200-distilled-600M
pipeline_tag: translation
language:
  - chg
  - kaz
  - eng
license: apache-2.0
---

Chagatai (Arabic) → English / Kazakh Translation Model

Model Description

This model is a fine-tuned version of facebook/nllb-200-distilled-600M for translating
Chagatai (Arabic script) into English and Kazakh (Cyrillic).

Chagatai is a historical Turkic literary language used in Central Asia until the early 20th century.
The model focuses exclusively on Chagatai written in Arabic script.

Supported Translation Directions

Chagatai (Arabic script) → English
Chagatai (Arabic script) → Kazakh (Cyrillic)
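
A minimal inference sketch for the two directions above, using the Hugging Face Transformers seq2seq API and the language tokens listed under Language Codes. The repository id `chagatai-project/chagatai-model` and the helper names `target_token` and `translate` are assumptions for illustration, not part of the released code. Passing `src_lang="chg_Arab"` also assumes the custom token was saved with the tokenizer.

```python
# Language tokens as listed under "Language Codes".
SOURCE_LANG = "chg_Arab"
TARGET_LANGS = {"english": "eng_Latn", "kazakh": "kaz_Cyrl"}

# Assumption: Hub repository id; replace it with the actual one.
MODEL_ID = "chagatai-project/chagatai-model"


def target_token(language: str) -> str:
    """Map a target language name to its NLLB language token."""
    try:
        return TARGET_LANGS[language.lower()]
    except KeyError:
        raise ValueError(f"unsupported target language: {language!r}") from None


def translate(text: str, language: str, model_id: str = MODEL_ID,
              max_new_tokens: int = 128) -> str:
    """Translate one Chagatai (Arabic script) string into English or Kazakh."""
    # Imported lazily so target_token() works without transformers installed.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id, src_lang=SOURCE_LANG)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

    inputs = tokenizer(text, return_tensors="pt")
    output_ids = model.generate(
        **inputs,
        # NLLB selects the output language through the first generated token.
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(target_token(language)),
        max_new_tokens=max_new_tokens,
    )
    return tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0]
```

Calling `translate(text, "english")` or `translate(text, "kazakh")` selects the direction. For repeated calls, it would be more efficient to load the tokenizer and model once and reuse them.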

Training Data

The model was trained on two combined datasets containing both word-level and sentence-level aligned translations.

1. Sözdikqor + QazCorpus Transliteration Lexicon

Source: QazCorpus platform
https://sozdikqor.kz/sozdik/?id=16#page-332
Origin: Digitized texts of Abulghazi Bahadur Khan
Languages Used:
- Chagatai (Arabic script)
- English
- Kazakh
Size: ~2,600 word-level entries

This dataset provides high-quality historical vocabulary aligned with modern Kazakh and English meanings.

2. Digitized Textbook Corpus

Source:
An Introduction to Chaghatay: A Graded Textbook for Reading Central Asian Sources
by Eric Schluessel
https://quod.lib.umich.edu/m/maize/images/mpub10110094.pdf
Structure: Chapter-based educational corpus
Languages Used:
- Chagatai (Arabic script) — chg_Arab
- English — eng_Latn
- Kazakh — kaz_Cyrl

Each chapter includes:

- complete sentences, and
- individual vocabulary items used in those sentences.

This structure allows the model to learn both lexical translations and contextual sentence usage.

Total Training Data

~5,300 aligned examples in total, combining:
- word-level pairs
- sentence-level pairs

Language Codes

Since Chagatai is not included in the standard NLLB language set, custom tokens were added.

| Language | Script | Token | Notes |
| --- | --- | --- | --- |
| Chagatai | Arabic | chg_Arab | Custom token added |
| English | Latin | eng_Latn | Standard NLLB token |
| Kazakh | Cyrillic | kaz_Cyrl | Standard NLLB token |

Note: chg_Latn token was also added but not used during training.
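
One way such tokens could be added to the base model is sketched below. This is an assumption about the procedure, not the project's confirmed method: it registers the codes with `add_tokens(..., special_tokens=True)` and then resizes the embedding matrix. The function names `add_language_tokens` and `load_base_with_chagatai` are hypothetical.

```python
NEW_LANG_TOKENS = ("chg_Arab", "chg_Latn")


def add_language_tokens(tokenizer, model, new_tokens=NEW_LANG_TOKENS):
    """Register extra language tokens and grow the embedding matrix to match.

    Returns the number of tokens that were actually new to the vocabulary.
    """
    # special_tokens=True keeps each code as a single, unsplittable token,
    # like the built-in NLLB language codes.
    added = tokenizer.add_tokens(list(new_tokens), special_tokens=True)
    if added:
        # New ids need embedding rows; these are freshly initialised and
        # only become meaningful through fine-tuning.
        model.resize_token_embeddings(len(tokenizer))
    return added


def load_base_with_chagatai(base_model="facebook/nllb-200-distilled-600M"):
    """Load the base checkpoint and extend it with the Chagatai tokens."""
    # Imported lazily so add_language_tokens() can be used on its own.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(base_model)
    model = AutoModelForSeq2SeqLM.from_pretrained(base_model)
    add_language_tokens(tokenizer, model)
    return tokenizer, model
```

Because the new embedding rows carry no prior knowledge of Chagatai, the quality of the `chg_Arab` representation depends entirely on the fine-tuning data.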

Training Details

Base Model: facebook/nllb-200-distilled-600M
Epochs: 5
Batch Size: 16
Learning Rate: 2e-5
Framework: Hugging Face Transformers
Hardware: NVIDIA A100 (Google Colab)
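
The hyperparameters above could be expressed as Transformers training arguments roughly as follows. This is a sketch under stated assumptions: the card does not say whether `Seq2SeqTrainer` was used, the batch size is treated as a per-device value on a single GPU with no gradient accumulation, and the output directory name and both function names are hypothetical.

```python
import math

# Values taken from the list above; the argument names are those of
# transformers.Seq2SeqTrainingArguments.
HYPERPARAMS = {
    "num_train_epochs": 5,
    "per_device_train_batch_size": 16,
    "learning_rate": 2e-5,
}


def estimated_optimizer_steps(n_examples, train_fraction, batch_size, epochs):
    """Rough number of optimizer steps for a single-GPU run."""
    n_train = round(n_examples * train_fraction)
    return math.ceil(n_train / batch_size) * epochs


def build_training_args(output_dir="chagatai-nllb"):
    """Build training arguments; output_dir is a placeholder name."""
    # Imported lazily so the helpers above work without transformers installed.
    from transformers import Seq2SeqTrainingArguments

    return Seq2SeqTrainingArguments(
        output_dir=output_dir,
        predict_with_generate=True,  # generate text during evaluation
        **HYPERPARAMS,
    )
```

If about 70% of the ~5,300 examples are used for training (the Performance section reports a ~30% held-out set), `estimated_optimizer_steps(5300, 0.7, 16, 5)` gives 1,160 steps, which is a small training budget by fine-tuning standards.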

Performance

The model was evaluated on a held-out test set (~30% of the dataset).

| Direction | Exact Match | BLEU |
| --- | --- | --- |
| Chagatai → English | 12.86% | 35.10 |
| Chagatai → Kazakh | 18.70% | 41.19 |

Note: Exact Match scores are relatively low due to:

- synonym variation,
- morphological differences,
- article usage in English.

BLEU scores of 35.10 (English) and 41.19 (Kazakh) indicate strong performance for a low-resource historical language.
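
The two metrics can be computed along the following lines. This is a sketch, not the project's evaluation script: the card does not state which BLEU implementation or text normalisation was used, so the use of `sacrebleu` and whitespace trimming here are assumptions, and both function names are hypothetical.

```python
def exact_match(predictions, references):
    """Percentage of predictions identical to their reference (whitespace-trimmed)."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return 100.0 * hits / len(predictions)


def corpus_bleu(predictions, references):
    """Corpus-level BLEU with one reference per prediction."""
    # Imported lazily; sacrebleu is an assumed choice of BLEU implementation.
    import sacrebleu

    return sacrebleu.corpus_bleu(list(predictions), [list(references)]).score
```

With made-up toy strings (not model outputs), `exact_match(["the khan", "a horse"], ["the khan", "the horse"])` returns `50.0`: the second pair differs only in its article, yet counts as a complete miss. BLEU gives partial credit for the overlapping words, which is why the two metrics diverge so sharply in the table above.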

Acknowledgements

This work was conducted within the framework of the project:

BR28712621
“Handwritten Heritage of Kazakhstan: registration, restoration, scientific cataloging, digitization, and comprehensive codicological studies.”