metadata
base_model:
  - facebook/nllb-200-distilled-600M
pipeline_tag: translation
language:
  - chg
  - kaz
  - eng
license: apache-2.0

Chagatai (Arabic) → English / Kazakh Translation Model

Model Description

This model is a fine-tuned version of facebook/nllb-200-distilled-600M that translates
Chagatai (Arabic script) into English and Kazakh (Cyrillic).

Chagatai is a historical Turkic literary language used in Central Asia until the early 20th century.
The model focuses exclusively on Chagatai written in Arabic script.

Supported Translation Directions

  • Chagatai (Arabic script) → English
  • Chagatai (Arabic script) → Kazakh (Cyrillic)
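
A minimal inference sketch for the two directions above. It assumes the checkpoint loads with the standard Transformers `AutoTokenizer` and `AutoModelForSeq2SeqLM` classes and that its tokenizer accepts `chg_Arab` as a source language. The helper name `translate` and the `max_new_tokens` default are illustrative choices, not part of the released code.

```python
def translate(text, tokenizer, model, tgt_lang="eng_Latn",
              src_lang="chg_Arab", max_new_tokens=128):
    """Translate one Chagatai string with an NLLB-style tokenizer/model pair.

    tokenizer and model are expected to come from
    AutoTokenizer.from_pretrained(...) and
    AutoModelForSeq2SeqLM.from_pretrained(...).
    """
    # NLLB tokenizers prepend the source language tag to the input.
    tokenizer.src_lang = src_lang
    inputs = tokenizer(text, return_tensors="pt")
    output_ids = model.generate(
        **inputs,
        # Forcing the first generated token selects the output language.
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang),
        max_new_tokens=max_new_tokens,  # assumed cap, tune for longer inputs
    )
    return tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0]
```

After loading the tokenizer and model from this repository, `translate(text, tokenizer, model)` would return English, and `translate(text, tokenizer, model, tgt_lang="kaz_Cyrl")` would return Kazakh.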

Training Data

The model was trained on two combined datasets containing both word-level and sentence-level aligned translations.

1. Sözdikqor + QazCorpus Transliteration Lexicon

This dataset provides high-quality historical vocabulary aligned with modern Kazakh and English meanings.

2. Digitized Textbook Corpus

  • Source:
    An Introduction to Chaghatay: A Graded Textbook for Reading Central Asian Sources
    by Eric Schluessel
    https://quod.lib.umich.edu/m/maize/images/mpub10110094.pdf
  • Structure: Chapter-based educational corpus
  • Languages Used:
    • Chagatai (Arabic script) — chg_Arab
    • English — eng_Latn
    • Kazakh — kaz_Cyrl

Each chapter includes:

  • complete sentences, and
  • individual vocabulary items used in those sentences.

This structure allows the model to learn both lexical translations and contextual sentence usage.

Total Training Data

  • ~5,300 aligned examples
    • word-level
    • sentence-level

Language Codes

Because Chagatai is not part of the standard NLLB language set, custom language tokens were added to the tokenizer.

| Language | Script   | Token    | Notes               |
|----------|----------|----------|---------------------|
| Chagatai | Arabic   | chg_Arab | Custom token added  |
| English  | Latin    | eng_Latn | Standard NLLB token |
| Kazakh   | Cyrillic | kaz_Cyrl | Standard NLLB token |

Note: A chg_Latn token was also added but was not used during training.
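
One common way to register extra language tags with a Transformers tokenizer is sketched below. This is an assumption about the general approach, not the authors' actual script. The helper name `add_language_tokens` is hypothetical, while `get_vocab`, `add_tokens`, and `resize_token_embeddings` are existing Transformers methods.

```python
def add_language_tokens(tokenizer, model, codes=("chg_Arab", "chg_Latn")):
    """Register new language tags and grow the embedding matrix to match.

    Returns the number of tokens that were actually added.
    """
    vocab = tokenizer.get_vocab()
    new_codes = [code for code in codes if code not in vocab]
    # special_tokens=True keeps each tag as a single token that is never split.
    num_added = tokenizer.add_tokens(new_codes, special_tokens=True)
    if num_added:
        # The new embedding rows start untrained and are learned in fine-tuning.
        model.resize_token_embeddings(len(tokenizer))
    return num_added
```

Calling the helper a second time adds nothing, because tags already in the vocabulary are filtered out first.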

Training Details

  • Base Model: facebook/nllb-200-distilled-600M
  • Epochs: 5
  • Batch Size: 16
  • Learning Rate: 2e-5
  • Framework: Hugging Face Transformers
  • Hardware: NVIDIA A100 (Google Colab)
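
For reference, the listed hyperparameters map onto `Seq2SeqTrainingArguments` keyword names as sketched below, together with a rough step count. Two assumptions are made here and are not stated in this card: the batch size of 16 is per device on a single GPU with no gradient accumulation, and roughly 70% of the ~5,300 examples were used for training (the evaluation reports a ~30% held-out set).

```python
import math

# Values from the list above, keyed by the matching
# Seq2SeqTrainingArguments keyword names in Hugging Face Transformers.
HPARAMS = {
    "num_train_epochs": 5,
    "per_device_train_batch_size": 16,  # assumed to be per device
    "learning_rate": 2e-5,
}

def estimate_optimizer_steps(n_train, hparams=HPARAMS):
    """Optimizer steps on one device, without gradient accumulation."""
    batch_size = hparams["per_device_train_batch_size"]
    steps_per_epoch = math.ceil(n_train / batch_size)
    return steps_per_epoch * hparams["num_train_epochs"]

# Assumption: about 70% of ~5,300 examples form the training split.
n_train = 5300 * 70 // 100
total_steps = estimate_optimizer_steps(n_train)
print(n_train, total_steps)
```

Under these assumptions the run is short (on the order of a thousand optimizer steps), which is typical when fine-tuning on a few thousand examples.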

Performance

Evaluation was conducted on held-out test data (~30% of the dataset).

| Direction          | Exact Match | BLEU  |
|--------------------|-------------|-------|
| Chagatai → English | 12.86%      | 35.10 |
| Chagatai → Kazakh  | 18.70%      | 41.19 |

Note: Exact Match scores are relatively low due to:

  • synonym variation,
  • morphological differences,
  • article usage in English.

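For a low-resource historical language, BLEU scores of 35.10 (English) and 41.19 (Kazakh) are encouraging, although they were measured on a held-out portion of the same sources used for training.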
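
The sketch below illustrates why Exact Match is so strict. This card does not say how the metric was computed, so the whitespace-only normalization is an assumption, and the sentences are made-up English examples rather than items from the test set.

```python
def _normalize(text):
    """Collapse runs of whitespace; case and punctuation are left untouched."""
    return " ".join(text.split())

def exact_match(predictions, references):
    """Percentage of predictions identical to their reference string."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    hits = sum(_normalize(p) == _normalize(r)
               for p, r in zip(predictions, references))
    return 100.0 * hits / len(predictions)

# Made-up outputs: one identical, one article difference, one synonym.
preds = ["the king came", "he wrote a book", "big city"]
refs = ["the king came", "he wrote the book", "large city"]
score = exact_match(preds, refs)
```

Only the first pair counts as a hit, even though the other two are acceptable translations. BLEU still gives partial credit for the overlapping words in such cases, which helps explain the gap between the two columns.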

Acknowledgements

This work was conducted within the framework of the project:

BR28712621
“Handwritten Heritage of Kazakhstan: registration, restoration, scientific cataloging, digitization, and comprehensive codicological studies.”