
Chagatai (Arabic) → English / Kazakh Translation Model

Model Description

This model is a fine-tuned version of NLLB-200-distilled-600M for translation from
Chagatai (Arabic script) into English and Kazakh (Cyrillic).

Chagatai is a historical Turkic literary language used in Central Asia until the early 20th century.
The model focuses exclusively on Chagatai written in Arabic script.

Supported Translation Directions

  • Chagatai (Arabic script) → English
  • Chagatai (Arabic script) → Kazakh (Cyrillic)
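
A minimal usage sketch with Hugging Face Transformers. The repository id and the custom chg_Arab source code are taken from this card; the generation settings are illustrative, not the project's documented defaults:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

MODEL_ID = "chagatai-project/chagatai-model"  # repository id from this card

def translate(text: str, target_lang: str = "eng_Latn") -> str:
    """Translate Chagatai (Arabic script) into English (eng_Latn) or Kazakh (kaz_Cyrl)."""
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_ID)
    # The custom chg_Arab code marks the source language.
    tokenizer.src_lang = "chg_Arab"
    inputs = tokenizer(text, return_tensors="pt")
    # NLLB-style models select the output language by forcing the
    # target-language token as the first decoder token.
    generated = model.generate(
        **inputs,
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(target_lang),
        max_length=128,
    )
    return tokenizer.batch_decode(generated, skip_special_tokens=True)[0]
```

Passing target_lang="kaz_Cyrl" instead selects the Kazakh direction.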

Training Data

The model was trained on two combined datasets containing both word-level and sentence-level aligned translations.

1. Sözdikqor + QazCorpus Transliteration Lexicon

This dataset provides high-quality historical vocabulary aligned with modern Kazakh and English meanings.

2. Digitized Textbook Corpus

  • Source:
    An Introduction to Chaghatay: A Graded Textbook for Reading Central Asian Sources
    by Eric Schluessel
    https://quod.lib.umich.edu/m/maize/images/mpub10110094.pdf
  • Structure: Chapter-based educational corpus
  • Languages Used:
    • Chagatai (Arabic script) — chg_Arab
    • English — eng_Latn
    • Kazakh — kaz_Cyrl

Each chapter includes:

  • complete sentences, and
  • individual vocabulary items used in those sentences.

This structure allows the model to learn both lexical translations and contextual sentence usage.
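
The combined corpus can be pictured as flat source–target records tagged by granularity. The schema below is a hypothetical illustration of that structure, not the project's actual data format:

```python
# Hypothetical record schema: each example pairs a Chagatai source string
# with a target string and tags whether it is a lexical or sentence pair.
# The <...> placeholders stand in for real corpus text.
examples = [
    {"src_lang": "chg_Arab", "tgt_lang": "eng_Latn",
     "level": "word", "src": "<Chagatai word>", "tgt": "<English gloss>"},
    {"src_lang": "chg_Arab", "tgt_lang": "kaz_Cyrl",
     "level": "sentence", "src": "<Chagatai sentence>", "tgt": "<Kazakh translation>"},
]

# Word-level pairs teach lexical mappings; sentence-level pairs teach
# word order and morphology in context.
word_pairs = [e for e in examples if e["level"] == "word"]
sentence_pairs = [e for e in examples if e["level"] == "sentence"]
```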

Total Training Data

  • ~5,300 aligned examples, combining word-level and sentence-level pairs

Language Codes

Since Chagatai is not included in the standard NLLB language set, custom tokens were added.

Language   Script     Token      Notes
Chagatai   Arabic     chg_Arab   Custom token added
English    Latin      eng_Latn   Standard NLLB token
Kazakh     Cyrillic   kaz_Cyrl   Standard NLLB token

Note: a chg_Latn token was also added but was not used during training.
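
The card states only that custom codes were added; the exact procedure is not documented, so the following is an assumed sketch of how a new language code can be registered on the base tokenizer and model before fine-tuning:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

def add_language_code(model_name: str, new_code: str):
    """Register a new language-code token and grow the embedding matrix.

    Sketch only: how codes like chg_Arab were actually added to this
    model is an assumption, not documented on the card.
    """
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    # Add the code as a special token so it is never split by the tokenizer.
    tokenizer.add_tokens([new_code], special_tokens=True)
    # The new token needs an embedding row; its weights are then
    # learned during fine-tuning.
    model.resize_token_embeddings(len(tokenizer))
    return tokenizer, model
```

Called as add_language_code("facebook/nllb-200-distilled-600M", "chg_Arab") before training.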

Training Details

  • Base Model: facebook/nllb-200-distilled-600M
  • Epochs: 5
  • Batch Size: 16
  • Learning Rate: 2e-5
  • Framework: Hugging Face Transformers
  • Hardware: NVIDIA A100 (Google Colab)

Performance

Evaluation was conducted on held-out test data (~30% of the dataset).

Direction            Exact Match   BLEU
Chagatai → English   12.86%        35.10
Chagatai → Kazakh    18.70%        41.19

Note: Exact Match scores are relatively low due to:

  • synonym variation,
  • morphological differences,
  • article usage in English.

BLEU scores indicate strong performance for a low-resource historical language.
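
Exact Match presumably counts only predictions identical to the reference after light normalization, which is why a synonym or a changed article scores zero even when the translation is acceptable. A self-contained sketch of such a metric (the normalization rule is an assumption):

```python
def exact_match(predictions, references):
    """Percentage of predictions equal to the reference after normalization."""
    def norm(s: str) -> str:
        # Assumed normalization: lowercase and collapse whitespace.
        return " ".join(s.lower().split())
    hits = sum(norm(p) == norm(r) for p, r in zip(predictions, references))
    return 100.0 * hits / len(references)

# A synonym ("big" vs "large") and a different article make this pair
# count as a miss, even though the meaning matches:
score = exact_match(["the big city"], ["a large city"])  # -> 0.0
```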

Acknowledgements

This work was conducted within the framework of the project:

BR28712621
“Handwritten Heritage of Kazakhstan: registration, restoration, scientific cataloging, digitization, and comprehensive codicological studies.”
