kaa-gpt: A Proof of Concept for Karakalpak AI

kaa-gpt is an 80M-parameter generative model built to show that a low-resource language can get high-quality results from a small model trained on dedicated, manually curated data. It is a proof of concept (PoC) intended to pave the way for future versions with 1B+ parameters.

Project Vision

Most global LLMs overlook the Karakalpak language. This project aims to:

  1. Preserve: Digitally safeguard the Karakalpak linguistic heritage.
  2. Empower: Provide a foundation for Karakalpak-native AI tools.
  3. Scale: Show that a model trained on small-scale, high-quality data can outperform generic multilingual models on a specific language.

Data Source

The training data was manually collected and curated from publicly available sources (literature, news, and official records). It was then cleaned and formatted specifically for this task, with the aim of preserving high linguistic fidelity.
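
The card does not describe the cleaning steps themselves, so the following is only a minimal sketch of what such a pass might look like, assuming a line-based plain-text corpus. The function names `clean_line` and `clean_corpus` are hypothetical, and the specific steps (Unicode NFC normalization, control-character removal, whitespace collapsing, exact-duplicate removal) are illustrative assumptions, not the project's actual pipeline.

```python
import re
import unicodedata


def clean_line(line: str) -> str:
    """Normalize one line of text (hypothetical helper, not from the project)."""
    # Compose characters so that visually identical strings compare equal.
    line = unicodedata.normalize("NFC", line)
    # Drop control characters, keeping ordinary whitespace for the next step.
    line = "".join(
        ch for ch in line
        if unicodedata.category(ch) != "Cc" or ch in "\t\n "
    )
    # Collapse runs of whitespace and trim the ends.
    return re.sub(r"\s+", " ", line).strip()


def clean_corpus(lines):
    """Clean every line, dropping empty lines and exact duplicates."""
    seen = set()
    cleaned = []
    for raw in lines:
        text = clean_line(raw)
        if text and text not in seen:
            seen.add(text)
            cleaned.append(text)
    return cleaned
```

A real pipeline for Karakalpak would likely need language-specific decisions on top of this (for example, how to handle text in different scripts), which this sketch deliberately leaves out.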

Future Roadmap

  • Phase 1: 80M Parameter PoC (Current)
  • Phase 2: Expanded Data Collection
  • Phase 3: 1B Parameter General-Purpose Model

Model Details

Model size: 82.9M params (Safetensors format, F32 tensors)
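
As a rough sanity check on the figures above, the size of the weights can be estimated from the parameter count and tensor type. This is a back-of-the-envelope sketch, not a measurement: the 82.9M figure is rounded, file metadata is ignored, and the F16 entry is included only as a hypothetical comparison, since the card lists F32.

```python
# Bytes used by one parameter for each tensor type.
BYTES_PER_PARAM = {"F32": 4, "F16": 2}  # F16 is a hypothetical comparison only


def weight_bytes(n_params: int, dtype: str = "F32") -> int:
    """Approximate size of the raw weights in bytes."""
    return n_params * BYTES_PER_PARAM[dtype]


n_params = 82_900_000  # rounded count reported on the card
size_mb = weight_bytes(n_params) / 1_000_000
print(f"Approximate F32 weight size: {size_mb:.1f} MB")
```

Under these assumptions the F32 weights come to roughly 330 MB, which is small enough to load on most consumer hardware.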