kaa-gpt: A Proof of Concept for Karakalpak AI

kaa-gpt is an 80M-parameter generative language model built to demonstrate that a small model trained on dedicated, manually curated data can produce high-quality results for a low-resource language. It is a proof of concept (PoC) that paves the way for future versions with 1B+ parameters.

Project Vision

Most global LLMs overlook the Karakalpak language. This project aims to:

Preserve: Digitally safeguard the Karakalpak linguistic heritage.
Empower: Provide a foundation for Karakalpak-native AI tools.
Scale: Show that a small model trained on high-quality, language-specific data can outperform generic multilingual models on that language.

Data Source

The training data was manually collected and curated from publicly available sources (literature, news, and official records). It has been cleaned and formatted specifically for this task to ensure high linguistic fidelity.

Future Roadmap

Phase 1: 80M Parameter PoC (Current)
Phase 2: Expanded Data Collection
Phase 3: 1B Parameter General-Purpose Model

🌟 Support My Research

This project is a solo effort to digitize the Karakalpak language, built on a $250 laptop funded by manual labor. Your support helps me cover the costs of:

LLM APIs: Claude 3.5 and Gemini 1.5, used for high-quality data cleaning.
Compute: Renting GPUs for training the next version of Karakalpak models.

Support via Binance:

Binance Pay ID: 1207254817

Even a small contribution helps keep this project open-source and free for everyone.

Safetensors

Model size: 82.9M params
Tensor type: F32
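
The parameter count and tensor type listed above are enough for a rough estimate of how much space the weights take on disk or in memory. The sketch below is only back-of-the-envelope arithmetic: it assumes the rounded figure of 82.9M parameters, counts weights only (no optimizer state, activations, or file metadata), and the helper name `weight_bytes` is hypothetical rather than part of any library.

```python
# Rough size estimate for the model weights alone.
# Assumption: 82.9M is a rounded parameter count, so the result is approximate.
PARAM_COUNT = 82_900_000

# Bytes per parameter for common tensor types; F32 is the type listed for this model.
BYTES_PER_PARAM = {"F32": 4, "F16": 2, "BF16": 2}


def weight_bytes(param_count: int, tensor_type: str) -> int:
    """Return the approximate number of bytes needed to store the weights."""
    return param_count * BYTES_PER_PARAM[tensor_type]


if __name__ == "__main__":
    size = weight_bytes(PARAM_COUNT, "F32")
    print(f"F32 weights: {size / 1e6:.1f} MB ({size / 2**20:.1f} MiB)")
```

At F32 this comes to roughly 332 MB. A half-precision copy would need about half that, which matters when running the model on low-cost hardware.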