Saken Tukenov PRO
stukenov
AI & ML interests
None yet
Recent Activity
liked a model 6 days ago: AIDC-AI/Marco-Mini-Base
liked a model 6 days ago: AxisCommunity/OrionPaxAI_1.0V
updated a dataset 7 days ago: stukenov/sozkz-corpus-tokenized-kk-morphbpe256k-v1
SozKZ Vocab: Kazakh Tokenizers
BPE and SentencePiece tokenizers trained on Kazakh text — 32K vocabularies
SozKZ MoE: Mixture of Experts
Mixture-of-Experts models for Kazakh — upcycled and domain-pretrained MoE architectures
SozKZ Core: Kazakh Language Models
Base, instruct, and balanced Kazakh language models trained from scratch — Llama (50M–600M), GPT2, Pythia architectures
- stukenov/sozkz-core-llama-600m-kk-base-v1 (Text Generation, 0.6B)
- stukenov/sozkz-core-llama-600m-kk-instruct-v1 (0.6B)
- stukenov/sozkz-core-llama-300m-kk-base-v1 (Text Generation, 0.3B)
- stukenov/sozkz-core-llama-300m-kk-instruct-v1 (Text Generation, 0.3B)
SozKZ Misc: TTS, Sentiment & Other
Miscellaneous Kazakh AI models and datasets — TTS, sentiment analysis, speech, benchmarks
SozKZ Corpora: Kazakh Training Datasets
Training corpora for Kazakh LLMs — raw, cleaned, deduplicated, tokenized, synthetic, and parallel datasets
SozKZ GEC: Kazakh Grammar Error Correction
Grammar error correction models and datasets for Kazakh — Llama GEC (300M, 600M), mT5 GEC, morphology models
- stukenov/sozkz-core-llama-600m-kk-gec-v1 (Text Generation, 0.6B)
- stukenov/sozkz-core-llama-300m-kk-gec-v1 (Text Generation, 0.3B)
- stukenov/sozkz-core-llama-300m-kk-gec-v2a (Text Generation, 0.3B)
- stukenov/sozkz-core-llama-300m-kk-gec-v2b (Text Generation, 0.3B)
EkiTil: Bilingual Kazakh-Russian Language Models
Qwen3 models (123M/300M/600M) trained from scratch on 2.47B kk+ru tokens. Includes tokenizer, datasets, and checkpoints.