Til Core 0.5B

Til Core 0.5B is a 498-million-parameter Kazakh language model trained from scratch on a clean Kazakh corpus using a 256K morpheme-aware BPE tokenizer. It is a Qwen2-style decoder-only transformer built by TilQazyna as a compact, efficient foundation model for the Kazakh language.

Til means "language" in Kazakh. Til Core is the base on which task-specific Kazakh models (instruct, grammar correction, translation) can be fine-tuned.

Why a 256K morpheme-aware vocabulary?

Kazakh is highly agglutinative: a single root can take a long chain of suffixes. Standard byte-level BPE fragments such words into many sub-tokens, wasting context and parameters. Til Core uses a 256,000-token morpheme-aware BPE (stukenov/sozkz-morphbpe-256k-kk-v1) that aligns tokens with morphological boundaries, giving ~15–20% better compression on Kazakh text. The trade-off is a heavier embedding table (≈ 229M of the 497.8M parameters), which is absorbed by tying input/output embeddings and using a deeper-than-usual transformer body.
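One way to check a compression claim like this on your own text is to compare tokens per word (often called fertility) between two tokenizers. The sketch below is only an illustration: `tokens_per_word`, `relative_saving` and `two_char_chunks` are hypothetical helper names, and the toy tokenizers stand in for real ones so the snippet runs without downloading anything.

```python
from typing import Callable, Iterable, List


def tokens_per_word(tokenize: Callable[[str], List[str]], texts: Iterable[str]) -> float:
    """Average number of tokens produced per whitespace-separated word."""
    n_tokens = 0
    n_words = 0
    for text in texts:
        n_tokens += len(tokenize(text))
        n_words += len(text.split())
    if n_words == 0:
        raise ValueError("texts contain no words")
    return n_tokens / n_words


def relative_saving(baseline: float, candidate: float) -> float:
    """Fraction of tokens saved by the candidate tokenizer relative to the baseline."""
    return 1.0 - candidate / baseline


def two_char_chunks(text: str) -> List[str]:
    """Toy stand-in for a tokenizer that fragments words: fixed 2-character pieces."""
    return [w[i:i + 2] for w in text.split() for i in range(0, len(w), 2)]


# Toy comparison: a fragmenting tokenizer versus one token per word.
sample = ["мектептерімізде оқушылар"]
fragmenting = tokens_per_word(two_char_chunks, sample)
whole_word = tokens_per_word(str.split, sample)
print(f"fragmenting: {fragmenting:.2f} tokens/word, whole-word: {whole_word:.2f}")
print(f"relative saving: {relative_saving(fragmenting, whole_word):.0%}")
```

To compare real tokenizers, pass each tokenizer's `tokenize` method (for example `tok.tokenize` from a `transformers` tokenizer) in place of the toy functions and use a held-out sample of Kazakh text.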

Model details


Architecture	Qwen2 (decoder-only, SwiGLU, RoPE, GQA)
Parameters	497.8M (embedding ≈ 229M, transformer ≈ 268M)
Vocabulary	256,000 (morpheme-aware BPE)
Hidden size	896
Layers	18
Attention heads	14 (GQA, 2 KV heads)
Intermediate size	4864
Context length	32,768 (`rope_theta` = 1e6)
Tied embeddings	yes
Precision	bf16
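The parameter split in the table can be reproduced from the listed dimensions. The sketch below assumes the usual Hugging Face Qwen2 layout (biases on the q/k/v projections only, no bias on the output projection or MLP, two RMSNorm weights per layer plus a final norm) and a head dimension of 896 / 14 = 64; these details are assumptions, not taken from the card. It also estimates the bf16 KV-cache size implied by GQA with 2 KV heads.

```python
# Dimensions from the model details table.
VOCAB = 256_000
HIDDEN = 896
LAYERS = 18
HEADS = 14
KV_HEADS = 2
INTERMEDIATE = 4864
CONTEXT = 32_768

HEAD_DIM = HIDDEN // HEADS        # 64 (assumed: hidden size split evenly over heads)
KV_DIM = KV_HEADS * HEAD_DIM      # 128


def layer_params() -> int:
    """Parameters in one decoder layer, assuming the standard Qwen2 layout."""
    q = HIDDEN * HIDDEN + HIDDEN          # q_proj weight + bias
    k = HIDDEN * KV_DIM + KV_DIM          # k_proj weight + bias
    v = HIDDEN * KV_DIM + KV_DIM          # v_proj weight + bias
    o = HIDDEN * HIDDEN                   # o_proj, no bias
    mlp = 3 * HIDDEN * INTERMEDIATE       # gate, up, down projections (SwiGLU)
    norms = 2 * HIDDEN                    # two RMSNorm weight vectors
    return q + k + v + o + mlp + norms


embedding = VOCAB * HIDDEN                     # counted once because embeddings are tied
transformer = LAYERS * layer_params() + HIDDEN  # plus the final RMSNorm
total = embedding + transformer

# KV cache: keys and values, per layer, per KV head, 2 bytes per bf16 value.
kv_bytes_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * 2
kv_mib_full_context = kv_bytes_per_token * CONTEXT / 2**20

print(f"embedding:   {embedding / 1e6:.1f}M")
print(f"transformer: {transformer / 1e6:.1f}M")
print(f"total:       {total / 1e6:.1f}M")
print(f"KV cache at full context (bf16): {kv_mib_full_context:.0f} MiB per sequence")
```

Under these assumptions the totals agree with the 497.8M figure in the table, with the embedding table accounting for a little under half of the parameters.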

Training


Data	`stukenov/sozkz-corpus-tokenized-kk-morphbpe256k-v1` — pre-tokenized clean Kazakh (~1.44M sequences × 2048 tokens ≈ 2.94B tokens)
Tokens seen	≈ 5.88B (2 epochs)
Steps	11,222
Global batch	524,288 tokens/step (8 × 8 × grad-accum 4 × 2048)
Optimizer	AdamW (β default), weight decay 0.1, grad clip 1.0
LR schedule	4e-4, cosine, 500 warmup steps
Sequence length	2048
Hardware	8 × NVIDIA H200 (140 GB), ~3h15m
Final eval loss	2.436 (validation), perplexity ≈ 11.4
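The learning-rate schedule in the table can be sketched as a function of the step. The card only states the peak (4e-4), the cosine shape, and the 500 warmup steps; the linear warmup and the decay to zero over all 11,222 steps are assumptions here, and `lr_at` is a hypothetical helper name.

```python
import math

PEAK_LR = 4e-4
WARMUP_STEPS = 500
TOTAL_STEPS = 11_222


def lr_at(step: int, min_lr: float = 0.0) -> float:
    """Learning rate at a given step: linear warmup, then cosine decay to min_lr."""
    if step < WARMUP_STEPS:
        # Assumed linear ramp from 0 to the peak.
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (PEAK_LR - min_lr) * (1.0 + math.cos(math.pi * progress))


for s in (0, 250, 500, 5_861, 11_222):
    print(f"step {s:>6}: lr = {lr_at(s):.2e}")
```

If the actual run used a non-zero floor, pass it as `min_lr`; the shape of the curve is otherwise unchanged.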

Chinchilla-style budget: ~498M params with ≈5.9B tokens (~11.8 tokens/param).
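The training figures above are mutually consistent, which a few lines of arithmetic confirm. In the sketch below, reading the global batch as 8 devices × micro-batch 8 × gradient accumulation 4 is an assumption about what the two leading factors of 8 mean; the product is the same either way.

```python
import math

SEQ_LEN = 2048
STEPS = 11_222
PARAMS = 497.8e6
EVAL_LOSS = 2.436

# Assumed reading of "8 × 8 × grad-accum 4 × 2048":
# 8 devices × micro-batch 8 × 4 accumulation steps × sequence length.
tokens_per_step = 8 * 8 * 4 * SEQ_LEN
tokens_seen = STEPS * tokens_per_step
tokens_per_param = tokens_seen / PARAMS

# Perplexity is the exponential of the mean cross-entropy loss (in nats).
perplexity = math.exp(EVAL_LOSS)

print(f"tokens per step:  {tokens_per_step:,}")
print(f"tokens seen:      {tokens_seen / 1e9:.2f}B")
print(f"tokens per param: {tokens_per_param:.1f}")
print(f"perplexity:       {perplexity:.1f}")
```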

Usage

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "TilQazyna/Til-Core-0.5B"
tok = AutoTokenizer.from_pretrained(repo)
# Load the weights in bf16; older transformers releases expect torch_dtype= instead of dtype=.
model = AutoModelForCausalLM.from_pretrained(repo, dtype=torch.bfloat16, device_map="auto").eval()

# Base model: give it text to continue, not a chat-style instruction.
prompt = "Абай Құнанбайұлы — қазақтың"
ids = tok(prompt, return_tensors="pt").to(model.device)
# Nucleus sampling, with a repetition penalty to discourage loops.
out = model.generate(**ids, max_new_tokens=60, do_sample=True,
                     temperature=0.8, top_p=0.9, repetition_penalty=1.2)
print(tok.decode(out[0], skip_special_tokens=True))

The tokenizer is bundled with this repository (tokenizer.json, tokenizer_config.json).

Sample generations

Қазақстан Республикасының астанасы
→ … Астана қаласында орналасқан, Қазақстан Республикасы Президентінің
  резиденциясы. Сарайдың негізгі ғимараттары: «Ақорда» залы …

Абай Құнанбайұлы — қазақтың
→ … рухани мәдениетінің көрнекті өкілі. Ол – ақын, ағартушы, жазба
  әдебиетінің негізін салушы әрі дамытушы …

Жасанды интеллект дегеніміз —
→ … ақпаратты беру мен оны өңдеудің үздіксіз және тиімді жұмыс жасауын
  қамтамасыз ететін технологиялар жиынтығы.

Limitations

Base model, not instruction-tuned: it continues text and does not follow chat instructions out of the box. Fine-tune it for downstream tasks.
Trained on web/encyclopedic Kazakh, so it can emit corpus artifacts (URLs, site names, boilerplate).
No safety alignment — outputs are unfiltered.
Knowledge is limited to the training corpus.

Citation

@misc{tilcore05b2026,
  title  = {Til Core 0.5B: a morpheme-aware Kazakh language model},
  author = {TilQazyna},
  year   = {2026},
  url    = {https://huggingface.co/TilQazyna/Til-Core-0.5B}
}

Tokenizer: stukenov/sozkz-morphbpe-256k-kk-v1 · Dataset: stukenov/sozkz-corpus-tokenized-kk-morphbpe256k-v1
