Til Core 0.5B

Til Core 0.5B is a 498-million-parameter Kazakh language model trained from scratch on a clean Kazakh corpus using a 256K morpheme-aware BPE tokenizer. It is a Qwen2-style decoder-only transformer built by TilQazyna as a compact, efficient foundation model for the Kazakh language.

Til means "language" in Kazakh. Til Core is the base on which task-specific Kazakh models (instruct, grammar correction, translation) can be fine-tuned.

Why a 256K morpheme-aware vocabulary?

Kazakh is highly agglutinative: a single root can take a long chain of suffixes. Standard byte-level BPE fragments such words into many sub-tokens, wasting context and parameters. Til Core uses a 256,000-token morpheme-aware BPE tokenizer (stukenov/sozkz-morphbpe-256k-kk-v1) that aligns tokens with morphological boundaries, giving ~15–20% better compression on Kazakh text. The trade-off, a heavier embedding table, is absorbed by tying the input and output embeddings and using a deeper-than-usual transformer body.
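One common way to quantify tokenizer compression is the average number of tokens per word. The sketch below is a minimal illustration, not the evaluation behind the figure quoted above: `tokens_per_word`, `char_tokenizer` and `word_tokenizer` are hypothetical names, and the two toy tokenizers are offline stand-ins for real ones.

```python
from typing import Callable, Iterable, List


def tokens_per_word(texts: Iterable[str],
                    tokenize: Callable[[str], List[str]]) -> float:
    """Average tokens per whitespace-separated word (lower means better compression)."""
    n_tokens = 0
    n_words = 0
    for text in texts:
        n_tokens += len(tokenize(text))
        n_words += len(text.split())
    if n_words == 0:
        raise ValueError("no words in input")
    return n_tokens / n_words


# Toy stand-ins so the sketch runs offline; they are NOT the real tokenizers.
def char_tokenizer(text: str) -> List[str]:
    """Worst case: one token per non-space character."""
    return [ch for ch in text if not ch.isspace()]


def word_tokenizer(text: str) -> List[str]:
    """Best case: one token per word."""
    return text.split()


# "үйлерімізде" = үй + лер + іміз + де ("in our houses"): one root, three suffixes.
sample = ["үйлерімізде отырмыз"]
print("char-level:", tokens_per_word(sample, char_tokenizer))
print("word-level:", tokens_per_word(sample, word_tokenizer))

# To compare real tokenizers (needs network access), pass e.g.
# AutoTokenizer.from_pretrained("stukenov/sozkz-morphbpe-256k-kk-v1").tokenize
# as the `tokenize` argument and run over a held-out Kazakh sample.
```

Running the same function over the same held-out text with two tokenizers gives a like-for-like ratio, which is the kind of number a compression claim would need.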

Model details

Architecture: Qwen2 (decoder-only, SwiGLU, RoPE, GQA)
Parameters: 497.8M (embedding ≈ 229M, transformer ≈ 268M)
Vocabulary: 256,000 (morpheme-aware BPE)
Hidden size: 896
Layers: 18
Attention heads: 14 (GQA, 2 KV heads)
Intermediate size: 4864
Context length: 32,768 (rope_theta = 1e6)
Tied embeddings: yes
Precision: bf16
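The parameter split can be reproduced from the dimensions above. This is a back-of-the-envelope sketch that assumes the standard Qwen2 layout (bias terms on the q/k/v projections, none on the output projection, two RMSNorm weights per layer plus a final norm); the variable names are illustrative.

```python
vocab, hidden, layers = 256_000, 896, 18
heads, kv_heads, intermediate = 14, 2, 4864

head_dim = hidden // heads        # 64
kv_dim = kv_heads * head_dim      # 128, shared across query groups (GQA)

# One matrix serves as both input embedding and output head (tied embeddings).
embedding = vocab * hidden

# Assumption: Qwen2-style attention, bias on q/k/v and no bias on o_proj.
q_proj = hidden * hidden + hidden
kv_proj = 2 * (hidden * kv_dim + kv_dim)
o_proj = hidden * hidden
attention = q_proj + kv_proj + o_proj

mlp = 3 * hidden * intermediate   # SwiGLU: gate, up and down projections
norms = 2 * hidden                # two RMSNorm weight vectors per layer

transformer = layers * (attention + mlp + norms) + hidden  # + final RMSNorm
total = embedding + transformer
untied_total = total + embedding  # cost if the output head were a separate matrix

print(f"embedding   : {embedding / 1e6:.1f}M")
print(f"transformer : {transformer / 1e6:.1f}M")
print(f"total       : {total / 1e6:.1f}M")
print(f"untied total: {untied_total / 1e6:.1f}M")
```

Under these assumptions the embedding table alone accounts for a bit under half of the model, which is why tying it matters at this vocabulary size.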

Training

Data: stukenov/sozkz-corpus-tokenized-kk-morphbpe256k-v1, pre-tokenized clean Kazakh (~1.44M sequences × 2048 tokens ≈ 2.94B tokens)
Tokens seen: ≈ 5.88B (2 epochs)
Steps: 11,222
Global batch: 524,288 tokens/step (8 × 8 × grad-accum 4 × 2048)
Optimizer: AdamW (default β), weight decay 0.1, grad clip 1.0
LR schedule: 4e-4, cosine, 500 warmup steps
Sequence length: 2048
Hardware: 8 × NVIDIA H200 (140 GB), ~3h15m
Final eval loss: 2.436 (validation), perplexity ≈ 11.4
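The learning-rate schedule listed above can be sketched as follows. The peak value, warmup length and step count come from the table; the linear warmup shape and the decay floor (`min_lr`, defaulting to 0) are assumptions, since the card does not state them.

```python
import math

PEAK_LR = 4e-4
WARMUP_STEPS = 500
TOTAL_STEPS = 11_222


def lr_at(step: int, min_lr: float = 0.0) -> float:
    """Linear warmup to PEAK_LR, then cosine decay to min_lr (assumed floor)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (PEAK_LR - min_lr) * (1.0 + math.cos(math.pi * progress))


for step in (0, 250, 500, 5_861, 11_222):
    print(f"step {step:>6}: lr = {lr_at(step):.2e}")
```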

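Chinchilla-style budget: 498M parameters with ≈5.9B tokens seen, or about 11.8 tokens per parameter. That is below the roughly 20 tokens per parameter usually quoted as Chinchilla-optimal, and the tokens seen are two passes over ≈2.94B unique tokens.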
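The training figures are mutually consistent, as a quick arithmetic check shows. The sketch below uses only numbers quoted in this card; the variable names are illustrative.

```python
import math

# Global batch: the four factors quoted in the training table.
tokens_per_step = 8 * 8 * 4 * 2048
steps = 11_222
tokens_seen = tokens_per_step * steps

unique_tokens = 2.94e9   # approximate corpus size from the card
params = 497.8e6         # parameter count from the card
eval_loss = 2.436        # final validation loss (nats per token)

epochs = tokens_seen / unique_tokens
tokens_per_param = tokens_seen / params
perplexity = math.exp(eval_loss)   # perplexity is exp of the cross-entropy loss

print(f"tokens/step      : {tokens_per_step:,}")
print(f"tokens seen      : {tokens_seen / 1e9:.2f}B")
print(f"epochs           : {epochs:.2f}")
print(f"tokens per param : {tokens_per_param:.1f}")
print(f"perplexity       : {perplexity:.1f}")
```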

Usage

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "TilQazyna/Til-Core-0.5B"
tok = AutoTokenizer.from_pretrained(repo)
# Load the weights in bf16; older transformers releases name this argument `torch_dtype`.
model = AutoModelForCausalLM.from_pretrained(repo, dtype=torch.bfloat16, device_map="auto").eval()

# Base model: give it text to continue, not a chat-style instruction.
prompt = "Абай Құнанбайұлы — қазақтың"
ids = tok(prompt, return_tensors="pt").to(model.device)
# Nucleus sampling; the repetition penalty discourages looping output.
out = model.generate(**ids, max_new_tokens=60, do_sample=True,
                     temperature=0.8, top_p=0.9, repetition_penalty=1.2)
print(tok.decode(out[0], skip_special_tokens=True))

The tokenizer is bundled with this repository (tokenizer.json, tokenizer_config.json).

Sample generations

Қазақстан Республикасының астанасы
→ … Астана қаласында орналасқан, Қазақстан Республикасы Президентінің
  резиденциясы. Сарайдың негізгі ғимараттары: «Ақорда» залы …

Абай Құнанбайұлы — қазақтың
→ … рухани мәдениетінің көрнекті өкілі. Ол – ақын, ағартушы, жазба
  әдебиетінің негізін салушы әрі дамытушы …

Жасанды интеллект дегеніміз —
→ … ақпаратты беру мен оны өңдеудің үздіксіз және тиімді жұмыс жасауын
  қамтамасыз ететін технологиялар жиынтығы.

Limitations

  • Base model, not instruction-tuned — it continues text, it does not follow chat instructions out of the box. Fine-tune for downstream tasks.
  • Trained on web/encyclopedic Kazakh, so it can emit corpus artifacts (URLs, site names, boilerplate).
  • No safety alignment — outputs are unfiltered.
  • Knowledge is limited to the training corpus.
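Because the model can emit URLs from its web corpus, a light post-filter on generated text may help in some applications. The sketch below is a heuristic, not part of the released model: `strip_urls` and the pattern are hypothetical, and the regex will miss bare domain names and other boilerplate.

```python
import re

# Heuristic: anything starting with http://, https:// or www. up to the next space.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")


def strip_urls(text: str) -> str:
    """Remove URL-like spans and collapse the whitespace they leave behind."""
    cleaned = URL_RE.sub("", text)
    return re.sub(r"\s{2,}", " ", cleaned).strip()


generated = "Толығырақ: https://example.kz/news оқыңыз"
print(strip_urls(generated))
```

Filtering after decoding keeps the sampling settings untouched; it does not stop the model from spending tokens on the URL in the first place.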

Citation

@misc{tilcore05b2026,
  title  = {Til Core 0.5B: a morpheme-aware Kazakh language model},
  author = {TilQazyna},
  year   = {2026},
  url    = {https://huggingface.co/TilQazyna/Til-Core-0.5B}
}

Tokenizer: stukenov/sozkz-morphbpe-256k-kk-v1 · Dataset: stukenov/sozkz-corpus-tokenized-kk-morphbpe256k-v1
