Til Core 1B (base)

A 1.25B-parameter Kazakh base language model, pre-trained from scratch on a deduplicated Kazakh web/text corpus with a 256k morpheme-aware BPE tokenizer (stukenov/sozkz-morphbpe-256k-kk-v1).

This is a base (non-instruct) model: it completes text and does not follow chat instructions. An instruct version is planned (see Roadmap).

Model details

Architecture: Llama-style decoder (RoPE, RMSNorm, SwiGLU, GQA)
Parameters: 1.246 B (tied input/output embeddings)
Hidden size / layers: 2048 / 16
Attention heads: 32 query / 8 KV (GQA)
Intermediate size: 5632
Context length: 2048 tokens
Vocabulary: 256 000 (morpheme-aware BPE)
Precision: bf16
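
As a sanity check, the parameter count can be roughly reproduced from the dimensions listed above. The sketch below is an illustration, not the author's code: the function name `llama_param_count` is hypothetical, and it assumes no bias terms, a head dimension of hidden size / query heads, and two RMSNorm weight vectors per layer plus one final norm.

```python
def llama_param_count(vocab, hidden, layers, n_heads, n_kv_heads,
                      intermediate, tied_embeddings=True):
    """Approximate parameter count of a Llama-style decoder (assumes no biases)."""
    head_dim = hidden // n_heads           # 2048 // 32 = 64
    kv_dim = n_kv_heads * head_dim         # GQA: K and V project to fewer heads
    embed = vocab * hidden                 # token embedding matrix
    attn = 2 * hidden * hidden + 2 * hidden * kv_dim  # Q and O, plus K and V
    mlp = 3 * hidden * intermediate        # SwiGLU: gate, up, down projections
    norms = 2 * hidden                     # two RMSNorm weight vectors per layer
    total = embed + layers * (attn + mlp + norms) + hidden  # + final norm
    if not tied_embeddings:
        total += vocab * hidden            # separate output projection
    return total


n_params = llama_param_count(
    vocab=256_000, hidden=2048, layers=16,
    n_heads=32, n_kv_heads=8, intermediate=5632,
)
embed_share = (256_000 * 2048) / n_params
print(f"{n_params:,} parameters (~{n_params / 1e9:.3f} B)")
print(f"embedding share: {embed_share:.1%}")
```

Under these assumptions the total comes to about 1.246 B, matching the figure above, with the tied embedding matrix accounting for roughly 42 % of the parameters because of the large vocabulary.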

Training

Tokens: 6.26 B (1 epoch)
Train blocks: 3 057 865 × 2048 tokens
Corpus: cleaned, then MinHash-deduplicated (11.29 M of 13.19 M documents kept, 85.6 %)
Hardware: 8 × NVIDIA H200, FSDP full-shard, bf16
Optimizer: AdamW (β 0.9/0.95, weight decay 0.1), cosine LR schedule at 3e-4, 200 warmup steps
Effective batch: 512 blocks (8 × 16 × grad-accum 4) ≈ 1.05 M tokens/step
Throughput: ~313 K tokens/s
Wall-clock: ~5 h 40 m
Final loss: ~2.90 (train)
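
The figures above are mutually consistent, which the following sketch checks with plain arithmetic. Several things here are assumptions rather than facts from the card: "8 × 16 × grad-accum 4" is read as GPUs × per-GPU micro-batch × accumulation steps, a final partial batch is counted as a step, warmup is taken to be linear over 200 steps, and the cosine decay is taken to end at zero. The helper `lr_at` is a hypothetical name.

```python
import math

BLOCK_LEN = 2048                 # tokens per training block
TRAIN_BLOCKS = 3_057_865
BLOCKS_PER_STEP = 8 * 16 * 4     # assumed: GPUs x micro-batch x grad-accum
THROUGHPUT_TOK_S = 313_000       # reported ~313 K tok/s
PEAK_LR = 3e-4
WARMUP_STEPS = 200               # assumed to be optimizer steps

total_tokens = TRAIN_BLOCKS * BLOCK_LEN
tokens_per_step = BLOCKS_PER_STEP * BLOCK_LEN
total_steps = math.ceil(TRAIN_BLOCKS / BLOCKS_PER_STEP)  # assumes last partial batch is kept
est_hours = total_tokens / THROUGHPUT_TOK_S / 3600
kept_fraction = 11.29 / 13.19    # documents kept after deduplication


def lr_at(step, min_lr=0.0):
    """Linear warmup followed by cosine decay (both shapes are assumptions)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, total_steps - WARMUP_STEPS)
    return min_lr + 0.5 * (PEAK_LR - min_lr) * (1 + math.cos(math.pi * progress))


print(f"total tokens:    {total_tokens:,}")
print(f"tokens per step: {tokens_per_step:,}")
print(f"optimizer steps: {total_steps:,}")
print(f"estimated hours: {est_hours:.2f}")
print(f"docs kept:       {kept_fraction:.1%}")
print(f"LR mid-run:      {lr_at(total_steps // 2):.2e}")
```

The throughput-based estimate lands near 5.6 hours, slightly under the reported wall-clock time of about 5 h 40 m; a gap of that size would be consistent with non-training overhead, although the card does not break this down.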

Usage

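import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

name = "stukenov/Til-Core-1B"
tok = AutoTokenizer.from_pretrained(name)
# Load in bf16 on the GPU; older transformers releases take torch_dtype= instead of dtype=.
m = AutoModelForCausalLM.from_pretrained(name, dtype=torch.bfloat16).cuda().eval()

# A base model continues the prompt, so phrase the input as the start of a text.
ids = tok("Қазақстан Республикасы — ", return_tensors="pt").input_ids.cuda()
# Sample up to 50 new tokens; the repetition penalty discourages loops.
out = m.generate(ids, max_new_tokens=50, do_sample=True,
                 temperature=0.8, top_p=0.95, repetition_penalty=1.2)
print(tok.decode(out[0], skip_special_tokens=True))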
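
For readers unfamiliar with the three sampling settings used above, here is a simplified, pure-Python illustration of what `repetition_penalty`, `temperature`, and `top_p` do to a next-token distribution. It is a toy sketch on a made-up four-token vocabulary, not the transformers implementation; all function names and the example logits are hypothetical.

```python
import math


def apply_repetition_penalty(logits, seen_ids, penalty):
    """Make already-generated tokens less likely (penalty > 1)."""
    out = list(logits)
    for i in set(seen_ids):
        # Positive logits shrink, negative logits grow more negative.
        out[i] = out[i] / penalty if out[i] > 0 else out[i] * penalty
    return out


def softmax(logits, temperature=1.0):
    """Temperature below 1 sharpens the distribution, above 1 flattens it."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                       # subtract max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of top tokens whose mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


logits = [2.0, 1.0, 0.5, -1.0]   # toy scores for token ids 0..3
seen = [0]                       # token 0 was already generated

penalized = apply_repetition_penalty(logits, seen, penalty=1.2)
probs = softmax(penalized, temperature=0.8)
nucleus = top_p_filter(probs, top_p=0.95)

for token_id, p in nucleus.items():
    print(f"token {token_id}: {p:.3f}")
```

In this toy case the least likely token falls outside the 0.95 nucleus and is never sampled, while the already-seen token stays the most likely but with reduced probability.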

Sample generations

Қазақстан Республикасы — мемлекеттік рәміздері. Жалпы білім беретін мектептің 6-сыныбына арналған оқулық…

Жасанды интеллект дегеніміз бұл адам миының эволюциясы, ойлау жүйесі мен мінез-құлқының ерекшеліктерін…

Менің Отаным — «Отан» туралы өлеңді мәнерлеп оқу… Біздің Отанымыз қалай аталады?…

Limitations

  • Base model — no instruction following, no safety alignment.
  • Single epoch on a 6.26 B-token corpus; factual reliability is limited.
  • Corpus skews toward educational / encyclopedic Kazakh text; occasional rare-token artifacts in generation.
  • Kazakh-centric; not optimized for other languages.

Roadmap

  • Til Core 1B Instruct — SFT on Kazakh instruction data (see plan in repo).
  • A smaller instruct sibling for on-device use.

Citation

@misc{tilcore1b2026,
  title  = {Til Core 1B: a Kazakh base language model with a morpheme-BPE tokenizer},
  author = {Tukenov, Saken},
  year   = {2026}
}