Til Core 1B (base)

A 1.25B-parameter Kazakh base language model, pre-trained from scratch on a deduplicated Kazakh web and text corpus. It uses a 256k-vocabulary morpheme-aware BPE tokenizer (stukenov/sozkz-morphbpe-256k-kk-v1).

This is a base (non-instruct) model: it completes text and does not follow chat instructions. An instruct version is planned (see Roadmap).

Model details


Architecture	Llama-style decoder (RoPE, RMSNorm, SwiGLU, GQA)
Parameters	1.246 B (tied input/output embeddings)
Hidden / layers	2048 / 16
Attention heads	32 query / 8 KV (GQA)
Intermediate	5632
Context length	2048
Vocab	256 000 (morpheme-BPE)
Precision	bf16
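
The 1.246 B figure can be reproduced from the dimensions in the table. The sketch below assumes a standard Llama layout that the card does not spell out: a head dimension of 2048 / 32 = 64, no bias terms, two RMSNorm weight vectors per layer plus one final norm, and a single embedding matrix shared with the output head. The function and argument names are illustrative only.

```python
# Rough parameter count for the configuration in the table above.
# Assumptions (not stated in the card): head_dim = hidden / q_heads,
# no biases, RMSNorm weights only, tied input/output embeddings.

def count_params(vocab=256_000, hidden=2048, layers=16,
                 q_heads=32, kv_heads=8, intermediate=5632):
    head_dim = hidden // q_heads            # 64
    kv_dim = kv_heads * head_dim            # 512 (GQA: fewer K/V heads)

    embed = vocab * hidden                  # counted once because of tying
    attn = 2 * hidden * hidden              # Q and output projections
    attn += 2 * hidden * kv_dim             # K and V projections
    mlp = 3 * hidden * intermediate         # SwiGLU: gate, up, down
    norms = 2 * hidden                      # two RMSNorms per layer
    per_layer = attn + mlp + norms

    total = embed + layers * per_layer + hidden   # + final RMSNorm
    return {"embed": embed, "per_layer": per_layer, "total": total}

counts = count_params()
print(f"{counts['total'] / 1e9:.3f} B")     # 1.246 B
```

Under these assumptions the embedding matrix accounts for roughly 42% of all weights, a direct consequence of the 256k vocabulary.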

Training


Tokens	6.26 B (1 epoch)
Train blocks	3 057 865 × 2048
Corpus	cleaned → MinHash-deduped (11.29 M / 13.19 M docs kept, 85.6 %)
Hardware	8 × NVIDIA H200, FSDP full-shard, bf16
Optimizer	AdamW (β 0.9/0.95, wd 0.1), cosine LR 3e-4, warmup 200
Effective batch	512 blocks (8 × 16 × grad-accum 4) ≈ 1.05 M tok/step
Throughput	~313 K tok/s
Wall-clock	~5 h 40 m
Final loss	~2.90 (train)
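
The figures in this table are mutually consistent, which the short check below illustrates. It reads "8 × 16 × grad-accum 4" as 8 GPUs with a per-GPU micro-batch of 16 blocks, and it treats the reported loss as natural-log cross-entropy; both readings are assumptions, as is rounding the step count up (the card does not say whether the last partial batch was dropped).

```python
import math

# Values copied from the Training table.
SEQ_LEN = 2048
TRAIN_BLOCKS = 3_057_865
GPUS, MICRO_BATCH, GRAD_ACCUM = 8, 16, 4    # assumed meaning of "8 x 16 x 4"
THROUGHPUT_TOK_S = 313_000
FINAL_LOSS = 2.90                           # assumed to be in nats

blocks_per_step = GPUS * MICRO_BATCH * GRAD_ACCUM             # 512
tokens_per_step = blocks_per_step * SEQ_LEN                   # 1_048_576
total_tokens = TRAIN_BLOCKS * SEQ_LEN                         # 6_262_507_520
optimizer_steps = math.ceil(TRAIN_BLOCKS / blocks_per_step)   # 5973

# Pure compute time implied by the reported throughput.
compute_hours = total_tokens / THROUGHPUT_TOK_S / 3600
# Per-token perplexity implied by the final training loss.
perplexity = math.exp(FINAL_LOSS)

print(f"{total_tokens / 1e9:.2f} B tokens, {optimizer_steps} steps")
print(f"{compute_hours:.2f} h at reported throughput, perplexity {perplexity:.1f}")
```

The implied compute time is a little under the reported wall-clock of about 5 h 40 m; the difference plausibly covers startup, checkpointing, and logging, though the card does not break it down.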

Usage

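import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

name = "stukenov/Til-Core-1B"
tok = AutoTokenizer.from_pretrained(name)
# Load in bf16 (the training precision), move to the GPU, and switch to eval mode.
# Older transformers releases spell this keyword torch_dtype instead of dtype.
m = AutoModelForCausalLM.from_pretrained(name, dtype=torch.bfloat16).cuda().eval()

# Base model: pass text to be continued, not an instruction.
ids = tok("Қазақстан Республикасы — ", return_tensors="pt").input_ids.cuda()
# Nucleus sampling; a repetition_penalty above 1 discourages repeated tokens.
out = m.generate(ids, max_new_tokens=50, do_sample=True,
                 temperature=0.8, top_p=0.95, repetition_penalty=1.2)
print(tok.decode(out[0], skip_special_tokens=True))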

Sample generations

Қазақстан Республикасы — мемлекеттік рәміздері. Жалпы білім беретін мектептің 6-сыныбына арналған оқулық…

Жасанды интеллект дегеніміз бұл адам миының эволюциясы, ойлау жүйесі мен мінез-құлқының ерекшеліктерін…

Менің Отаным — «Отан» туралы өлеңді мәнерлеп оқу… Біздің Отанымыз қалай аталады?…

Limitations

Base model: no instruction following and no safety alignment.
Single epoch on a 6.26 B-token corpus; factual reliability is limited.
Corpus skews toward educational / encyclopedic Kazakh text; occasional rare-token artifacts in generation.
Kazakh-centric; not optimized for other languages.

Roadmap

Til Core 1B Instruct — SFT on Kazakh instruction data (see plan in repo).
A smaller instruct sibling for on-device use.

Citation

@misc{tilcore1b2026,
  title  = {Til Core 1B: a Kazakh base language model with a morpheme-BPE tokenizer},
  author = {Tukenov, Saken},
  year   = {2026}
}
