Darmm Text Generation Kazakh (collection of text generation models for the Kazakh language, 2 items)
Darmm Kazakh v2 (Darmm/darmm-text-generation-kazakh-v2) is a significantly improved iteration of our Kazakh language model series. While v1 was built on synthetic data and mT5, v2 is fine-tuned on organic, real-world data on top of the Qwen2.5-Coder-7B-Instruct (7B) base model.
This model is fine-tuned to understand and generate high-quality Kazakh text, with a focus on news articles, encyclopedic explanations, and general instruction following.
| Feature | v1 (Old) | v2 (New) |
|---|---|---|
| Base Model | google/mt5-base (580M) | Qwen/Qwen2.5-Coder-7B-Instruct (7B) |
| Data Source | Synthetic / Templates | Organic (Wikipedia + Egemen Qazaqstan) |
| Dataset Size | ~5,000 synthetic pairs | ~5,000 real articles (4895 samples) |
| Training | 3 Epochs (CPU/Small GPU) | 2.29 Epochs (A100 80GB, QLoRA) |
| Focus | Short structured answers | Long-form content generation |
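The exact QLoRA hyperparameters are not published in this card; below is a minimal sketch of a typical QLoRA setup for a 7B base model on a single A100 80GB. All values (rank, alpha, dropout, target modules) are illustrative assumptions, not the recipe actually used for training.

```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# 4-bit NF4 quantization of the frozen base weights (the "Q" in QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

# Low-rank adapters on the attention projections; rank/alpha are placeholders
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
```

Only the small adapter matrices are trained; the quantized base stays frozen, which is what makes a 7B fine-tune fit comfortably on one A100.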
```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 1. Config for 4-bit loading (efficient inference on a single GPU)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
)

# 2. Load the base model
base_model_name = "Qwen/Qwen2.5-Coder-7B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    quantization_config=bnb_config,
    device_map="auto",
)

# 3. Load the Darmm v2 adapter
adapter_name = "Darmm/darmm-text-generation-kazakh-v2"
model = PeftModel.from_pretrained(model, adapter_name)
tokenizer = AutoTokenizer.from_pretrained(base_model_name)

# 4. Generate (do_sample=True so the temperature setting takes effect)
prompt_text = "Жасанды интеллект туралы мақала жаз."  # "Write an article about AI."
prompt = f"### Instruction:\n{prompt_text}\n\n### Response:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, temperature=0.7, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True).split("### Response:")[1])
```
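The Alpaca-style prompt template and response extraction used above can be factored into small helpers, which makes the format easy to reuse and test. The function names here are illustrative, not part of the released code:

```python
def build_prompt(instruction: str) -> str:
    """Wrap a user instruction in the Alpaca-style template the adapter was trained on."""
    return f"### Instruction:\n{instruction}\n\n### Response:\n"

def extract_response(decoded: str) -> str:
    """Return only the model's answer from the full decoded sequence."""
    # Split once on the response marker and keep everything after it.
    return decoded.split("### Response:", 1)[-1].strip()

# Simulated round trip: the decoded output echoes the prompt, then the answer
decoded = build_prompt("Жасанды интеллект туралы мақала жаз.") + "Жасанды интеллект..."
print(extract_response(decoded))  # Жасанды интеллект...
```

Keeping the template in one place matters: the adapter only saw this exact `### Instruction:` / `### Response:` framing during fine-tuning, so any drift in the prompt format degrades output quality.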
The model was trained on a curated dataset scraped from high-quality Kazakh sources (Wikipedia and the Egemen Qazaqstan newspaper). Training loss and learning rate over the run:
| Step | Loss | Learning Rate |
|---|---|---|
| 10 | 2.25 | 4e-5 |
| 100 | 1.13 | 1.99e-4 |
| 500 | 0.94 | 1.90e-4 |
| 1000 | 0.83 | 1.55e-4 |
| 1500 | 0.73 | 9.41e-5 |
| 1650 | 0.74 | 2.29e-5 |
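The logged learning rates are roughly consistent with linear warmup to a peak of about 2e-4 followed by cosine decay. Here is a minimal sketch of such a schedule; the warmup length and total step count are assumptions for illustration, not logged values:

```python
import math

def lr_at(step: int, peak_lr: float = 2e-4,
          warmup_steps: int = 100, total_steps: int = 1700) -> float:
    """Linear warmup to peak_lr, then cosine decay toward zero."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

print(f"{lr_at(100):.2e}")   # 2.00e-04 (peak at the end of warmup)
print(f"{lr_at(1650):.2e}")  # near zero at the end of training
```

This mirrors the standard warmup-plus-cosine scheduler used by the `transformers` Trainer; the steady loss decrease alongside the decaying rate suggests the run converged cleanly.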
```bibtex
@misc{darmm_kazakh_v2,
  author       = {Darmm Lab},
  title        = {Darmm Text Generation Kazakh v2: Organic Data Scale-up},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Darmm/darmm-text-generation-kazakh-v2}}
}
```