Darmm Text Generation Kazakh v2

Darmm Kazakh v2 (Darmm/darmm-text-generation-kazakh-v2) is a significantly improved iteration of our Kazakh language model series. Where v1 was built on synthetic data and mT5, v2 is trained on organic, real-world data on top of the 7B-parameter Qwen 2.5 architecture.

This model is fine-tuned to understand and generate high-quality Kazakh text, with a focus on news articles, encyclopedic explanations, and general instruction following.

Key Improvements vs v1

| Feature | v1 (Old) | v2 (New) |
|---|---|---|
| Base Model | google/mt5-base (580M) | Qwen/Qwen2.5-Coder-7B-Instruct (7B) |
| Data Source | Synthetic / templates | Organic (Wikipedia + Egemen Qazaqstan) |
| Dataset Size | ~5,000 synthetic pairs | ~5,000 real articles (4,895 samples) |
| Training | 3 epochs (CPU / small GPU) | 2.29 epochs (A100 80GB, QLoRA) |
| Focus | Short structured answers | Long-form content generation |

Usage

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 1. Config for 4-bit loading (Efficient)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4"
)

# 2. Load Base Model
base_model_name = "Qwen/Qwen2.5-Coder-7B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    quantization_config=bnb_config,
    device_map="auto"
)

# 3. Load Darmm v2 Adapter
adapter_name = "Darmm/darmm-text-generation-kazakh-v2"
model = PeftModel.from_pretrained(model, adapter_name)
tokenizer = AutoTokenizer.from_pretrained(base_model_name)

# 4. Generate
prompt_text = "Жасанды интеллект туралы мақала жаз." # Write an article about AI
prompt = f"### Instruction:\n{prompt_text}\n\n### Response:\n"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True).split("### Response:")[1].strip())
```
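
If you prefer to serve the model without loading the adapter at runtime, the LoRA weights can also be merged into an unquantized copy of the base model. A minimal sketch (the output directory name is illustrative):

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model unquantized in half precision (merging requires full weights)
base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-Coder-7B-Instruct",
    torch_dtype=torch.float16,
    device_map="auto",
)

# Attach the Darmm v2 adapter and fold its weights into the base model
merged = PeftModel.from_pretrained(base, "Darmm/darmm-text-generation-kazakh-v2")
merged = merged.merge_and_unload()

# Save a standalone checkpoint that no longer needs PEFT at load time
merged.save_pretrained("darmm-kazakh-v2-merged")  # output path is illustrative
AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-7B-Instruct").save_pretrained("darmm-kazakh-v2-merged")
```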

Model Description

  • Developed by: Darmm Lab
  • Language: Kazakh (kk)
  • Base Model: Qwen/Qwen2.5-Coder-7B-Instruct
  • Fine-tuning Method: QLoRA (4-bit quantization with LoRA adapters); an illustrative adapter config is sketched after this list
  • Context Length: 1024 tokens
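
The card does not publish the adapter hyperparameters. The sketch below shows a typical QLoRA adapter configuration for a 7B Qwen model; the rank, alpha, dropout, and target modules are assumptions, not the released values.

```python
from peft import LoraConfig

# Hypothetical adapter settings (illustrative only; not published on this card)
lora_config = LoraConfig(
    r=16,                     # assumed LoRA rank
    lora_alpha=32,            # assumed scaling factor
    lora_dropout=0.05,        # assumed dropout
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # common choice for Qwen attention blocks
)
```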

Training Details

The model was trained on a curated dataset scraped from high-quality Kazakh sources (a sketch of the instruction formatting used for these articles follows the list):

  • Wikipedia (kk): Encyclopedic knowledge, definitions, biographies.
  • Egemen Qazaqstan: Formal news style, economic and political vocabulary.
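
Articles from these sources were paired with instructions and rendered with the same template used in the Usage example above. A minimal sketch; the helper name and sample fields are illustrative, not the actual preprocessing code.

```python
def format_example(instruction: str, article_text: str) -> str:
    """Render one article as a training sample in the card's Instruction/Response template."""
    return f"### Instruction:\n{instruction}\n\n### Response:\n{article_text}"

# Hypothetical sample; the real dataset fields are not published
sample = {
    "instruction": "Жасанды интеллект туралы мақала жаз.",  # "Write an article about AI"
    "text": "Жасанды интеллект – ...",
}
print(format_example(sample["instruction"], sample["text"]))
```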

Hyperparameters

  • Epochs: 2.29
  • Batch size: 1 (Gradient Accumulation: 8) -> Effective Batch Size: 8
  • Learning rate: 2e-4
  • Scheduler: Cosine
  • Optimizer: AdamW
  • Hardware: NVIDIA A100 80GB
  • Final Loss: 0.7407
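
The listed hyperparameters map roughly onto the following transformers TrainingArguments; values not stated on the card (output directory, precision, warmup, logging cadence) are assumptions.

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="darmm-kazakh-v2",      # illustrative
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,     # effective batch size of 8
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    optim="adamw_torch",
    num_train_epochs=2.29,             # training stopped partway through the third epoch
    fp16=True,                         # assumption: half-precision compute, matching the 4-bit compute dtype above
    logging_steps=10,                  # assumption
)
```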

Training Loss

| Step | Loss | Learning Rate |
|---|---|---|
| 10 | 2.25 | 4e-5 |
| 100 | 1.13 | 1.99e-4 |
| 500 | 0.94 | 1.90e-4 |
| 1000 | 0.83 | 1.55e-4 |
| 1500 | 0.73 | 9.41e-5 |
| 1650 | 0.74 | 2.29e-5 |

Intended Use

  • Content Generation: Writing articles, summaries, and explanations in Kazakh.
  • Education: Generating study materials or answering questions about Kazakh history/culture.
  • Research: Baseline for further fine-tuning on specialized Kazakh domains (legal, medical); a continued fine-tuning sketch follows this list.
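
For the research use case, the released adapter can be loaded as a trainable starting point instead of initializing a fresh LoRA. A brief sketch (dataset preparation and the trainer loop are omitted):

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
)

base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-Coder-7B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)

# is_trainable=True keeps the adapter weights unfrozen for further fine-tuning
model = PeftModel.from_pretrained(
    base, "Darmm/darmm-text-generation-kazakh-v2", is_trainable=True
)
model.print_trainable_parameters()
```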

Limitations

  • Hallucination: As with all LLMs, the model may generate factually incorrect information despite linguistic fluency.
  • English Bias: In rare cases of confusion, the model may revert to English or produce code-mixed output, a consequence of the base model's predominantly English pre-training.

Citation

@misc{darmm_kazakh_v2,
  author = {Darmm Lab},
  title = {Darmm Text Generation Kazakh v2: Organic Data Scale-up},
  year = {2026},
  publisher = {Hugging Face},
  journal = {Hugging Face Repository},
  howpublished = {\url{https://huggingface.co/Darmm/darmm-text-generation-kazakh-v2}}
}